Preprocessing fails in Perform mode #801

@mezis

Description

Preprocessing fails on text data in Perform mode; the same data trains without error in Explain mode.
The failure shows up with Xgboost in particular, but is not limited to it.

Console output:

The task is binary_classification with evaluation metric average_precision
AutoML will use algorithms: ['Xgboost']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'ensemble']
Skip simple_algorithms because no parameters were generated.
* Step default_algorithms will try to check up to 1 model
There was an error during 1_Default_Xgboost training.
Please check automl_results/signup_1hour-profile_changes_v2-bad_actors_v3-20250621-20250721/Perform/errors.md for details.

Error file contents:

## Error for 1_Default_Xgboost

Found array with 0 sample(s) (shape=(0, 100)) while a minimum of 1 is required by TfidfTransformer.
Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/supervised/base_automl.py", line 1183, in _fit
    trained = self.train_model(params)
  File ".venv/lib/python3.10/site-packages/supervised/base_automl.py", line 391, in train_model
    self.keep_model(mf, model_subpath)
  File ".venv/lib/python3.10/site-packages/supervised/base_automl.py", line 294, in keep_model
    self._base_predict(self._one_sample, model)
  File ".venv/lib/python3.10/site-packages/supervised/base_automl.py", line 1474, in _base_predict
    predictions = model.predict(X)
  File ".venv/lib/python3.10/site-packages/supervised/model_framework.py", line 447, in predict
    X_data, _, _ = self.preprocessings[ind].transform(X.copy(), None)
  File ".venv/lib/python3.10/site-packages/supervised/preprocessing/preprocessing.py", line 360, in transform
    X_validation = tt.transform(X_validation)
  File ".venv/lib/python3.10/site-packages/supervised/preprocessing/text_transformer.py", line 36, in transform
    vect = self._vectorizer.transform(x)
  File ".venv/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 2129, in transform
    return self._tfidf.transform(X, copy=False)
  File ".venv/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1700, in transform
    X = validate_data(
  File ".venv/lib/python3.10/site-packages/sklearn/utils/validation.py", line 2954, in validate_data
    out = check_array(X, input_name="X", **check_params)
  File ".venv/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1128, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 100)) while a minimum of 1 is required by TfidfTransformer.


Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new
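
The last frames of the traceback can be reproduced with scikit-learn alone, which may help separate the library behaviour from the AutoML pipeline. The sketch below assumes that the text column reaches the fitted vectorizer with zero rows; the traceback shows the failing `predict` call is made on `self._one_sample`, but why that sample ends up empty is not established here. The corpus, the function name `transform_empty_batch`, and `max_features=100` are illustrative and not taken from the issue's data or from mljar-supervised.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def transform_empty_batch():
    """Fit a TF-IDF vectorizer, then ask it to transform zero documents.

    Returns the ValueError message, or None if no error is raised.
    """
    # Illustrative corpus; not data from the issue.
    corpus = [
        "new account signup",
        "profile changed twice",
        "signup from new device",
    ]
    vectorizer = TfidfVectorizer(max_features=100)
    vectorizer.fit(corpus)
    try:
        # An empty batch, as would result if every row were dropped upstream
        # (assumption: this is what happens to the single sample).
        vectorizer.transform([])
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(transform_empty_batch())
```

If this assumption holds, the number of columns in the reported shape will differ from the `(0, 100)` in the error above, since it depends on the fitted vocabulary size.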

In Explain mode things are fine:

AutoML directory: automl_results/signup_1hour-profile_changes_v2-bad_actors_v3-20250621-20250721/Explain
The task is binary_classification with evaluation metric average_precision
AutoML will use algorithms: ['Xgboost']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
Skip simple_algorithms because no parameters were generated.
* Step default_algorithms will try to check up to 1 model
1_Default_Xgboost average_precision 0.802009 trained in 31.98 seconds
* Step ensemble will try to check up to 1 model
AutoML fit time: 35.37 seconds
AutoML best model: 1_Default_Xgboost
2025-07-24 09:14:25,482 automl_trainer.train INFO Evaluating model on test set...
2025-07-24 09:14:27,847 automl_trainer.train INFO Test Accuracy: 0.9549
2025-07-24 09:14:29,984 automl_trainer.train INFO Test PR AUC (Average Precision): 0.8071
2025-07-24 09:14:29,989 automl_trainer.train INFO Test ROC AUC: 0.9621
2025-07-24 09:14:29,989 automl_trainer.train INFO Training optimized for: average_precision

Versions:

$ pip list | grep -E "(mljar-supervised|pandas|scikit-learn|numpy|click)"
click              8.2.1
mljar-supervised   1.1.18
numpy              1.26.4
pandas             2.3.1
scikit-learn       1.7.1

Metadata

Labels: bug (Something isn't working)
