Preprocessing fails in Perform mode

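Preprocessing seems to fail on text data: the text transformer's TF-IDF step receives an array with 0 samples (see the traceback below).
This happens only in Perform mode, not in Explain mode. I see it with Xgboost in particular, but not only with that algorithm.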

Console output:

```
The task is binary_classification with evaluation metric average_precision
AutoML will use algorithms: ['Xgboost']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'ensemble']
Skip simple_algorithms because no parameters were generated.
* Step default_algorithms will try to check up to 1 model
There was an error during 1_Default_Xgboost training.
Please check automl_results/signup_1hour-profile_changes_v2-bad_actors_v3-20250621-20250721/Perform/errors.md for details.
```

Error file contents:

```
## Error for 1_Default_Xgboost

Found array with 0 sample(s) (shape=(0, 100)) while a minimum of 1 is required by TfidfTransformer.
Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/supervised/base_automl.py", line 1183, in _fit
    trained = self.train_model(params)
  File ".venv/lib/python3.10/site-packages/supervised/base_automl.py", line 391, in train_model
    self.keep_model(mf, model_subpath)
  File ".venv/lib/python3.10/site-packages/supervised/base_automl.py", line 294, in keep_model
    self._base_predict(self._one_sample, model)
  File ".venv/lib/python3.10/site-packages/supervised/base_automl.py", line 1474, in _base_predict
    predictions = model.predict(X)
  File ".venv/lib/python3.10/site-packages/supervised/model_framework.py", line 447, in predict
    X_data, _, _ = self.preprocessings[ind].transform(X.copy(), None)
  File ".venv/lib/python3.10/site-packages/supervised/preprocessing/preprocessing.py", line 360, in transform
    X_validation = tt.transform(X_validation)
  File ".venv/lib/python3.10/site-packages/supervised/preprocessing/text_transformer.py", line 36, in transform
    vect = self._vectorizer.transform(x)
  File ".venv/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 2129, in transform
    return self._tfidf.transform(X, copy=False)
  File ".venv/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1700, in transform
    X = validate_data(
  File ".venv/lib/python3.10/site-packages/sklearn/utils/validation.py", line 2954, in validate_data
    out = check_array(X, input_name="X", **check_params)
  File ".venv/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1128, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 100)) while a minimum of 1 is required by TfidfTransformer.


Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new
```
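The last frames of the traceback can be reproduced outside mljar-supervised. The sketch below is a minimal illustration, assuming only scikit-learn: it fits a `TfidfVectorizer` on a few made-up documents and then calls `transform` on an empty batch. The sample documents and `max_features=100` are assumptions chosen to resemble the `(0, 100)` shape in the error, not values taken from the library. It shows only that an empty input triggers this `ValueError`; it does not explain why the input is empty in Perform mode.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def transform_empty_batch():
    """Fit a TF-IDF vectorizer, then transform zero documents.

    Returns the error message raised by scikit-learn, or None if
    no error was raised.
    """
    # Hypothetical training documents, for illustration only.
    docs = [
        "user changed profile name",
        "user changed profile picture",
        "new signup from unknown device",
    ]
    # max_features=100 is an assumption that mirrors the 100 columns
    # reported in the error message.
    vectorizer = TfidfVectorizer(max_features=100)
    vectorizer.fit(docs)

    try:
        # An empty batch becomes a sparse matrix with 0 rows, which
        # the internal TfidfTransformer rejects during validation.
        vectorizer.transform([])
    except ValueError as exc:
        return str(exc)
    return None


message = transform_empty_batch()
print(message)
```

If this matches what happens inside `text_transformer.py`, the question becomes why the data passed to `_base_predict` has no rows by the time it reaches the text transformer in Perform mode.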

In Explain mode, training completes without errors:

```
AutoML directory: automl_results/signup_1hour-profile_changes_v2-bad_actors_v3-20250621-20250721/Explain
The task is binary_classification with evaluation metric average_precision
AutoML will use algorithms: ['Xgboost']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
Skip simple_algorithms because no parameters were generated.
* Step default_algorithms will try to check up to 1 model
1_Default_Xgboost average_precision 0.802009 trained in 31.98 seconds
* Step ensemble will try to check up to 1 model
AutoML fit time: 35.37 seconds
AutoML best model: 1_Default_Xgboost
2025-07-24 09:14:25,482 automl_trainer.train INFO Evaluating model on test set...
2025-07-24 09:14:27,847 automl_trainer.train INFO Test Accuracy: 0.9549
2025-07-24 09:14:29,984 automl_trainer.train INFO Test PR AUC (Average Precision): 0.8071
2025-07-24 09:14:29,989 automl_trainer.train INFO Test ROC AUC: 0.9621
2025-07-24 09:14:29,989 automl_trainer.train INFO Training optimized for: average_precision
```


Versions:

```
$ pip list | grep -E "(mljar-supervised|pandas|scikit-learn|numpy|click)"
click              8.2.1
mljar-supervised   1.1.18
numpy              1.26.4
pandas             2.3.1
scikit-learn       1.7.1
```
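For anyone trying to reproduce this, the same version information can also be collected from inside Python. This is a small sketch using the standard library's `importlib.metadata`; the helper name `collect_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Same packages as in the pip command above.
PACKAGES = ["mljar-supervised", "pandas", "scikit-learn", "numpy", "click"]


def collect_versions(packages):
    """Map each distribution name to its installed version string."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Keep missing packages in the report instead of failing.
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in collect_versions(PACKAGES).items():
        print(f"{name:<18} {ver}")
```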


Preprocessing fails in Perform mode #801
