
Speed up python tests #11975

Merged
RAMitchell merged 17 commits into dmlc:master from RAMitchell:slow_python_tests
Feb 5, 2026

Conversation

@RAMitchell
Member

No description provided.

@RAMitchell RAMitchell marked this pull request as ready for review January 30, 2026 09:58
@RAMitchell
Member Author

I spent a bit of time on the CPU dask tests. Fixtures weren't being reused; now the tests are 3 times faster on my machine.

Contributor

Copilot AI left a comment

Pull request overview

This PR speeds up Python tests by introducing session-scoped Dask client fixtures and reducing test data sizes. The changes eliminate redundant cluster creation/teardown operations across tests, while maintaining test coverage and validity.

Changes:

  • Introduced shared session-scoped Dask client fixtures in a new conftest.py file to avoid repeated cluster setup
  • Reduced test data sizes and iteration counts (e.g., 10000→3000 samples, 16→8 boost rounds, hypothesis max_examples reduced)
  • Updated test functions to use shared fixtures instead of creating their own clusters
  • Enhanced npstr_to_arrow_strarr in _data_utils.py to handle more array types robustly

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:

  • tests/test_distributed/test_with_dask/conftest.py: New file with session-scoped client fixtures to share Dask clusters across tests
  • tests/test_distributed/test_with_dask/test_with_dask.py: Removed local fixtures, updated tests to use shared clients, reduced data sizes, improved compatibility with sklearn versions
  • tests/test_distributed/test_with_dask/test_ranking.py: Removed duplicate fixtures, adjusted training parameters for faster convergence
  • tests/python/test_updaters.py: Reduced hypothesis test example counts for faster execution
  • tests/python/test_ranking.py: Reduced hypothesis test example counts for faster execution
  • python-package/xgboost/testing/updater.py: Reduced boost rounds for faster test execution
  • python-package/xgboost/testing/dask.py: Reduced feature counts and partition sizes for faster test execution
  • python-package/xgboost/_data_utils.py: Enhanced string array handling to support more input types (pandas, byte strings)


import socket
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import ExitStack

Copilot AI Jan 30, 2026


The ExitStack import is unused and should be removed.

Suggested change
from contextlib import ExitStack

RAMitchell and others added 3 commits February 2, 2026 10:21
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@RAMitchell
Member Author

I don't think I've made much of a dent in overall CI time, but it's something.

@trivialfis
Member

Please let me know if the PR is ready.

Thus far, the time for CPU tests is mostly spent on the pyspark tests: https://github.com/dmlc/xgboost/actions/runs/21631091583/job/62344616444

For GPU tests, I need to reduce the time on cuDF tests with the ordinal re-coder. cuDF is quite slow when the number of columns is large.

@RAMitchell
Member Author

Yes, it's ready.

Maybe we can also improve the fixturing for spark.

21.06s call     test_distributed/test_with_spark/test_spark_local.py::TestPySparkLocal::test_classifier_with_cross_validator
19.48s call     test_distributed/test_with_spark/test_spark_local_cluster.py::XgboostLocalClusterTestCase::test_classifier_distributed_weight_eval
18.90s call     test_distributed/test_with_spark/test_spark_local.py::TestPySparkLocal::test_classifier_with_weight_eval
18.55s call     test_distributed/test_with_spark/test_spark_local_cluster.py::XgboostLocalClusterTestCase::test_regressor_distributed_weight_eval
18.47s call     test_distributed/test_with_spark/test_spark_local.py::XgboostLocalTest::test_classifier_with_base_margin
18.39s call     test_distributed/test_with_spark/test_spark_local.py::TestPySparkLocal::test_regressor_with_weight_eval
17.65s call     test_distributed/test_with_spark/test_spark_local.py::XgboostLocalTest::test_train_with_initial_model
16.47s setup    test_distributed/test_with_spark/test_spark_local_cluster.py::XgboostLocalClusterTestCase::test_classifier_distributed_basic
13.64s call     test_distributed/test_with_spark/test_spark_local_cluster.py::TestPySparkLocalCluster::test_regressor_basic_with_params
12.88s call     test_distributed/test_with_spark/test_spark_local.py::XgboostLocalTest::test_empty_validation_data
12.82s call     test_distributed/test_with_spark/test_spark_local_cluster.py::TestPySparkLocalCluster::test_callbacks
12.41s call     test_distributed/test_with_spark/test_spark_local.py::XgboostLocalTest::test_classifier_with_sparse_optim
11.97s call     test_distributed/test_with_spark/test_spark_local.py::XgboostLocalTest::test_regressor_with_sparse_optim
11.36s call     test_distributed/test_with_spark/test_spark_local.py::XgboostLocalTest::test_empty_partition
10.22s call     test_distributed/test_with_spark/test_spark_local_cluster.py::XgboostLocalClusterTestCase::test_classifier_distributed_basic
9.66s call     test_distributed/test_with_spark/test_spark_local.py::TestPySparkLocal::test_regressor_basic
8.54s call     test_distributed/test_with_spark/test_spark_local.py::TestPySparkLocal::test_gpu_transform
7.71s call     test_distributed/test_with_spark/test_spark_local.py::TestPySparkLocal::test_classifier_model_pipeline_save_load
7.52s call     test_distributed/test_with_spark/test_spark_local.py::TestPySparkLocal::test_classifier_model_save_load
7.42s call     test_distributed/test_with_spark/test_spark_local.py::TestPySparkLocal::test_regressor_model_pipeline_save_load


def npstr_to_arrow_strarr(strarr: np.ndarray) -> Tuple[np.ndarray, str]:
"""Convert a numpy string array to an arrow string array."""
def npstr_to_arrow_strarr(strarr: Any) -> Tuple[np.ndarray, str]:
Member


What's being changed here?

Member Author


With my dependencies, strarr was not an np.ndarray but rather an arrow data structure.

Member


Okay, you are using pandas 3.0 and have pyarrow in your environment.
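The kind of robustness discussed in this thread can be sketched with a small helper. This is a hypothetical illustration (the name `to_numpy_str` and its logic are not xgboost's actual `npstr_to_arrow_strarr`): it normalizes numpy str/bytes arrays and non-ndarray inputs such as pandas' pyarrow-backed string arrays into a uniform numpy array of Python str.

```python
import numpy as np


def to_numpy_str(strarr) -> np.ndarray:
    """Normalize assorted string-array inputs to a numpy object array of str.

    Hypothetical helper illustrating the input types mentioned in the
    review: plain numpy str arrays, byte-string arrays, and
    pandas/pyarrow-backed arrays that are not np.ndarray instances.
    """
    # Inputs that are not already ndarrays (e.g. an arrow-backed pandas
    # array under pandas 3.0) are converted first.
    if not isinstance(strarr, np.ndarray):
        strarr = np.asarray(strarr, dtype=object)
    # Byte strings are decoded so downstream code sees uniform str.
    return np.array(
        [s.decode("utf-8") if isinstance(s, bytes) else str(s) for s in strarr],
        dtype=object,
    )
```

A call like `to_numpy_str(np.array([b"a", b"b"]))` yields an object array of `"a"` and `"b"`, regardless of whether the input carried bytes or str.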

import socket
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import ExitStack
Member


ExitStack and Generator imports are not used.



@RAMitchell RAMitchell merged commit f816eae into dmlc:master Feb 5, 2026
78 checks passed
