Conversation
I spent a bit of time on the CPU Dask tests. Fixtures weren't being reused, so now the tests are 3 times faster on my machine.
Pull request overview
This PR speeds up Python tests by introducing session-scoped Dask client fixtures and reducing test data sizes. The changes eliminate redundant cluster creation/teardown operations across tests, while maintaining test coverage and validity.
Changes:
- Introduced shared session-scoped Dask client fixtures in a new `conftest.py` file to avoid repeated cluster setup (see the sketch after this list)
- Reduced test data sizes and iteration counts (e.g., 10000→3000 samples, 16→8 boost rounds, reduced hypothesis `max_examples`)
- Updated test functions to use shared fixtures instead of creating their own clusters
- Enhanced `npstr_to_arrow_strarr` in `_data_utils.py` to handle more array types robustly
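For context, here is a minimal sketch of what a session-scoped Dask client fixture can look like; the fixture name `client` and the cluster sizing are illustrative assumptions, not the exact code from this PR:

```python
# Illustrative sketch only; fixture name and cluster parameters are assumptions.
import pytest
from distributed import Client, LocalCluster


@pytest.fixture(scope="session")
def client():
    # One cluster for the whole test session instead of one per test,
    # which removes the repeated setup/teardown cost.
    with LocalCluster(n_workers=2, threads_per_worker=2) as cluster:
        with Client(cluster) as c:
            yield c


def test_uses_shared_client(client):
    # Tests simply request the fixture instead of building their own cluster.
    assert client.scheduler_info()["workers"]
```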
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| tests/test_distributed/test_with_dask/conftest.py | New file with session-scoped client fixtures to share Dask clusters across tests |
| tests/test_distributed/test_with_dask/test_with_dask.py | Removed local fixtures, updated tests to use shared clients, reduced data sizes, improved compatibility with sklearn versions |
| tests/test_distributed/test_with_dask/test_ranking.py | Removed duplicate fixtures, adjusted training parameters for faster convergence |
| tests/python/test_updaters.py | Reduced hypothesis test example counts for faster execution |
| tests/python/test_ranking.py | Reduced hypothesis test example counts for faster execution |
| python-package/xgboost/testing/updater.py | Reduced boost rounds for faster test execution |
| python-package/xgboost/testing/dask.py | Reduced feature counts and partition sizes for faster test execution |
| python-package/xgboost/_data_utils.py | Enhanced string array handling to support more input types (pandas, byte strings) |
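As an aside, the hypothesis knob tuned in the `test_updaters.py` and `test_ranking.py` rows above is `max_examples`; a generic illustration follows (the test body is a placeholder, not a test from this PR):

```python
# Generic illustration; not a test from the PR.
from hypothesis import given, settings, strategies as st


@settings(max_examples=10)  # fewer generated examples -> faster run
@given(st.integers(min_value=1, max_value=100))
def test_placeholder(n):
    assert 1 <= n <= 100
```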
```python
import socket
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import ExitStack
```
The ExitStack import is unused and should be removed.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I don't think I've made much of a dent in overall CI time, but it's something.
Please let me know if the PR is ready. Thus far, the time for CPU tests is mostly spent on the PySpark tests: https://github.com/dmlc/xgboost/actions/runs/21631091583/job/62344616444 For GPU tests, I need to reduce the time on cuDF tests with the ordinal re-coder; cuDF is quite slow when the number of columns is large.
Yes, it's ready. Maybe we can also improve the fixturing for Spark.
```diff
-def npstr_to_arrow_strarr(strarr: np.ndarray) -> Tuple[np.ndarray, str]:
+def npstr_to_arrow_strarr(strarr: Any) -> Tuple[np.ndarray, str]:
     """Convert a numpy string array to an arrow string array."""
```
With my dependencies, `strarr` was not an `np.ndarray` but rather an Arrow data structure.
Okay, you are using pandas 3.0 and have pyarrow in your environment.
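For readers following along, here is a hedged sketch of the kind of input normalization this change implies; `as_numpy_strings` is a hypothetical helper, not the actual `_data_utils.py` code:

```python
from typing import Any

import numpy as np


def as_numpy_strings(strarr: Any) -> np.ndarray:
    # Hypothetical helper: funnel pandas Series, Arrow arrays, and byte
    # strings through numpy before any Arrow conversion. np.asarray works
    # here because pandas and pyarrow containers implement __array__.
    arr = np.asarray(strarr)
    if arr.dtype.kind == "S":
        # Byte strings -> unicode so downstream code sees str values.
        arr = arr.astype("U")
    return arr
```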
```python
import socket
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import ExitStack
```
ExitStack and Generator imports are not used.