
Commit 5aa2d17

Add full test suite workflow (#7795)
The full unit test workflow has been disabled for a while. This PR migrates the full test suite to our AWS test infra. To make the tests pass, the following PRs need to be merged first:

- #7786
- #7788
- #7789
- #7790
- #7793
- #7794

In addition to merging those PRs, this PR makes the following changes to the full test workflow and test harness:

- Ignore flags for some known issues:
  - nvme: requires an actual NVMe device; our CI currently doesn't have NVMe storage configured.
  - GDS: requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers; CI instances don't have this configured.
  - ZenFlow:
    1. Stage 3 bugs: the ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
    2. CUDA/fork incompatibility: test_zf_torch_adam.py uses torch.optim.AdamW, which performs CUDA graph capture checks that fail in forked processes (the --forked flag); we can simply move it to the sequential tests (see the sketch below).
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker cleanup hangs

Once this PR is merged, we will be able to run the full test suite manually or on a schedule.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
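On the "move it to sequential tests" point above: as a minimal sketch (not part of this commit; the test name and body are illustrative only), a test is routed to the workflow's second pytest invocation (-m 'sequential') instead of the parallel -n 8 pass by applying the sequential marker that the workflow header mentions:

import pytest


@pytest.mark.sequential
def test_runs_in_sequential_pass():
    # Illustrative placeholder; a real test would exercise the affected code path
    # (e.g. the ZenFlow torch.optim.AdamW optimizer) here.
    assert True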
1 parent 09885ef commit 5aa2d17

File tree

2 files changed: +142 -0 lines changed
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
################################################################################
# DeepSpeed CI - AWS L40S GPU Full Tests (PyTorch Latest)
#
# Runs the full DeepSpeed unit test suite on AWS self-hosted runners.
# Uses 4x NVIDIA L40S GPUs on g6e.12xlarge instances.
#
# This workflow runs:
# - Parallel tests with pytest-xdist (-n 8)
# - Sequential tests marked with @pytest.mark.sequential
################################################################################

name: aws-torch-latest-full

on:
  workflow_dispatch:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  unit-tests:
    name: Unit Tests (Full)
    runs-on: [self-hosted, gpu-ci, gpu-l40s, l40s-4gpu, aws]
    timeout-minutes: 180

    container:
      image: nvidia/cuda:12.6.3-devel-ubuntu22.04
      # Mount /mnt/aio for async I/O tests (O_DIRECT requires native filesystem, not overlayfs)
      options: --gpus all --shm-size "32G" -v /mnt/aio:/mnt/aio

    env:
      TORCH_VER: "2.7"
      CUDA_VER: "12.6"
      CUTLASS_PATH: /opt/cutlass
      # Disable reuse_dist_env to prevent pool worker cleanup hangs in full test runs
      DS_DISABLE_REUSE_DIST_ENV: "1"

    steps:
      - name: Install system dependencies
        run: |
          apt-get update && apt-get install -y git git-lfs libaio-dev pdsh python3 python3-pip
          git lfs install
          ln -sf /usr/bin/python3 /usr/bin/python

      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          lfs: true

      - name: Install CUTLASS
        run: |
          git clone --depth 1 --branch v3.5.1 https://github.com/NVIDIA/cutlass.git /opt/cutlass
          echo "CUTLASS installed at /opt/cutlass"
          ls -la /opt/cutlass/include/ | head -10

      - name: Install PyTorch
        run: |
          pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu126

      - name: Install transformers
        run: |
          git clone https://github.com/huggingface/transformers
          cd transformers
          git checkout 981c276
          pip install .

      - name: Install Python dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements/requirements.txt
          pip install -r requirements/requirements-dev.txt
          pip install -r requirements/requirements-deepcompile.txt
          pip install pytest-timeout pytest-instafail

      - name: Check environment
        run: |
          echo "=== GPU Information ==="
          nvidia-smi
          echo ""
          echo "=== CUDA Version ==="
          nvcc --version
          echo ""
          echo "=== Python/PyTorch Info ==="
          python --version
          python -c "import torch; print(f'PyTorch: {torch.__version__}')"
          python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
          python -c "import torch; print(f'CUDA devices: {torch.cuda.device_count()}')"
          python -c "import torch; print(f'BF16 support: {torch.cuda.is_bf16_supported()}')"
          echo ""
          echo "=== CUTLASS ==="
          echo "CUTLASS_PATH: $CUTLASS_PATH"
          ls -la $CUTLASS_PATH/include/ | head -5

      - name: Install DeepSpeed
        run: |
          # Initialize CUDA before install so setup.py can detect NCCL version
          python -c "import torch; torch.cuda.init(); print(f'NCCL version: {torch.cuda.nccl.version()}')"
          # Use --no-build-isolation so setup.py can access pre-installed PyTorch
          pip install --no-build-isolation .[dev,1bit,autotuning,deepcompile]
          ds_report

      - name: Python environment
        run: |
          pip list

      - name: Unit tests (parallel)
        run: |
          export TORCH_CUDA_ARCH_LIST="8.9"
          cd tests
          # Skip tests requiring unavailable hardware or known issues:
          # - nvme checkpointing: no nvme device
          # - GDS tests: no GPUDirect Storage support
          # - launcher user_args: pdsh requires SSH server
          # - zenflow: Stage 3 tests have pre-existing bugs + CUDA/fork issues
          rm -rf /mnt/aio/pytest
          pytest --instafail --timeout 600 --forked -n 8 --basetemp=/mnt/aio/pytest unit/ \
            --ignore=unit/runtime/zero/test_nvme_checkpointing.py \
            --ignore=unit/ops/aio/test_gds.py \
            --ignore=unit/launcher/test_user_args.py \
            --ignore=unit/runtime/zenflow \
            --ignore=unit/ops/adam/test_zf_torch_adam.py \
            --torch_ver=${{ env.TORCH_VER }} --cuda_ver=${{ env.CUDA_VER }}

      - name: Unit tests (sequential)
        run: |
          export TORCH_CUDA_ARCH_LIST="8.9"
          cd tests
          rm -rf /mnt/aio/pytest
          pytest --instafail --timeout 600 --forked -m 'sequential' --basetemp=/mnt/aio/pytest unit/ \
            --ignore=unit/runtime/zero/test_nvme_checkpointing.py \
            --ignore=unit/ops/aio/test_gds.py \
            --ignore=unit/launcher/test_user_args.py \
            --ignore=unit/runtime/zenflow \
            --ignore=unit/ops/adam/test_zf_torch_adam.py \
            --torch_ver=${{ env.TORCH_VER }} --cuda_ver=${{ env.CUDA_VER }}
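A note on the /mnt/aio mount above: the async I/O (aio) tests open files with O_DIRECT, which the container's overlayfs root does not support, so the workflow binds in a native filesystem. As a rough, standalone illustration only (this probe helper is not part of the workflow or of DeepSpeed; the function name and path are made up for the example), a directory's O_DIRECT support can be checked on Linux like this:

import os


def supports_o_direct(path: str) -> bool:
    """Return True if a file under `path` can be opened with O_DIRECT."""
    probe = os.path.join(path, ".o_direct_probe")
    try:
        fd = os.open(probe, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
        os.close(fd)
        return True
    except OSError:
        # overlayfs (the container root) typically rejects O_DIRECT
        return False
    finally:
        if os.path.exists(probe):
            os.remove(probe)


print(supports_o_direct("/mnt/aio"))  # expected to be True on the CI's native mount

The test steps also point pytest's --basetemp at the same mount, presumably so that temporary files created by the aio tests land on the native filesystem as well.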

tests/unit/common.py

Lines changed: 6 additions & 0 deletions
@@ -273,6 +273,12 @@ def _launch_procs(self, num_procs, init_method):
             self.non_daemonic_procs = True
             self.reuse_dist_env = False

+        # Allow disabling reuse_dist_env via environment variable.
+        # This is useful for CI full test runs where reusing distributed environment
+        # can cause pool worker cleanup to hang after tests complete.
+        if os.environ.get('DS_DISABLE_REUSE_DIST_ENV', '0') == '1':
+            self.reuse_dist_env = False
+
         # Set start method to `forkserver` (or `fork`)
         mp.set_start_method('forkserver', force=True)
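With this hook in place, exporting DS_DISABLE_REUSE_DIST_ENV=1 (as the workflow's env block does) forces self.reuse_dist_env to False, so the harness does not reuse the distributed environment across tests. As an illustration only (not part of this commit; the flags are illustrative and assume you run from the tests/ directory), the same behavior can be reproduced locally by setting the variable before launching pytest:

import os

import pytest

# Mirror the CI setting so the harness does not reuse distributed environments.
os.environ["DS_DISABLE_REUSE_DIST_ENV"] = "1"

# Run only the sequential-marked subset, as the workflow's second test step does.
raise SystemExit(pytest.main(["-m", "sequential", "unit/"]))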
