# Mellea Test Suite

Test files must be named `test_*.py` so that pydocstyle ignores them.

## Running Tests

```bash
# Fast tests only (~2 min) - skips qualitative and slow tests
uv run pytest -m "not qualitative"

# Default - includes qualitative tests, skips slow tests
uv run pytest

# Slow tests only (>5 min)
uv run pytest -m slow

# All tests, including slow ones (only when the pytest.ini marker filter is absent)
uv run pytest
```

## GPU Testing on CUDA Systems

### The Problem: CUDA EXCLUSIVE_PROCESS Mode

When running GPU tests on systems with `EXCLUSIVE_PROCESS` mode (common on HPC clusters), you may encounter "CUDA device busy" errors. This happens because:

1. **Parent Process Context**: The pytest parent process creates a CUDA context when running regular tests
2. **Subprocess Blocking**: Example tests run in subprocesses (via `docs/examples/conftest.py`)
3. **Exclusive Access**: In `EXCLUSIVE_PROCESS` mode, only one process can hold a CUDA context per GPU
4. **Result**: Subprocesses fail with "CUDA device busy" when the parent still holds the context

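You can confirm a device's compute mode with `nvidia-smi` before digging further:

```bash
# List each GPU's compute mode; "Exclusive_Process" means only one process
# can hold a CUDA context on that device at a time
nvidia-smi --query-gpu=index,compute_mode --format=csv
```
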
### Solution 1: NVIDIA MPS (Recommended)

**NVIDIA Multi-Process Service (MPS)** allows multiple processes to share a GPU in `EXCLUSIVE_PROCESS` mode:

```bash
# Enable MPS in your job scheduler configuration
# Consult your HPC documentation for specific syntax
```

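On a node where you control the GPU directly, a minimal manual setup looks roughly like the following sketch (the pipe/log directories are illustrative; managed clusters usually expose MPS through scheduler options instead):

```bash
# Point the MPS daemon at writable pipe/log directories (paths are illustrative)
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Test processes now share the GPU through the MPS server
uv run pytest

# Shut the daemon down when finished
echo quit | nvidia-cuda-mps-control
```
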
### Why This Matters

The test infrastructure runs examples in subprocesses (see `docs/examples/conftest.py`) to:
- Isolate example execution environments
- Capture stdout/stderr cleanly
- Prevent cross-contamination between examples

However, this creates the "Parent Trap": the parent pytest process holds a CUDA context from running regular tests, blocking subprocesses from accessing the GPU.

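A practical corollary: running only the example tests avoids the trap, because the parent never executes an in-process GPU test first (assuming the examples are collected from `docs/examples`):

```bash
# Run just the example tests, each in its own subprocess
uv run pytest docs/examples
```
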
### Technical Details

**CUDA Context Lifecycle**:
- Created on first CUDA operation (e.g., `torch.cuda.is_available()`)
- Persists until process exit or explicit `cudaDeviceReset()`
- In `EXCLUSIVE_PROCESS` mode, blocks other processes from GPU access

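The trap can be reproduced by hand on an `EXCLUSIVE_PROCESS` GPU. A minimal sketch, assuming PyTorch is installed (sleep durations are arbitrary):

```bash
# Process 1 creates a CUDA context and holds it for 30 seconds
python -c "import torch, time; torch.cuda.init(); time.sleep(30)" &

# Give the background process time to initialize its context
sleep 5

# Process 2 then fails with a "device busy" CUDA error
python -c "import torch; print(torch.zeros(1, device='cuda'))"
wait
```
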
**MPS Architecture**:
- Runs as a proxy service between applications and the GPU driver
- Multiplexes CUDA contexts from multiple processes onto a single GPU
- Transparent to applications - no code changes needed
- Requires explicit enablement via job scheduler flags

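To check that MPS is actually in effect:

```bash
# The control daemon (and, once a client connects, the server) should be running
pgrep -af nvidia-cuda-mps

# Under MPS, GPU work is attributed to the nvidia-cuda-mps-server process
nvidia-smi
```
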
**Alternative Approaches Tried** (documented in `GPU_PARENT_TRAP_SOLUTION.md`):
- ❌ `torch.cuda.empty_cache()` - Only affects the PyTorch allocator, not the driver context
- ❌ `cudaDeviceReset()` in subprocesses - Parent still holds its context
- ❌ Inter-example delays - Don't release the parent context
- ❌ pynvml polling - Can't force the parent to release its context
- ✅ MPS - Allows GPU sharing without code changes

## Test Markers

See [`MARKERS_GUIDE.md`](MARKERS_GUIDE.md) for complete marker documentation.

Key markers for GPU testing:
- `@pytest.mark.huggingface` - Requires the HuggingFace backend (local, GPU-heavy)
- `@pytest.mark.requires_gpu` - Requires GPU hardware
- `@pytest.mark.requires_heavy_ram` - Requires 48GB+ RAM
- `@pytest.mark.slow` - Tests taking >5 minutes

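Markers compose with pytest `-m` expressions, for example:

```bash
# Skip GPU-heavy tests on a CPU-only machine
uv run pytest -m "not huggingface and not requires_gpu"

# Run only GPU tests that are not slow
uv run pytest -m "requires_gpu and not slow"
```
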
## Coverage

Coverage reports are generated in `htmlcov/` and `coverage.json`.
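
A minimal invocation to regenerate both reports, assuming the `pytest-cov` plugin and a top-level `mellea` package (the repository's pytest configuration may already set these flags):

```bash
# HTML report is written to htmlcov/, JSON to coverage.json
uv run pytest --cov=mellea --cov-report=html --cov-report=json
```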