
Running Python tests at the same time as async tests can cause memory issues #28

@pechersky

Description

Built at tag v0.2.0 with CUDA 13, CCCL 3.0, and rdkit 2025.9.1 (pip installed).

Test command

python3.12 -m pytest --pyargs /app/nvMolKit/nvmolkit/tests

When running with -k "not async", many of the tests below pass.
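For reference, the passing subset above corresponds to the same test command with pytest's -k filter added to deselect the async-labelled tests:

python3.12 -m pytest --pyargs /app/nvMolKit/nvmolkit/tests -k "not async"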

Example error

/app/nvMolKit/nvmolkit/tests/test_types.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

device = None

    def synchronize(device: "Device" = None) -> None:
        r"""Wait for all kernels in all streams on a CUDA device to complete.
    
        Args:
            device (torch.device or int, optional): device for which to synchronize.
                It uses the current device, given by :func:`~torch.cuda.current_device`,
                if :attr:`device` is ``None`` (default).
        """
        _lazy_init()
        with torch.cuda.device(device):
>           return torch._C._cuda_synchronize()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E           torch.AcceleratorError: CUDA error: an illegal memory access was encountered
E           Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
E           CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E           For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E           Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

/opt/python/cp312-cp312/lib/python3.12/site-packages/torch/cuda/__init__.py:1083: AcceleratorError
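As the error message itself suggests, re-running with CUDA_LAUNCH_BLOCKING=1 should make kernel launches synchronous so the illegal access is reported at the offending call rather than at a later torch.cuda.synchronize(). One possible invocation (the -k "async" filter and -x flag are just one way to stop at the first failing async test, not part of the original report):

CUDA_LAUNCH_BLOCKING=1 python3.12 -m pytest --pyargs /app/nvMolKit/nvmolkit/tests -k "async" -x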

All tests that failed:

FAILED ../app/nvMolKit/nvmolkit/tests/test_mmff_optimization.py::test_mmff_optimization_batch_vs_rdkit[1-0-gpu_ids1] - AssertionError: Molecule 0, Conformer 0: energy mismatch: RDKit=26.874311, nvMolKit=125669.641792, abs_diff=125642.767481, rel_error=4675.199514
FAILED ../app/nvMolKit/nvmolkit/tests/test_mmff_optimization.py::test_mmff_optimization_batch_vs_rdkit[1-2-gpu_ids1] - RuntimeError: Encountered CUDA error 101: invalid device ordinal
FAILED ../app/nvMolKit/nvmolkit/tests/test_mmff_optimization.py::test_mmff_optimization_batch_vs_rdkit[1-5-gpu_ids1] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_mmff_optimization.py::test_mmff_optimization_batch_vs_rdkit[3-0-gpu_ids1] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_mmff_optimization.py::test_mmff_optimization_batch_vs_rdkit[3-2-gpu_ids1] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_mmff_optimization.py::test_mmff_optimization_batch_vs_rdkit[3-5-gpu_ids1] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_mmff_optimization.py::test_mmff_optimization_allows_large_molecule_interleaved - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_mmff_optimization.py::test_error_case_throws_properly - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_cross_similarity_fp_mismatch[tanimoto] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_cross_similarity_fp_mismatch[cosine] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nvmolkit_cross_tanimoto_similarity_from_nvmolkit_fp - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_tanimoto_similarity_from_nvmolkit_fp[nxmdims0] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_tanimoto_similarity_from_nvmolkit_fp[nxmdims1] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_tanimoto_similarity_from_nvmolkit_fp[nxmdims2] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_tanimoto_similarity_from_nvmolkit_fp[nxmdims3] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_tanimoto_similarity_from_packing[nxmdims0] - torch.AcceleratorError: CUDA error: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_tanimoto_similarity_from_packing[nxmdims1] - torch.AcceleratorError: CUDA error: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_tanimoto_similarity_from_packing[nxmdims2] - torch.AcceleratorError: CUDA error: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nvmolkit_cross_cosine_similarity_from_nvmolkit_fp - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_cosine_similarity_from_nvmolkit_fp[nxmdims0] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_cosine_similarity_from_nvmolkit_fp[nxmdims1] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_cosine_similarity_from_nvmolkit_fp[nxmdims2] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_cosine_similarity_from_nvmolkit_fp[nxmdims3] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_cosine_similarity_from_nvmolkit_fp[nxmdims4] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_nxm_cross_cosine_similarity_from_nvmolkit_fp[nxmdims5] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_tanimoto_self - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_tanimoto_cross[nxmdims0] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_tanimoto_cross[nxmdims1] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_tanimoto_cross[nxmdims2] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_cosine_self - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_cosine_cross[nxmdims0] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_cosine_cross[nxmdims1] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_cosine_cross[nxmdims2] - RuntimeError: Encountered CUDA error 700: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_segmented_path_large_cross[tanimoto] - torch.AcceleratorError: CUDA error: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_similarity.py::test_memory_constrained_segmented_path_large_cross[cosine] - torch.AcceleratorError: CUDA error: an illegal memory access was encountered
FAILED ../app/nvMolKit/nvmolkit/tests/test_types.py::test_async_gpu_result_release_frees_memory - torch.AcceleratorError: CUDA error: an illegal memory access was encountered
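The cascade of failures in otherwise unrelated tests after the first illegal memory access is consistent with the CUDA context being poisoned for the remainder of the pytest process. A hedged way to check that hypothesis, assuming the pytest-forked plugin is installed, is to run each test in its own subprocess so a corrupted context cannot leak into later tests:

python3.12 -m pytest --pyargs /app/nvMolKit/nvmolkit/tests --forked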

nvidia-smi:

$ nvidia-smi
Tue Oct 21 03:25:06 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.02              Driver Version: 581.42         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4500 Laptop GPU    On  |   00000000:01:00.0 Off |                  Off |
| N/A   50C    P0             33W /   91W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
