
[REVIEW] Set cudaFuncAttributeMaxDynamicSharedMemorySize with thread-safety#1771

Open
mythrocks wants to merge 7 commits into rapidsai:main from mythrocks:cuda-invalid-argument-kernel-error

Conversation

@mythrocks
Contributor


This commit ensures that when the setup runs for the CAGRA search kernel,
the calls to set `cudaFuncAttributeMaxDynamicSharedMemorySize` are thread-safe.

The CUDA-context in a process is shared across all (CPU) threads.

Consider 2 (CPU) threads A and B, attempting to set max-dynamic-shared-mem
sizes to 3KB and 2KB respectively. Consider the following race condition (sketched in the code snippet below):
1. Thread A successfully sets 3KB, but is then switched out before
   the CAGRA kernel is called.
2. Thread B sets 2KB, calls the kernel successfully, and exits.
3. Thread A calls its kernel, which attempts to use 3KB of smem. But the
   CUDA context's max-smem is currently set to 2KB.  This causes the kernel
   launch to fail.
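
For illustration, the unsynchronized sequence looks roughly like the sketch below (simplified; the function and kernel names are placeholders, not the actual cuVS identifiers):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel; stands in for the CAGRA search kernel.
__global__ void search_kernel() { /* ... */ }

void setup_and_launch(int smem_size, dim3 grid, dim3 block, cudaStream_t stream)
{
  // Thread A sets 3KB here, then may be preempted ...
  cudaFuncSetAttribute(search_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
  // ... Thread B sets 2KB and launches successfully. When Thread A resumes,
  // its 3KB launch exceeds the currently-set maximum and fails with
  // cudaErrorInvalidValue. The attribute lives in the process-wide CUDA
  // context, so nothing prevents this interleaving.
  search_kernel<<<grid, block, smem_size, stream>>>();
}
```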

The error manifests as listed in rapidsai/cuvs-lucene#93:
```
CUDA error encountered at: file=/.../search_single_cta_kernel-inl.cuh line=2207: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
Obtained 8 stack frames
...
```

This commit adds logic to track the high-water mark for max-dynamic-shared-memory,
and makes setting that value in the CUDA context atomic with respect to other threads
(it cannot be interleaved with another thread's update). This way, every thread launching
the same kernel type runs with the largest maximum threshold set so far.
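
Roughly, the idea looks like the following sketch (simplified; the actual helper in the diff may differ slightly):

```cpp
#include <cuda_runtime.h>

#include <cstdint>
#include <mutex>

// One mutex and one running maximum per template instantiation, i.e. per kernel type.
template <typename KernelT>
void optionally_set_larger_max_smem_size(uint32_t smem_size, KernelT& kernel)
{
  static auto mutex                 = std::mutex{};
  static auto running_max_smem_size = uint32_t{0};

  auto guard = std::lock_guard<std::mutex>{mutex};
  if (smem_size > running_max_smem_size) {
    // Error checking omitted in this sketch.
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(smem_size));
    running_max_smem_size = smem_size;
  }
}
```

With this, a smaller request that arrives after a larger one is a no-op, so the attribute only ever grows.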

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks mythrocks self-assigned this Feb 4, 2026
@mythrocks mythrocks requested a review from a team as a code owner February 4, 2026 18:58
@mythrocks mythrocks added bug Something isn't working non-breaking Introduces a non-breaking change labels Feb 4, 2026
@mythrocks mythrocks changed the title Set cudaFuncAttributeMaxDynamicSharedMemorySize with thread-safety [REVIEW] Set cudaFuncAttributeMaxDynamicSharedMemorySize with thread-safety Feb 4, 2026
@mythrocks
Contributor Author

For the kind attention of @tfeher, @cjnolet.

I've refrained from introducing a new `.cu` file / translation unit, and have introduced the `std::mutex` as a function-local static. Please let me know if object lifetimes are a concern. By my reading of the code, different instantiations will live separately and not clash.

Note that `select_and_run` is a function template, with a different instantiation per `search_kernel_config` definition.
Note, also, that `cudaFuncSetAttribute` is itself a function template, parametrized on the function type.

The above should imply that `cudaFuncAttributeMaxDynamicSharedMemorySize` is tracked separately for `search_single_cta_float_uint32`, `search_single_cta_int8_uint32`, etc.
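
As a quick stand-alone illustration (hypothetical code, nothing to do with cuVS), each instantiation of a function template gets its own copies of its function-local statics:

```cpp
#include <cstdio>

template <typename T>
int bump_counter()
{
  static int counter = 0;  // one counter per distinct T
  return ++counter;
}

int main()
{
  std::printf("%d\n", bump_counter<float>());  // prints 1
  std::printf("%d\n", bump_counter<float>());  // prints 2
  std::printf("%d\n", bump_counter<int>());    // prints 1 (separate static)
}
```

The same holds for the `std::mutex` and the running maximum: one set per kernel type.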

@mythrocks
Contributor Author

mythrocks commented Feb 4, 2026

CI is failing, pending merge for #1759.

In the meantime, I have moved the other calls that set `cudaFuncAttributeMaxDynamicSharedMemorySize` over to the new utility.

Contributor

@viclafargue viclafargue left a comment


Thank you for working on this. LGTM, apart from a small comment.

The only issue that I can think of is that `cudaFuncAttributeMaxDynamicSharedMemorySize` could at some point be set to a value much higher than needed, causing performance issues (lower occupancy), as discussed.

I thought about it, and it looks to me like registering a CUDA kernel for launch should be almost instantaneous. Therefore we could have something like this:

```cpp
static auto mutex = std::mutex{};
{
    auto guard = std::lock_guard<std::mutex>{mutex};
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
    kernel<<<block_dims, thread_dims, smem_size, stream>>>(...);
}
```

```cpp
void optionally_set_larger_max_smem_size(uint32_t smem_size, KernelT& kernel)
{
  static auto mutex                 = std::mutex{};
  static auto running_max_smem_size = uint32_t{0};
```
Contributor


I would maybe make `running_max_smem_size` a `std::atomic`, for more safety.
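
For illustration, one way to keep a monotonically growing maximum with `std::atomic` (hypothetical snippet, not code from this PR):

```cpp
#include <atomic>
#include <cstdint>

static std::atomic<uint32_t> running_max_smem_size{0};

// Returns true if smem_size became the new maximum. Note the
// cudaFuncSetAttribute() call itself would still need the mutex if it must
// stay paired with the value recorded here.
bool update_running_max(uint32_t smem_size)
{
  auto observed = running_max_smem_size.load(std::memory_order_relaxed);
  while (smem_size > observed &&
         !running_max_smem_size.compare_exchange_weak(observed, smem_size,
                                                      std::memory_order_relaxed)) {
    // On failure, compare_exchange_weak refreshes `observed`; retry.
  }
  return smem_size > observed;
}
```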

```cpp
 * @param smem_size The size of the dynamic shared memory to be set.
 * @param kernel The kernel to be set.
 */
template <typename KernelT>
```
Contributor


Should work fine if the kernel type is different for each CUDA kernel instantiation. I am not familiar with this.

@achirkin
Contributor

achirkin commented Feb 5, 2026

I'd suggest we hold off on this PR for a little while, until we finish the discussion in rapidsai/cuvs-lucene#93. While the explanation of the bug seems plausible, I'm not 100% convinced this is the real cause; the extra sync introduced here may just be hiding the other issue by serializing execution.

Even if we confirm this is indeed the case, I'd like to do a little more benchmarking to better understand how this affects high-concurrency scenarios that use the persistent CAGRA kernel. In such a scenario, any call to the CUDA API, even the smallest one, may cause a slowdown, because all calls into the CUDA context are serialized (i.e. guarded by mutexes).

