
[REVIEW] Set cudaFuncAttributeMaxDynamicSharedMemorySize with thread-safety#1771

Open
mythrocks wants to merge 7 commits into rapidsai:main from mythrocks:cuda-invalid-argument-kernel-error

Conversation

@mythrocks
Contributor


This commit ensures that when the setup runs for the CAGRA search kernel,
the calls to set `cudaFuncAttributeMaxDynamicSharedMemorySize` are thread-safe.

The CUDA-context in a process is shared across all (CPU) threads.

Consider 2 (CPU) threads A and B, attempting to set max-dynamic-shared-mem
sizes to 3KB and 2KB respectively. Consider the following race condition (sketched in the code snippet below):
1. Thread A successfully sets 3KB, but is then switched out before
   the CAGRA kernel is called.
2. Thread B sets 2KB, calls the kernel successfully, and exits.
3. Thread A calls its kernel, which attempts to use 3KB of smem. But the
   CUDA context's max-smem is currently set to 2KB.  This causes the kernel
   launch to fail.
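
For illustration, the unsynchronized sequence looks roughly like the sketch below (simplified; the function and kernel names are placeholders, not the actual cuVS identifiers):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel; stands in for the CAGRA search kernel.
__global__ void search_kernel() { /* ... */ }

void setup_and_launch(int smem_size, dim3 grid, dim3 block, cudaStream_t stream)
{
  // Thread A sets 3KB here, then may be preempted ...
  cudaFuncSetAttribute(search_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
  // ... Thread B sets 2KB and launches successfully. When Thread A resumes,
  // its 3KB launch exceeds the currently-set maximum and fails with
  // cudaErrorInvalidValue. The attribute lives in the process-wide CUDA
  // context, so nothing prevents this interleaving.
  search_kernel<<<grid, block, smem_size, stream>>>();
}
```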

The error manifests as listed in rapidsai/cuvs-lucene#93:
```
CUDA error encountered at: file=/.../search_single_cta_kernel-inl.cuh line=2207: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
Obtained 8 stack frames
...
```

This commit adds logic to track the high-water mark for max-dynamic-shared-memory,
and makes setting that value in the CUDA context atomic with respect to other threads
(it cannot be interleaved with another thread's update). This way, every thread launching
the same kernel type runs with the largest maximum threshold set so far.
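
Roughly, the idea looks like the following sketch (simplified; the actual helper in the diff may differ slightly):

```cpp
#include <cuda_runtime.h>

#include <cstdint>
#include <mutex>

// One mutex and one running maximum per template instantiation, i.e. per kernel type.
template <typename KernelT>
void optionally_set_larger_max_smem_size(uint32_t smem_size, KernelT& kernel)
{
  static auto mutex                 = std::mutex{};
  static auto running_max_smem_size = uint32_t{0};

  auto guard = std::lock_guard<std::mutex>{mutex};
  if (smem_size > running_max_smem_size) {
    // Error checking omitted in this sketch.
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         static_cast<int>(smem_size));
    running_max_smem_size = smem_size;
  }
}
```

With this, a smaller request that arrives after a larger one is a no-op, so the attribute only ever grows.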

Signed-off-by: MithunR <mithunr@nvidia.com>
@mythrocks mythrocks self-assigned this Feb 4, 2026
@mythrocks mythrocks requested a review from a team as a code owner February 4, 2026 18:58
@mythrocks mythrocks added bug Something isn't working non-breaking Introduces a non-breaking change labels Feb 4, 2026
@mythrocks mythrocks changed the title Set cudaFuncAttributeMaxDynamicSharedMemorySize with thread-safety [REVIEW] Set cudaFuncAttributeMaxDynamicSharedMemorySize with thread-safety Feb 4, 2026
@mythrocks
Contributor Author

For the kind attention of @tfeher, @cjnolet.

I've refrained from introducing a new `.cu` file / translation unit, and have introduced the `std::mutex` as a function-local static. Please let me know if object lifetimes are a concern. By my reading of the code, different instantiations will live separately and not clash.

Note that `select_and_run` is a function template, with a different instantiation per `search_kernel_config` definition.
Note, also, that `cudaFuncSetAttribute` is itself a function template, parametrized on the function type.

The above should imply that `cudaFuncAttributeMaxDynamicSharedMemorySize` is tracked separately for `search_single_cta_float_uint32`, `search_single_cta_int8_uint32`, etc.
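
As a quick stand-alone illustration (hypothetical code, nothing to do with cuVS), each instantiation of a function template gets its own copies of its function-local statics:

```cpp
#include <cstdio>

template <typename T>
int bump_counter()
{
  static int counter = 0;  // one counter per distinct T
  return ++counter;
}

int main()
{
  std::printf("%d\n", bump_counter<float>());  // prints 1
  std::printf("%d\n", bump_counter<float>());  // prints 2
  std::printf("%d\n", bump_counter<int>());    // prints 1 (separate static)
}
```

The same holds for the `std::mutex` and the running maximum: one set per kernel type.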

@mythrocks
Contributor Author

mythrocks commented Feb 4, 2026

CI is failing, pending merge for #1759.

In the meantime, I have moved the other calls that set `cudaFuncAttributeMaxDynamicSharedMemorySize` over to the new utility.

Contributor

@viclafargue viclafargue left a comment


Thank you for working on this. LGTM, apart from a small comment.

The only issue that I can think of is that `cudaFuncAttributeMaxDynamicSharedMemorySize` could at some point be set to a value much higher than needed, causing performance issues (lower occupancy), as discussed.

I thought about it, and it looks to me like registering a CUDA kernel for launch should be almost instantaneous. Therefore we could have something like this:

```cpp
static auto mutex = std::mutex{};
{
    auto guard = std::lock_guard<std::mutex>{mutex};
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
    kernel<<<block_dims, thread_dims, smem_size, stream>>>(...);
}
```

```cpp
void optionally_set_larger_max_smem_size(uint32_t smem_size, KernelT& kernel)
{
  static auto mutex                 = std::mutex{};
  static auto running_max_smem_size = uint32_t{0};
```
Contributor


I would maybe make `running_max_smem_size` a `std::atomic`, for more safety.
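
For illustration, one way to keep a monotonically growing maximum with `std::atomic` (hypothetical snippet, not code from this PR):

```cpp
#include <atomic>
#include <cstdint>

static std::atomic<uint32_t> running_max_smem_size{0};

// Returns true if smem_size became the new maximum. Note the
// cudaFuncSetAttribute() call itself would still need the mutex if it must
// stay paired with the value recorded here.
bool update_running_max(uint32_t smem_size)
{
  auto observed = running_max_smem_size.load(std::memory_order_relaxed);
  while (smem_size > observed &&
         !running_max_smem_size.compare_exchange_weak(observed, smem_size,
                                                      std::memory_order_relaxed)) {
    // On failure, compare_exchange_weak refreshes `observed`; retry.
  }
  return smem_size > observed;
}
```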

```cpp
 * @param smem_size The size of the dynamic shared memory to be set.
 * @param kernel The kernel to be set.
 */
template <typename KernelT>
```
Contributor


Should work fine if the kernel type is different for each CUDA kernel instantiation. I am not familiar with this.

@achirkin
Contributor

achirkin commented Feb 5, 2026

I'd suggest we hold off on this PR for a little while, until we finish the discussion in rapidsai/cuvs-lucene#93. While the explanation of the bug seems plausible, I'm not 100% convinced this is the real cause; the extra sync introduced here may just be hiding the other issue by serializing execution.

Even if we confirm this is indeed the case, I'd like to do a little more benchmarking to better understand how this affects high-concurrency scenarios that use the persistent CAGRA kernel. In such a scenario, any call to the CUDA API, even the smallest one, may cause a slowdown, because all calls into the CUDA context are serialized (i.e. guarded by mutexes).

