[REVIEW] Set `cudaFuncAttributeMaxDynamicSharedMemorySize` with thread-safety #1771
mythrocks wants to merge 7 commits into rapidsai:main

Conversation
This commit ensures that when the setup runs for the CAGRA search kernel, the calls to set `cudaFuncAttributeMaxDynamicSharedMemorySize` are thread-safe.

The CUDA context in a process is shared across all (CPU) threads. Consider 2 (CPU) threads A and B, attempting to set max-dynamic-shared-mem sizes to 3KB and 2KB respectively, and the following race condition:

1. Thread A successfully sets 3KB, but then gets switched out before the CAGRA kernel is called.
2. Thread B sets 2KB, calls the kernel successfully, and exits.
3. Thread A calls its kernel, which attempts to use 3KB of smem. But the CUDA context's max-smem is currently set to 2KB. This causes the kernel launch to fail.

The error manifests as listed in rapidsai/cuvs-lucene#93:

```
CUDA error encountered at: file=/.../search_single_cta_kernel-inl.cuh line=2207: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
Obtained 8 stack frames
...
```

This commit adds logic to track the high-water mark for max-dynamic-shared-memory, and makes setting that value in the CUDA context uninterruptible. This way, every thread using the same kernel type gets the largest maximum threshold set so far.

Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
For the kind attention of @tfeher, @cjnolet. I've refrained from introducing a new […] Note that […] The above should imply that the […]
Signed-off-by: MithunR <mithunr@nvidia.com>
CI is failing, pending merge for #1759. In the meantime, […]
Signed-off-by: MithunR <mithunr@nvidia.com>
Signed-off-by: MithunR <mithunr@nvidia.com>
…ally_set()` function. Signed-off-by: MithunR <mithunr@nvidia.com>
Thank you for working on this. LGTM, apart from a small comment.

The only issue that I can think of is that cudaFuncAttributeMaxDynamicSharedMemorySize can at one point be set to values much higher than needed, causing performance issues (lower occupancy), as discussed.

I thought about it, and it looks to me like registering a CUDA kernel for launch should be almost instantaneous. Therefore we could have something like this:
```cpp
static auto mutex = std::mutex{};
{
  auto guard = std::lock_guard<std::mutex>{mutex};
  cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size);
  kernel<<<block_dims, thread_dims, smem_size, stream>>>(...);
}
```
```cpp
void optionally_set_larger_max_smem_size(uint32_t smem_size, KernelT& kernel)
{
  static auto mutex                 = std::mutex{};
  static auto running_max_smem_size = uint32_t{0};
```
I would maybe make running_max_smem_size be a std::atomic for more safety.
```cpp
 * @param smem_size The size of the dynamic shared memory to be set.
 * @param kernel The kernel to be set.
 */
template <typename KernelT>
```
Should work fine if the kernel type is different for each CUDA kernel instantiation. I am not familiar with this.
I'd suggest we hold off on this PR for a little while, until we finish the discussion in rapidsai/cuvs-lucene#93. While the explanation of the bug seems plausible, I'm not 100% convinced this is the real cause; the extra sync introduced here may just be hiding the other issue by serializing execution. Even if we confirm this is indeed the cause, I'd like to do a little more benchmarking to better understand how this affects the high-concurrency scenario of using the persistent CAGRA kernel. In such a scenario, any call to the CUDA API, even the smallest, may cause a slowdown, because all calls into the CUDA context are serialized (i.e. guarded by mutexes).