cuda.compute: Use native CCCL.c support for stateful ops #7500

shwina merged 13 commits into NVIDIA:main
Conversation
```python
class _StatelessOp(OpAdapter):
    """Adapter for stateless callables."""
```

```python
    Then the transformed function will be:

        def func(x, state): return x + state[0]
```

Review comment: Maybe `state` should go first.
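A minimal sketch of the kind of transformation discussed here, assuming nothing beyond the `(x, state)` signature shown above; `adapt_stateless` is a hypothetical helper, not an actual cuda.compute API:

```python
# Illustrative sketch only: wrap a stateless callable so it matches
# the stateful (x, state) signature. `adapt_stateless` is hypothetical.
def adapt_stateless(op):
    """Adapt a stateless callable to the stateful calling convention."""
    def wrapped(x, state):
        # Stateless ops simply ignore the state tuple.
        return op(x)
    return wrapped

inc = adapt_stateless(lambda x: x + 1)
assert inc(41, state=()) == 42
```

Whether `state` comes first or last is exactly the argument-order question raised in the comment above.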
| """ | ||
| import struct | ||
|
|
||
| from . import types as cccl_types |
There was a problem hiding this comment.
Do these imports need to be here?
```python
    Returns:
        A value with appropriate subtype of _BaseOp
    """
    from ._jit import to_jit_op_adapter
```

Review comment: Try to move this out because it's a non-trivial cost.

Reply: Unfortunately, `_jit` needs to import `OpAdapter` to subclass it. I think we should make it a protocol so that `_jit.py` won't need to import from here.
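A sketch of what that protocol suggestion could look like, assuming a structural `typing.Protocol`; the names `OpAdapterLike` and `compile` are illustrative, not the library's real interface:

```python
# Sketch: making the adapter interface a Protocol so that _jit.py
# can conform to it without importing the adapters module.
from typing import Protocol, runtime_checkable

@runtime_checkable
class OpAdapterLike(Protocol):
    """Structural interface: any class with a matching method conforms."""
    def compile(self, signature: str): ...

# Defined with no import of the module that declares OpAdapterLike:
class JitOpAdapter:
    def compile(self, signature: str):
        return f"compiled({signature})"

assert isinstance(JitOpAdapter(), OpAdapterLike)
```

Because conformance is structural, the import dependency runs only one way, which is the point of the suggestion.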
Review comment: What about other algorithms like `merge_sort`?
NaderAlAwar left a comment:

Approving, but we must revisit and measure the impact of creating op adapters in every `__call__`.
```python
            # by stateful op machinery, which enables updating the state
            # (pointers). Thus, we only cache on the dtype and shape of
            # the referenced array, but not its pointer.
            return (get_dtype(value), get_shape(value))
```

Question: Remind me again why we cache on shape?
Reply: Here's an (admittedly made-up) example of when it could matter:

```python
import numpy as np
import cupy as cp
import cuda.compute as cc

def make_op(arr):
    def op(x):
        return x > len(arr)
    return op

d_in = cp.asarray([1, 2, 3])
d_out = cp.empty_like(d_in, dtype=bool)

op1 = make_op(cp.empty(1))  # len(arr) == 1
op2 = make_op(cp.empty(2))  # len(arr) == 2

cc.unary_transform(d_in, d_out, op1, len(d_in))
print(d_out)  # [False  True  True]

cc.unary_transform(d_in, d_out, op2, len(d_in))
print(d_out)  # [False False  True]
```

Without including the shape in the cache key, the second call would reuse the build cached for `op1` and also print `[False  True  True]`.

Added this as a unit test.
```python
        for name in code.co_names:
            val = func.__globals__.get(name)
            if val is not None and hasattr(val, "__cuda_array_interface__"):
```

Review comment: Important: use the newly added `is_device_array` instead of `hasattr`.
🥳 CI Workflow Results: 🟩 Finished in 2h 21m — Pass: 100%/56 | Total: 19h 22m | Max: 1h 05m
Description
Closes #7498.
See the issue above for design/rationale.
Performance
As described in the issue, the cost of inspecting and extracting state adds roughly 1 µs of overhead to every invocation. This can and should be eliminated for stateless operations.
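A sketch of how the per-invocation overhead could be measured; `inspect_state` is a hypothetical stand-in for the state-extraction path, not a cuda.compute function:

```python
# Hypothetical micro-benchmark sketch. `inspect_state` is a placeholder
# for the per-call state-inspection work, not a real cuda.compute API.
import timeit

def inspect_state(func):
    # Placeholder: walk the referenced globals, as the extraction path does.
    return [func.__globals__.get(n) for n in func.__code__.co_names]

def op(x):
    return x + 1

n = 100_000
per_call = timeit.timeit(lambda: inspect_state(op), number=n) / n
print(f"{per_call * 1e6:.2f} us per call")
```

A benchmark like this, run with and without the extraction step, would quantify the overhead that the stateless fast path is meant to remove.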
Checklist