
cuda.compute: Use native CCCL.c support for stateful ops #7500

Merged
shwina merged 13 commits into NVIDIA:main from shwina:add-stateful-ops on Feb 4, 2026

Conversation

@shwina
Contributor

@shwina shwina commented Feb 4, 2026

Description

Closes #7498.

See the issue above for design/rationale.

Performance

As described in the issue, inspecting and extracting state adds roughly 1 µs of overhead to every invocation. This can and should be eliminated for stateless operations.
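
For illustration, a fast path along these lines could skip the per-call inspection entirely; the helper names below are assumptions for this sketch, not the code in this PR:

def _captures_device_arrays(func) -> bool:
    """True if func references any object exposing __cuda_array_interface__."""
    candidates = [func.__globals__.get(name) for name in func.__code__.co_names]
    for cell in func.__closure__ or ():
        try:
            candidates.append(cell.cell_contents)
        except ValueError:  # unfilled closure cell
            pass
    return any(hasattr(v, "__cuda_array_interface__") for v in candidates if v is not None)

# Classify once when the op adapter is built; only stateful callables then
# pay the state-extraction cost on each invocation.
def classify_op(func) -> str:
    return "stateful" if _captures_device_arrays(func) else "stateless"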

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested review from a team as code owners February 4, 2026 16:01
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 4, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 4, 2026


class _StatelessOp(OpAdapter):
"""Adapter for stateles callables."""
Contributor Author

typo


Then the transformed function will be:

def func(x, state): return x + state[0]
Contributor Author

Maybe state should go first.

Contributor Author

Done
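
For context, a before/after sketch of the shape of that transform (the closure and array below are made up for illustration; the actual rewriting is done by the JIT machinery in this PR):

import cupy as cp

offsets = cp.asarray([10, 20, 30])

# What the user writes: a callable closing over a device array.
def op(x):
    return x + offsets[0]

# What the stateful-op machinery conceptually produces: the captured array is
# replaced by an explicit state argument (state first, per the comment above),
# so only the state needs refreshing between calls, not the compiled code.
def func(state, x):
    return x + state[0]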

"""
import struct

from . import types as cccl_types
Contributor Author

Do these imports need to be here?

Contributor Author

No (fixed)

Returns:
A value of an appropriate subtype of _BaseOp
"""
from ._jit import to_jit_op_adapter
Contributor Author

Try to move this out because it's a non-trivial cost.

Contributor Author

Unfortunately, _jit needs to import OpAdapter to subclass it. I think we should make it a protocol so that _jit.py won't need to import from here.

Contributor Author

xref: #7503
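
A sketch of the protocol idea; the method names below are invented for illustration and are not the real OpAdapter interface:

from typing import Protocol, runtime_checkable

@runtime_checkable
class OpAdapterLike(Protocol):
    """Structural stand-in for OpAdapter."""
    def build(self): ...
    def state(self) -> bytes: ...

# _jit.py could define its adapter without importing OpAdapter at all;
# it only needs to match the protocol structurally.
class JitOpAdapter:
    def build(self):
        return None
    def state(self) -> bytes:
        return b""

assert isinstance(JitOpAdapter(), OpAdapterLike)  # duck-typed, no inheritance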

Contributor Author

What about other algorithms like merge_sort?

Contributor Author

Done


Contributor

@NaderAlAwar NaderAlAwar left a comment

Approving, but we must revisit and measure the impact of constructing op adapters on every __call__.

# by stateful op machinery, which enables updating the state
# (pointers). Thus, we only cache on the dtype and shape of
# the referenced array, but not its pointer.
return (get_dtype(value), get_shape(value))
Contributor

Question: Remind me again why we cache on shape?

Contributor Author

Here's an (admittedly made up) example for when it could matter:

import cupy as cp
import cuda.compute as cc

def make_op(arr):
    def op(x):
        return x > len(arr)
    return op

d_in = cp.asarray([1, 2, 3])
d_out = cp.empty_like(d_in, dtype=bool)
op1 = make_op(cp.empty(1))  # len(arr) == 1
op2 = make_op(cp.empty(2))  # len(arr) == 2

>>> cc.unary_transform(d_in, d_out, op1, len(d_in)); print(d_out)
[False  True  True]

>>> cc.unary_transform(d_in, d_out, op2, len(d_in)); print(d_out)
[False False  True]

# Without including the shape in the cache key, the second call would reuse
# the build cached for op1 and also print `[False  True  True]`.

Contributor Author

Added this as a unit test.


for name in code.co_names:
    val = func.__globals__.get(name)
    if val is not None and hasattr(val, "__cuda_array_interface__"):
Contributor

Important: use the newly added is_device_array instead of hasattr

Contributor Author

Done
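
For reference, a best-guess sketch of what such a check could look like (the actual is_device_array helper in cuda.compute may accept more or fewer kinds of objects):

def is_device_array(obj) -> bool:
    # Recognize CUDA Array Interface producers, plus DLPack producers that
    # report a CUDA device (kDLCUDA == 2).
    if hasattr(obj, "__cuda_array_interface__"):
        return True
    dlpack_device = getattr(obj, "__dlpack_device__", None)
    return dlpack_device is not None and dlpack_device()[0] == 2

The loop above then reads `if val is not None and is_device_array(val):` instead of checking the attribute directly.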

@shwina shwina enabled auto-merge (squash) February 4, 2026 21:13
@github-actions
Contributor

github-actions bot commented Feb 4, 2026

🥳 CI Workflow Results

🟩 Finished in 2h 21m: Pass: 100%/56 | Total: 19h 22m | Max: 1h 05m

See results here.

@shwina shwina merged commit 246bb41 into NVIDIA:main Feb 4, 2026
74 of 75 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Feb 4, 2026

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

cuda.compute: Stateful operators are slow because they almost always need recompilation

2 participants