cuda.compute: Use native CCCL.c support for stateful ops #7500

shwina merged 13 commits into NVIDIA:main
Conversation
```python
class _StatelessOp(OpAdapter):
    """Adapter for stateless callables."""
```

```python
    Then the transformed function will be:

        def func(x, state): return x + state[0]
```

Review comment: Maybe `state` should go first.
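A minimal sketch of the kind of transformation discussed here, assuming nothing beyond the `(x, state)` signature shown above; `adapt_stateless` is a hypothetical helper, not an actual cuda.compute API:

```python
# Illustrative sketch only: wrap a stateless callable so it matches
# the stateful (x, state) signature. `adapt_stateless` is hypothetical.
def adapt_stateless(op):
    """Adapt a stateless callable to the stateful calling convention."""
    def wrapped(x, state):
        # Stateless ops simply ignore the state tuple.
        return op(x)
    return wrapped

inc = adapt_stateless(lambda x: x + 1)
assert inc(41, state=()) == 42
```

Whether `state` comes first or last is exactly the argument-order question raised in the comment above.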
| """ | ||
| import struct | ||
|
|
||
| from . import types as cccl_types |
There was a problem hiding this comment.
Do these imports need to be here?
```python
    Returns:
        A value with appropriate subtype of _BaseOp
    """
    from ._jit import to_jit_op_adapter
```

Review comment: Try to move this out because it's a non-trivial cost.

Reply: Unfortunately, `_jit` needs to import `OpAdapter` to subclass it. I think we should make it a protocol so that `_jit.py` won't need to import from here.
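A sketch of what that protocol suggestion could look like, assuming a structural `typing.Protocol`; the names `OpAdapterLike` and `compile` are illustrative, not the library's real interface:

```python
# Sketch: making the adapter interface a Protocol so that _jit.py
# can conform to it without importing the adapters module.
from typing import Protocol, runtime_checkable

@runtime_checkable
class OpAdapterLike(Protocol):
    """Structural interface: any class with a matching method conforms."""
    def compile(self, signature: str): ...

# Defined with no import of the module that declares OpAdapterLike:
class JitOpAdapter:
    def compile(self, signature: str):
        return f"compiled({signature})"

assert isinstance(JitOpAdapter(), OpAdapterLike)
```

Because conformance is structural, the import dependency runs only one way, which is the point of the suggestion.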
Review comment: What about other algorithms like `merge_sort`?
NaderAlAwar left a comment:

Approving, but we must revisit and measure the impact of creating op adapters in every `__call__`.
```python
            # by stateful op machinery, which enables updating the state
            # (pointers). Thus, we only cache on the dtype and shape of
            # the referenced array, but not its pointer.
            return (get_dtype(value), get_shape(value))
```

Question: Remind me again why we cache on shape?
Reply: Here's an (admittedly made-up) example of when it could matter:

```python
import numpy as np
import cupy as cp
import cuda.compute as cc

def make_op(arr):
    def op(x):
        return x > len(arr)
    return op

d_in = cp.asarray([1, 2, 3])
d_out = cp.empty_like(d_in, dtype=bool)

op1 = make_op(cp.empty(1))  # len(arr) == 1
op2 = make_op(cp.empty(2))  # len(arr) == 2

cc.unary_transform(d_in, d_out, op1, len(d_in))
print(d_out)  # [False  True  True]

cc.unary_transform(d_in, d_out, op2, len(d_in))
print(d_out)  # [False False  True]
```

Without including the shape in the cache key, the second call would reuse the build cached for `op1` and also print `[False  True  True]`.

Added this as a unit test.
```python
        for name in code.co_names:
            val = func.__globals__.get(name)
            if val is not None and hasattr(val, "__cuda_array_interface__"):
```

Review comment: Important: use the newly added `is_device_array` instead of `hasattr`.
🥳 CI Workflow Results: 🟩 Finished in 2h 21m — Pass: 100%/56 | Total: 19h 22m | Max: 1h 05m
Description
Closes #7498.
See the issue above for design/rationale.
Performance
As described in the issue, the cost of inspecting and extracting state adds roughly 1 µs of overhead to every invocation. This can and should be eliminated for stateless operations.
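A sketch of how the per-invocation overhead could be measured; `inspect_state` is a hypothetical stand-in for the state-extraction path, not a cuda.compute function:

```python
# Hypothetical micro-benchmark sketch. `inspect_state` is a placeholder
# for the per-call state-inspection work, not a real cuda.compute API.
import timeit

def inspect_state(func):
    # Placeholder: walk the referenced globals, as the extraction path does.
    return [func.__globals__.get(n) for n in func.__code__.co_names]

def op(x):
    return x + 1

n = 100_000
per_call = timeit.timeit(lambda: inspect_state(op), number=n) / n
print(f"{per_call * 1e6:.2f} us per call")
```

A benchmark like this, run with and without the extraction step, would quantify the overhead that the stateless fast path is meant to remove.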
Checklist