Skip to content

Conversation

@a10y
Copy link
Contributor

@a10y a10y commented Feb 2, 2026

This PR adds scaffolding for the first part of implementing export of Vortex Arrays directly to Arrow C Data Device Interface arrays, and implements the actual export for StructArray and PrimitiveArray

Setup

Arrow defines a standard for exchanging data in-process to native code using the C Data Interface, which converts an Arrow array into a struct of pointers.

There is an (experimental, but mostly stable) extension called the C Data Device Interface which includes extra information such as the GPU device type and device ID.

cudf knows how to import data exported this way using the from_arrow_device_column method for column_views.

We replicate the ArrowArray and ArrowDeviceArray structs from the spec.

These are very similar to the existing FFI_ArrowArray and FFI_ArrowSchema structs from the arrow-data crate, except that those use arrow Bytes and real pointers and we need to use CUDA device pointers and BufferHandles.

Testing

Rather than polluting the repo/CI with fetching and building cudf, I made a new repo https://github.com/vortex-data/cudf-test-harness/

This will load a cdylib, call its export_array() function, imports it into cudf and calls some simple functions, then frees it by calling the embedded release function pointer. This step is hooked up to CI now and only adds ~1m to the CUDA tests jobs.

@a10y a10y force-pushed the aduffy/c-device-iface branch from 9d62079 to 18f8a3d Compare February 2, 2026 20:05
@codspeed-hq
Copy link

codspeed-hq bot commented Feb 3, 2026

Merging this PR will not alter performance

✅ 1138 untouched benchmarks
⏩ 1265 skipped benchmarks1


Comparing aduffy/c-device-iface (a06184d) with develop (6b69f19)

Open in CodSpeed

Footnotes

  1. 1265 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@a10y a10y force-pushed the aduffy/c-device-iface branch from 3a8ddb6 to a2d1c95 Compare February 3, 2026 17:19
@a10y a10y added the changelog/feature A new feature label Feb 3, 2026
@a10y a10y changed the title WIP: initial work toward export to device interface feat[cuda]: export arrays to ArrowDeviceArray Feb 3, 2026
@a10y a10y force-pushed the aduffy/c-device-iface branch 2 times, most recently from 497d765 to 79b7afc Compare February 3, 2026 17:59
@a10y a10y marked this pull request as ready for review February 3, 2026 18:19
a10y added 10 commits February 3, 2026 13:28
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/c-device-iface branch from 74549dc to 930f824 Compare February 3, 2026 18:28
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y force-pushed the aduffy/c-device-iface branch from 930f824 to b489712 Compare February 3, 2026 18:29
a10y added 2 commits February 3, 2026 13:30
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y requested review from 0ax1 and joseph-isaacs February 3, 2026 18:49
a10y and others added 2 commits February 3, 2026 16:06
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Copy link
Contributor

@0ax1 0ax1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of questions / remarks

- name: Download and run cudf-test-harness
run: |
curl -fsSL https://github.com/vortex-data/cudf-test-harness/releases/latest/download/cudf-test-harness-x86_64.tar.gz | tar -xz
cd cudf-test-harness-x86_64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I basically try to never cd in shell scripts, considering the current pwd as state. obs not super important, naturally keep as is if you like.

run: |
curl -fsSL https://github.com/vortex-data/cudf-test-harness/releases/latest/download/cudf-test-harness-x86_64.tar.gz | tar -xz
cd cudf-test-harness-x86_64
./cudf-test-harness check $GITHUB_WORKSPACE/target/x86_64-unknown-linux-gnu/debug/libvortex_cudf_test.so
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we leave a comment here to briefly hint at what is happening here with the test harness and why it lives in a separate repo?

});

/// External array
#[unsafe(no_mangle)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could throw in more docs here. What does external mean here exactly, and is export_array part of the arrow device ABI etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is all just test code. I can comment to make that clearer


let null_count = match validity {
Validity::NonNullable | Validity::AllValid => 0,
Validity::AllInvalid => len,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early return all invalid case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g we're doing

        if matches!(values.validity()?, Validity::AllInvalid) {
            return ConstantArray::new(Scalar::null(values.dtype().clone()), output_len)
                .to_canonical();
        }

in the runend case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually an oversight and I think to start with, I actually think we should avoid handling any nulls whatsoever for now

// We need the children to be held across await points.
let mut children = Vec::with_capacity(fields.len());

for field in fields.iter() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about sth along the lines of:

let futures = fields
    .iter()
    .map(|field| {
        let field = field.clone();
        async move {
            let cuda_field = field.execute_cuda(ctx).await?;
            export_canonical(cuda_field, ctx).await
        }
    })
    .collect::<Vec<_>>();

let results = join_all(futures).await;

If fields live on different streams, they run in parallel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are fields going to live on different streams? if so, things might get a little weird with the SyncEvent stuff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically: the contract for the returned ArrowDeviceArray is that its sync_event field, if not NULL, contains a pointer to a cudaStream_t which AFAIU can only be linked to one stream.

Similarly the CudaExecutionCtx is linked to only one global stream rn. So assuming that StructArray::execute_cuda(ctx) gets called, all of its fields should be getting executed with the same context, and thus the same stream.

let buffer = if buffer.is_on_device() {
buffer
} else {
// TODO(aduffy): I don't think this type parameter does anything
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need T here to know what type is returned from cuda alloc.
let mut cuda_slice: CudaSlice<T> = self.device_alloc(host_slice.len())?;

Should we put a phantom data marker on BufferHandle instead?

for child in children {
release_array(child);
}
drop(private_data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the explicit drop at the end of the scope?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess not strictly necessary but i like being explicit here. the private_data owns all of the resources allocated on the rust/cudarc side so dropping it signals that the array is actually freed here

}

unsafe impl Send for ArrowDeviceArray {}
unsafe impl Sync for ArrowDeviceArray {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can drop the Sync impl I think.


use crate::CudaExecutionCtx;

#[derive(Debug, Copy, Clone)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we gen this with bindgen?

Copy link
Contributor Author

@a10y a10y Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd prefer to leave bindgen out of this one, since ArrowArray and ArrowDeviceArray cannot be trivially bindgen'ed since they rely on some cudarc types.

I based these definitions off of the ones in https://github.com/apache/arrow-rs/blob/main/arrow-data/src/ffi.rs#L27, with the addition of device stuff

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this inn vortex-test-e2e-cuda?

pub(crate) struct CanonicalDeviceArrayExport;

#[async_trait]
impl ExportDeviceArray for CanonicalDeviceArrayExport {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this logic every differ from export arrow with a different execute function?

/// event that the client must wait on.
#[repr(C)]
#[derive(Debug)]
pub struct ArrowDeviceArray {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use a regular arrow array instead of this?

@joseph-isaacs
Copy link
Contributor

Could we share logic between this exporter and the arrow export if we could pass around a different executor?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants