feat[cuda]: export arrays to ArrowDeviceArray #6253
Conversation
Merging this PR will not alter performance
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
0ax1 left a comment:
Couple of questions / remarks
```yaml
- name: Download and run cudf-test-harness
  run: |
    curl -fsSL https://github.com/vortex-data/cudf-test-harness/releases/latest/download/cudf-test-harness-x86_64.tar.gz | tar -xz
    cd cudf-test-harness-x86_64
```
nit: I basically try to never `cd` in shell scripts, treating the current pwd as state. Obviously not super important; naturally keep as is if you like.
```yaml
  run: |
    curl -fsSL https://github.com/vortex-data/cudf-test-harness/releases/latest/download/cudf-test-harness-x86_64.tar.gz | tar -xz
    cd cudf-test-harness-x86_64
    ./cudf-test-harness check $GITHUB_WORKSPACE/target/x86_64-unknown-linux-gnu/debug/libvortex_cudf_test.so
```
Should we leave a comment here to briefly hint at what is happening with the test harness and why it lives in a separate repo?
```rust
});

/// External array
#[unsafe(no_mangle)]
```
nit: Could throw in more docs here. What does external mean here exactly, and is `export_array` part of the Arrow device ABI, etc.?
This is all just test code. I can add a comment to make that clearer.
```rust
let null_count = match validity {
    Validity::NonNullable | Validity::AllValid => 0,
    Validity::AllInvalid => len,
```
Early return for the all-invalid case?
E.g. we're doing

```rust
if matches!(values.validity()?, Validity::AllInvalid) {
    return ConstantArray::new(Scalar::null(values.dtype().clone()), output_len)
        .to_canonical();
}
```

in the run-end case.
This was actually an oversight; to start with, I think we should avoid handling any nulls whatsoever for now.
```rust
// We need the children to be held across await points.
let mut children = Vec::with_capacity(fields.len());

for field in fields.iter() {
```
How about sth along the lines of:

```rust
let futures = fields
    .iter()
    .map(|field| {
        let field = field.clone();
        async move {
            let cuda_field = field.execute_cuda(ctx).await?;
            export_canonical(cuda_field, ctx).await
        }
    })
    .collect::<Vec<_>>();
let results = join_all(futures).await;
```

If fields live on different streams, they run in parallel.
are fields going to live on different streams? if so, things might get a little weird with the SyncEvent stuff
Specifically: the contract for the returned ArrowDeviceArray is that its sync_event field, if not NULL, contains a pointer to a `cudaEvent_t`, which AFAIU can only be recorded on one stream.
Similarly, the CudaExecutionCtx is linked to only one global stream right now. So assuming that `StructArray::execute_cuda(ctx)` gets called, all of its fields should be getting executed with the same context, and thus the same stream.
```rust
let buffer = if buffer.is_on_device() {
    buffer
} else {
    // TODO(aduffy): I don't think this type parameter does anything
```
I think we need `T` here to know what type is returned from the CUDA alloc:

```rust
let mut cuda_slice: CudaSlice<T> = self.device_alloc(host_slice.len())?;
```

Should we put a `PhantomData` marker on `BufferHandle` instead?
```rust
for child in children {
    release_array(child);
}
drop(private_data);
```
Why the explicit drop at the end of the scope?
I guess it's not strictly necessary, but I like being explicit here: the private_data owns all of the resources allocated on the Rust/cudarc side, so dropping it signals that the array is actually freed here.
```rust
}

unsafe impl Send for ArrowDeviceArray {}
unsafe impl Sync for ArrowDeviceArray {}
```
We can drop the Sync impl I think.
```rust
use crate::CudaExecutionCtx;

#[derive(Debug, Copy, Clone)]
```
Should we gen this with bindgen?
I'd prefer to leave bindgen out of this one, since ArrowArray and ArrowDeviceArray cannot be trivially bindgen'ed: they rely on some cudarc types.
I based these definitions on the ones in https://github.com/apache/arrow-rs/blob/main/arrow-data/src/ffi.rs#L27, with the addition of the device fields.
Put this in vortex-test-e2e-cuda?
```rust
pub(crate) struct CanonicalDeviceArrayExport;

#[async_trait]
impl ExportDeviceArray for CanonicalDeviceArrayExport {
```
Would this logic ever differ from the Arrow export with a different execute function?
```rust
/// event that the client must wait on.
#[repr(C)]
#[derive(Debug)]
pub struct ArrowDeviceArray {
```
Can we use a regular Arrow array instead of this? Could we share logic between this exporter and the Arrow export if we could pass around a different executor?
This PR adds scaffolding for the first part of implementing export of Vortex arrays directly to Arrow C Data Device Interface arrays, and implements the actual export for `StructArray` and `PrimitiveArray`.
Setup
Arrow defines a standard for exchanging data in-process with native code, the C Data Interface, which represents an Arrow array as a struct of pointers.
There is an (experimental, but mostly stable) extension called the C Data Device Interface, which adds extra information such as the GPU device type and device ID.
cudf knows how to import data exported this way using the `from_arrow_device_column` method for `column_view`s.
We replicate the `ArrowArray` and `ArrowDeviceArray` structs from the spec. These are very similar to the existing `FFI_ArrowArray` and `FFI_ArrowSchema` structs from the `arrow-data` crate, except that those use arrow `Bytes` and real pointers, whereas we need to use CUDA device pointers and `BufferHandle`s.
Testing
Rather than polluting the repo/CI with fetching and building cudf, I made a new repo: https://github.com/vortex-data/cudf-test-harness/
It loads a cdylib, calls its `export_array()` function, imports the result into cudf, calls some simple functions, and then frees the array by calling the embedded `release` function pointer. This step is hooked up to CI now and only adds ~1m to the CUDA tests jobs.