-
Notifications
You must be signed in to change notification settings - Fork 130
feat: implement DuckDB filesystem integration for Vortex file handling #6198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
vortex-duckdb/src/scan.rs
Outdated
| "http" | "https" | "s3" => { | ||
| let reader = open_duckdb_reader(client_ctx, &url); | ||
|
|
||
| // Fallback to the legacy object_store path for s3 if DuckDB fs isn't configured. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when will that happen? I would really love to get rid of object_store here, it has some underlying dependencies on tokio that can cause some weird issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HTTP/HTTPS/S3 now route solely through DuckDB FS; legacy object_store fallback removed, and failures now bubble as errors if httpfs ext of DuckDB is not initialized.
CodSpeed Performance ReportMerging this PR will not alter performanceComparing Summary
Footnotes
|
0ax1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to look into this. There's a couple of issues around thread safety / locking, error handling and memory leaks.
If you used AI to assist in writing the code for your PR please mention this in your PR description (see: https://github.com/vortex-data/vortex/blob/develop/CONTRIBUTING.md).
| if (!error_out) { | ||
| return; | ||
| } | ||
| *error_out = duckdb_vx_error_create(message.data(), message.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does this error get freed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ allocates errors via duckdb_vx_error_create; Rust side consistently frees with duckdb_vx_error_free in fs_error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest moving
static void SetError(duckdb_vx_error *error_out, const std::string &message) {
if (!error_out) {
return;
}
*error_out = duckdb_vx_error_create(message.data(), message.size());
}
static duckdb_state HandleException(std::exception_ptr ex, duckdb_vx_error *error_out) {
if (!ex) {
SetError(error_out, "Unknown error");
return DuckDBError;
}
try {
std::rethrow_exception(ex);
} catch (const Exception &caught) {
SetError(error_out, caught.what());
} catch (const std::exception &caught) {
SetError(error_out, caught.what());
} catch (...) {
SetError(error_out, "Unknown error");
}
return DuckDBError;
}
to error.cpp as they're not specific to file system logic.
vortex-duckdb/src/copy.rs
Outdated
| pub struct VortexCopyFunction; | ||
|
|
||
| #[derive(Clone, Copy)] | ||
| struct SendableClientCtx(usize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we encode the pointer as a usize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced usize casting with NonNull<duckdb_vx_client_context_> (SendableClientCtx) and pass the raw pointer via as_ptr(). This avoids truncation/aliasing issues from integer casts and preserves provenance; all FFI touchpoints now use the typed wrapper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we use duckdb_vx_client_context? We typedef pointers to make them typesafe: typedef struct duckdb_vx_client_context_ *duckdb_vx_client_context;.
| pub struct VortexCopyFunction; | ||
|
|
||
| #[derive(Clone, Copy)] | ||
| struct SendableClientCtx(NonNull<cpp::duckdb_vx_client_context_>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's 2 definitions of SendableClientCtx and there's an existing ClientContext definition in vortex-duckdb/src/duckdb/client_context.rs. Implement the SendableClientCtx functionality on the existing ClientContext.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you add send + sync to ClientContext in client_context.rs add an explanation to the SAFETY comments why it is thread safe https://github.com/duckdb/duckdb/blob/e5fb0a7eab6bc6fd7f26bd5b8cd74c3441f3895b/src/include/duckdb/main/client_context.hpp#L67.
| // SPDX-FileCopyrightText: Copyright the Vortex contributors | ||
|
|
||
| use futures::StreamExt; | ||
| #[cfg(test)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of glob.rs is unused at this point. We can drop the whole file.
| const DEFAULT_CONCURRENCY: usize = 64; | ||
|
|
||
| lifetime_wrapper!(FsFileHandle, cpp::duckdb_vx_file_handle, cpp::duckdb_vx_fs_close, [owned, ref]); | ||
| unsafe impl Send for FsFileHandle {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked up in the meantime whether DuckDB file handles are thread safe. There's no synchronization happening in FileHandle: https://github.com/duckdb/duckdb/blob/e5fb0a7eab6bc6fd7f26bd5b8cd74c3441f3895b/src/include/duckdb/common/file_system.hpp#L71
It is though possible to use parallel reads/writes. The requirement is that all operations are position based Read(buffer, size, location) Write(buffer, size, location) and not cursor based and therefore stateful.
The write therefore needs to change away from the Cursor based approach.
wrapper->handle->Seek(offset);
wrapper->handle->Write(const_cast<uint8_t *>(buffer), len);
FILE_FLAGS_PARALLEL_ACCESS: https://github.com/duckdb/duckdb/blob/e5fb0a7eab6bc6fd7f26bd5b8cd74c3441f3895b/src/include/duckdb/common/file_open_flags.hpp#L117.
Please double check all operations on FSFileHandle whether they statefully change the handle. And make sure to use FILE_FLAGS_PARALLEL_ACCESS.
| } | ||
|
|
||
| /// A VortexReadAt implementation backed by DuckDB's filesystem (e.g., httpfs/s3). | ||
| pub struct DuckDbFsReadAt { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming this to DuckDbFsReader?
| } | ||
| } | ||
|
|
||
| // SAFETY: DuckDB file handles are thread-safe for reads; writes are done via DuckDbFsWriter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to adjust these SAFETY comments to reflect under which conditions the handle is safe to use across threads. Please double check my previous statements in ref with the DuckDB source code: https://github.com/duckdb/duckdb/blob/e5fb0a7eab6bc6fd7f26bd5b8cd74c3441f3895b/src/include/duckdb/common/file_system.hpp#L71.
I tested it locally