Expose virtual columns from the Arrow Parquet reader in datasource-parquet#20133
Draft
jkylling wants to merge 1 commit into apache:main from
Conversation
It would be useful to expose the virtual columns of the Arrow Parquet reader, added in apache/arrow-rs#8715, in the datasource-parquet `ParquetSource`. Engines could then use both DataFusion's partition value machinery and the virtual columns. I made a go at it in this PR, but hit some rough edges. This is closer to an issue than a PR, but it is easier to explain with code.

The virtual columns we added are a bit difficult to integrate cleanly today. They are part of the physical schema of the Parquet reader, but cannot currently be projected. We need additional handling to avoid predicate pushdown for virtual columns, to build the correct projection mask, and to build the correct stream schema. See the changes to `opener.rs` in this PR.

One alternative would be to modify the arrow-rs implementation to remove these workarounds. Then the only change to `opener.rs` would be `.with_virtual_columns(virtual_columns.to_vec())?` (and maybe even that could be avoided? See the discussion below).

What would be the best way forward here?

Related to #20132

Aside on `.with_virtual_columns`

It is redundant that the user needs to both specify `Field::new("row_index", DataType::Int64, false).with_extension_type(RowNumber)` and add the column in a special way to the reader options with `.with_virtual_columns(virtual_columns.to_vec())?`. When the extension type `RowNumber` is added, we already know that the column is virtual.

All users of `TableSchema`/`ParquetSource` must know that a schema is built out of three parts: the physical Parquet columns, the virtual columns, and the partition columns. From the user's perspective, they would just like to supply a schema.

One alternative is to indicate the column kind using only extension types, so that the user supplies nothing but a schema. That is, there would be an extension type indicating that a column is a partition column or a virtual column, instead of the user supplying this information piecemeal. This may have a performance impact, as we would likely need to extract the different extension-type columns during planning, which could be problematic for large schemas.

Signed-off-by: Jonas Irgens Kylling <jkylling@gmail.com>
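As a rough sketch of the "only extension types" alternative, the snippet below marks columns as physical, virtual, or partition purely through extension-type metadata on the `Field`, and shows the kind of per-field scan that planning would need to do. The extension names `datafusion.virtual` and `datafusion.partition` are made up for illustration; only the `ARROW:extension:name` metadata key itself is standard Arrow, and nothing here reflects an agreed design.

```rust
use std::collections::HashMap;

use arrow_schema::{DataType, Field, Schema};

// Hypothetical extension names; only the `ARROW:extension:name` key is part of
// the Arrow spec.
const VIRTUAL_EXT: &str = "datafusion.virtual";
const PARTITION_EXT: &str = "datafusion.partition";

fn mark(field: Field, ext: &str) -> Field {
    field.with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        ext.to_string(),
    )]))
}

fn main() {
    // The user supplies a single schema; the column kind is carried by the
    // extension type instead of being passed piecemeal to the source.
    let schema = Schema::new(vec![
        Field::new("value", DataType::Int64, true),
        mark(Field::new("row_index", DataType::Int64, false), VIRTUAL_EXT),
        mark(Field::new("date", DataType::Utf8, false), PARTITION_EXT),
    ]);

    // During planning the source would have to classify every column, which is
    // the per-field scan flagged above as a possible cost for very wide schemas.
    for field in schema.fields() {
        let kind = match field
            .metadata()
            .get("ARROW:extension:name")
            .map(String::as_str)
        {
            Some(VIRTUAL_EXT) => "virtual",
            Some(PARTITION_EXT) => "partition",
            _ => "physical",
        };
        println!("{}: {kind}", field.name());
    }
}
```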
jkylling commented on Feb 3, 2026
```rust
assert_eq!(row_index_values, vec![2, 3, 4]);
}

// Test 2: Filter on virtual column does not have predicate pushdown
```
Author
No filtering on virtual columns in the Parquet source.
Contributor
What I recommend is that we figure out what the high-level API will look like (i.e., how someone will query this via SQL and/or the DataFrame API). Then we can expose the relevant APIs in the Parquet reader and other datasources as appropriate. See also the discussion on
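To make the high-level API question concrete, here is a purely hypothetical sketch of what querying a virtual column from SQL might eventually look like. The table name `t`, the file path, and the availability of a virtual `row_index` column on the registered table are all assumptions; none of this works today and it is not a proposed design.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical: the registered Parquet table would expose a virtual
    // `row_index` column alongside its physical and partition columns.
    ctx.register_parquet("t", "data/example.parquet", ParquetReadOptions::default())
        .await?;

    // A user would then expect to select and filter on it like any other
    // column, with the source deciding internally that predicates on virtual
    // columns cannot be pushed down to the Parquet decoder.
    let df = ctx
        .sql("SELECT row_index, value FROM t WHERE value > 10")
        .await?;
    df.show().await?;
    Ok(())
}
```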