Expose virtual columns from the Arrow Parquet reader in datasource-parquet#20133
Draft
jkylling wants to merge 1 commit into apache:main from
Conversation
It would be useful to expose the virtual columns of the Arrow Parquet reader, added in apache/arrow-rs#8715, in the datasource-parquet `ParquetSource`. Engines could then use both DataFusion's partition value machinery and the virtual columns. I made a go at it in this PR, but hit some rough edges. This is closer to an issue than a PR, but it is easier to explain with code.

The virtual columns we added are a bit difficult to integrate cleanly today. They are part of the physical schema of the Parquet reader, but cannot currently be projected. We need additional handling to avoid predicate pushdown for virtual columns, to build the correct projection mask, and to build the correct stream schema. See the changes to `opener.rs` in this PR.

One alternative would be to modify the arrow-rs implementation to remove these workarounds. Then the only change to `opener.rs` would be `.with_virtual_columns(virtual_columns.to_vec())?` (and maybe even that could be avoided? See the discussion below).

What would be the best way forward here?

Related to #20132

Aside on `.with_virtual_columns`

It is redundant that the user needs to both specify `Field::new("row_index", DataType::Int64, false).with_extension_type(RowNumber)` and add the column in a special way to the reader options with `.with_virtual_columns(virtual_columns.to_vec())?`. When the extension type `RowNumber` is added, we already know that the column is virtual.

All users of `TableSchema`/`ParquetSource` must know that a schema is built out of three parts: the physical Parquet columns, the virtual columns, and the partition columns. From the user's perspective, they would just like to supply a schema.

One alternative is to indicate the column kind using only extension types, so that the user supplies nothing but a schema. That is, there would be an extension type indicating that a column is a partition column or a virtual column, instead of the user supplying this information piecemeal. This may have a performance impact, as we would likely need to extract the different extension-type columns during planning, which could be problematic for large schemas.

Signed-off-by: Jonas Irgens Kylling <jkylling@gmail.com>
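As a rough sketch of the "only extension types" alternative, the snippet below marks columns as physical, virtual, or partition purely through extension-type metadata on the `Field`, and shows the kind of per-field scan that planning would need to do. The extension names `datafusion.virtual` and `datafusion.partition` are made up for illustration; only the `ARROW:extension:name` metadata key itself is standard Arrow, and nothing here reflects an agreed design.

```rust
use std::collections::HashMap;

use arrow_schema::{DataType, Field, Schema};

// Hypothetical extension names; only the `ARROW:extension:name` key is part of
// the Arrow spec.
const VIRTUAL_EXT: &str = "datafusion.virtual";
const PARTITION_EXT: &str = "datafusion.partition";

fn mark(field: Field, ext: &str) -> Field {
    field.with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        ext.to_string(),
    )]))
}

fn main() {
    // The user supplies a single schema; the column kind is carried by the
    // extension type instead of being passed piecemeal to the source.
    let schema = Schema::new(vec![
        Field::new("value", DataType::Int64, true),
        mark(Field::new("row_index", DataType::Int64, false), VIRTUAL_EXT),
        mark(Field::new("date", DataType::Utf8, false), PARTITION_EXT),
    ]);

    // During planning the source would have to classify every column, which is
    // the per-field scan flagged above as a possible cost for very wide schemas.
    for field in schema.fields() {
        let kind = match field
            .metadata()
            .get("ARROW:extension:name")
            .map(String::as_str)
        {
            Some(VIRTUAL_EXT) => "virtual",
            Some(PARTITION_EXT) => "partition",
            _ => "physical",
        };
        println!("{}: {kind}", field.name());
    }
}
```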
jkylling commented on Feb 3, 2026
```rust
assert_eq!(row_index_values, vec![2, 3, 4]);
}

// Test 2: Filter on virtual column does not have predicate pushdown
```
Author
No filtering on virtual columns in the Parquet source.
Contributor
What I recommend is that we figure out what the high-level API will look like (i.e., how someone will query this via SQL and/or the DataFrame API). Then we can expose the relevant APIs in the Parquet reader and other datasources as appropriate. See also the discussion on
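To make the high-level API question concrete, here is a purely hypothetical sketch of what querying a virtual column from SQL might eventually look like. The table name `t`, the file path, and the availability of a virtual `row_index` column on the registered table are all assumptions; none of this works today and it is not a proposed design.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical: the registered Parquet table would expose a virtual
    // `row_index` column alongside its physical and partition columns.
    ctx.register_parquet("t", "data/example.parquet", ParquetReadOptions::default())
        .await?;

    // A user would then expect to select and filter on it like any other
    // column, with the source deciding internally that predicates on virtual
    // columns cannot be pushed down to the Parquet decoder.
    let df = ctx
        .sql("SELECT row_index, value FROM t WHERE value > 10")
        .await?;
    df.show().await?;
    Ok(())
}
```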