Add more struct pushdown tests and planning benchmark #20143
adriangb merged 2 commits into apache:main from
Conversation
run benchmarks sql_planner
🤖 Hi @adriangb, thanks for the request (#20143 (comment)).
Please choose one or more of these with

run benchmark sql_planner
🤖

show benchmark queue
🤖 Hi @adriangb, you asked to view the benchmark queue (#20143 (comment)).
🤖: Benchmark completed. Details

Confirms the benchmarks run correctly!
    .unwrap()
}

/// Create a table provider with a struct column: `id` (Int32) and `props` (Struct { value: Int32, label: Utf8 })
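For reference, a schema with the struct column described in this doc comment could be built with arrow-rs roughly as follows. This is a sketch keyed off the field names in the comment; the helper name is made up and is not the benchmark's actual code.

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Fields, Schema};

/// Sketch of the schema described above: `id` (Int32) and
/// `props` (Struct { value: Int32, label: Utf8 }).
/// Hypothetical helper, not the benchmark's actual code.
fn struct_bench_schema() -> Arc<Schema> {
    let props = Fields::from(vec![
        Field::new("value", DataType::Int32, true),
        Field::new("label", DataType::Utf8, true),
    ]);
    Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("props", DataType::Struct(props), true),
    ]))
}
```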
logical_plan
01)Projection: get_field(simple_struct.s, Utf8("label"))
02)--TableScan: simple_struct projection=[s]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[get_field(s@1, label) as simple_struct.s[label]], file_type=parquet
So projection=[get_field(s@1, label)] means that we extract the label field from the s struct column as early as possible, at the scan level?
Yep! In this case it "just works" because there's no filter, etc. in the way
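For readers following along, a plan like the one quoted above can be reproduced with a small DataFusion program along these lines. This is a sketch: the parquet path, table name, and #[tokio::main] setup are assumptions, not taken from the test files.

```rust
use datafusion::error::Result;
use datafusion::physical_plan::displayable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Assumption: a local parquet file with an `id` column and a struct column `s`.
    ctx.register_parquet("simple_struct", "simple.parquet", ParquetReadOptions::default())
        .await?;

    // Same shape of query as the test above: project a single struct field.
    let df = ctx.sql("SELECT s['label'] FROM simple_struct").await?;
    let plan = df.create_physical_plan().await?;

    // With no filter in the way, the get_field projection ends up directly
    // on the DataSourceExec, matching the plan quoted above.
    println!("{}", displayable(plan.as_ref()).indent(true));
    Ok(())
}
```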
physical_plan
01)HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(simple_struct.s[value]@2, join_right.s[level] * Int64(10)@2)], projection=[id@0, id@3]
02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[id, s, get_field(s@1, value) as simple_struct.s[value]], file_type=parquet
03)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/join_right.parquet]]}, projection=[id, s, get_field(s@1, level) * 10 as join_right.s[level] * Int64(10)], file_type=parquet
🤔 It's outside of this PR, but
get_field(s@1, level) * 10 as join_right.s[level] * Int64(10)
looks confusing. I would expect just
get_field(s@1, level) * 10
or, if we want to preserve the generated projection column names, then
get_field(s@1, level) * 10 as [join_right.s[level] * Int64(10)]
I'm actually surprised this is getting pushed down into the scan here; I'm not sure what would cause that. It's not a bad thing, but maybe we can evaluate whether we should have the aliases there or not the next time we change this.
01)ProjectionExec: expr=[id@0 as id, get_field(s@1, label) as simple_struct.s[label], get_field(s@2, role) as join_right.s[role]]
02)--HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(id@0, id@0)], projection=[id@0, s@1, s@3]
03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[id, s], file_type=parquet
04)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/join_right.parquet]]}, projection=[id, s], file_type=parquet, predicate=DynamicFilter [ empty ]
Should we expect get_field pushdown here? Or is it cheaper to bring in the entire struct when many fields are requested?
Ideally we want pushdown here; it just doesn't work with the current status quo.
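For context, a minimal sketch of how one might inspect the join case being discussed, assuming two local parquet files registered under the same names as in the test; the query is only roughly the shape that produces the plan quoted above:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Assumptions: local parquet files, each with an `id` column and a struct column `s`.
    ctx.register_parquet("simple_struct", "simple.parquet", ParquetReadOptions::default())
        .await?;
    ctx.register_parquet("join_right", "join_right.parquet", ParquetReadOptions::default())
        .await?;

    // Struct fields from both inputs are only consumed above the join, so with
    // the current status quo the get_field calls stay in the ProjectionExec
    // over the HashJoinExec rather than being pushed into the scans.
    ctx.sql(
        "EXPLAIN SELECT l.id, l.s['label'], r.s['role'] \
         FROM simple_struct l JOIN join_right r ON l.id = r.id",
    )
    .await?
    .show()
    .await?;
    Ok(())
}
```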
Pulling out of #20117