chore: Clarify rehash setting in hash utils#20154
chore: Clarify rehash setting in hash utils#20154kumarUjjawal wants to merge 2 commits intoapache:mainfrom
Conversation
datafusion/common/src/hash_utils.rs
Outdated
| dict_hashes: &[u64], | ||
| dict_values: &dyn Array, | ||
| idx: usize, | ||
| // `multi_col` is the historical name for what is now referred to as `rehash` elsewhere |
There was a problem hiding this comment.
It's better to fix the naming than add a comment trying to explain the discrepency
There was a problem hiding this comment.
Made some changes, also did a minor refactor, there is another opportunity for the refactor to centralize the common “compute per-row nested hash → apply rehash (init vs combine) → hash null rows” logic in a shared helper, so each nested hasher mostly just computes a row_hash and delegates the buffer update/null handling. What do you think?
There was a problem hiding this comment.
I'd be careful of this approach, as we'd want to pull as many checks outside the hotloop as we can; checking rehash each time we compute a hash is inefficient compared to checking once before the loop, as many other functions here do
datafusion/common/src/hash_utils.rs
Outdated
| // buffer, it would change the numeric hash values produced for standalone | ||
| // Struct columns. |
There was a problem hiding this comment.
I think we should look into fixing this instead of leaving the reasoning as "keep existing behaviour", especially when we don't know the root cause of why this existing behaviour is the way it is
Which issue does this PR close?
rehashsetting in hash utils #20150Rationale for this change
rehash controls whether a column initializes per-row hashes or combines into an existing accumulator when
hashing multiple columns. Nested and dictionary hashing previously didn’t follow this convention consistently, making behavior harder to reason about.
What changes are included in this PR?
• Documented rehash semantics and how create_hashes applies it.
• Renamed dictionary hashing parameter multi_col to rehash for consistency.
• Updated nested-type hashers (list/struct/map/union/fixed-size-list) to accept and honor rehash, and made null
handling consistent with other types.
• Updated/added tests to reflect the corrected semantics.
Are these changes tested?
Yes
Are there any user-facing changes?
No