Skip to content

chore: Clarify rehash setting in hash utils#20154

Open
kumarUjjawal wants to merge 2 commits intoapache:mainfrom
kumarUjjawal:chore/clarify_rehash
Open

chore: Clarify rehash setting in hash utils#20154
kumarUjjawal wants to merge 2 commits intoapache:mainfrom
kumarUjjawal:chore/clarify_rehash

Conversation

@kumarUjjawal
Copy link
Contributor

@kumarUjjawal kumarUjjawal commented Feb 4, 2026

Which issue does this PR close?

Rationale for this change

rehash controls whether a column initializes per-row hashes or combines into an existing accumulator when
hashing multiple columns. Nested and dictionary hashing previously didn’t follow this convention consistently, making behavior harder to reason about.

What changes are included in this PR?

• Documented rehash semantics and how create_hashes applies it.
• Renamed dictionary hashing parameter multi_col to rehash for consistency.
• Updated nested-type hashers (list/struct/map/union/fixed-size-list) to accept and honor rehash, and made null
handling consistent with other types.
• Updated/added tests to reflect the corrected semantics.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the common Related to common crate label Feb 4, 2026
dict_hashes: &[u64],
dict_values: &dyn Array,
idx: usize,
// `multi_col` is the historical name for what is now referred to as `rehash` elsewhere
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to fix the naming than add a comment trying to explain the discrepency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some changes, also did a minor refactor, there is another opportunity for the refactor to centralize the common “compute per-row nested hash → apply rehash (init vs combine) → hash null rows” logic in a shared helper, so each nested hasher mostly just computes a row_hash and delegates the buffer update/null handling. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be careful of this approach, as we'd want to pull as many checks outside the hotloop as we can; checking rehash each time we compute a hash is inefficient compared to checking once before the loop, as many other functions here do

Comment on lines 480 to 481
// buffer, it would change the numeric hash values produced for standalone
// Struct columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should look into fixing this instead of leaving the reasoning as "keep existing behaviour", especially when we don't know the root cause of why this existing behaviour is the way it is

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clarify rehash setting in hash utils

2 participants