-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
In our hash util functions, we have a rehash argument across many of them:
datafusion/datafusion/common/src/hash_utils.rs
Lines 185 to 190 in b80bf2c
| fn hash_array_primitive<T>( | |
| array: &PrimitiveArray<T>, | |
| random_state: &RandomState, | |
| hashes_buffer: &mut [u64], | |
| rehash: bool, | |
| ) where |
datafusion/datafusion/common/src/hash_utils.rs
Lines 230 to 235 in b80bf2c
| fn hash_array<T>( | |
| array: &T, | |
| random_state: &RandomState, | |
| hashes_buffer: &mut [u64], | |
| rehash: bool, | |
| ) where |
datafusion/datafusion/common/src/hash_utils.rs
Lines 282 to 287 in b80bf2c
| fn hash_string_view_array_inner< | |
| T: ByteViewType, | |
| const HAS_NULLS: bool, | |
| const HAS_BUFFERS: bool, | |
| const REHASH: bool, | |
| >( |
It's not clearly obvious why we do this from the code alone; it seems it used to be named multi_col and would be true if we needed to hash multiple columns, but was changed in #6816 to also skip rehash if it is the first column, for performance reasons.
- It seems dictionary function also still calls it
multi_col
I also found it confusing how certain hash functions don't have a rehash parameter; specifically the nested types such as list, struct, etc.
datafusion/datafusion/common/src/hash_utils.rs
Lines 447 to 451 in b80bf2c
| fn hash_struct_array( | |
| array: &StructArray, | |
| random_state: &RandomState, | |
| hashes_buffer: &mut [u64], | |
| ) -> Result<()> { |
datafusion/datafusion/common/src/hash_utils.rs
Lines 475 to 479 in b80bf2c
| fn hash_map_array( | |
| array: &MapArray, | |
| random_state: &RandomState, | |
| hashes_buffer: &mut [u64], | |
| ) -> Result<()> { |
datafusion/datafusion/common/src/hash_utils.rs
Lines 510 to 515 in b80bf2c
| fn hash_list_array<OffsetSize>( | |
| array: &GenericListArray<OffsetSize>, | |
| random_state: &RandomState, | |
| hashes_buffer: &mut [u64], | |
| ) -> Result<()> | |
| where |
Describe the solution you'd like
Add some documentation explaining why we have a rehash parameter across the functions. Also look into adding rehash parameter for those hash functions missing them. If this parameter was omitted on purpose for such functions, leave an explanation of why this is the case.
Describe alternatives you've considered
No response
Additional context
No response