Skip to content

Clarify rehash setting in hash utils #20150

@Jefffrey

Description

@Jefffrey

Is your feature request related to a problem or challenge?

In our hash util functions, we have a rehash argument across many of them:

fn hash_array_primitive<T>(
array: &PrimitiveArray<T>,
random_state: &RandomState,
hashes_buffer: &mut [u64],
rehash: bool,
) where

fn hash_array<T>(
array: &T,
random_state: &RandomState,
hashes_buffer: &mut [u64],
rehash: bool,
) where

fn hash_string_view_array_inner<
T: ByteViewType,
const HAS_NULLS: bool,
const HAS_BUFFERS: bool,
const REHASH: bool,
>(

It's not clearly obvious why we do this from the code alone; it seems it used to be named multi_col and would be true if we needed to hash multiple columns, but was changed in #6816 to also skip rehash if it is the first column, for performance reasons.

  • It seems dictionary function also still calls it multi_col

I also found it confusing how certain hash functions don't have a rehash parameter; specifically the nested types such as list, struct, etc.

fn hash_struct_array(
array: &StructArray,
random_state: &RandomState,
hashes_buffer: &mut [u64],
) -> Result<()> {

fn hash_map_array(
array: &MapArray,
random_state: &RandomState,
hashes_buffer: &mut [u64],
) -> Result<()> {

fn hash_list_array<OffsetSize>(
array: &GenericListArray<OffsetSize>,
random_state: &RandomState,
hashes_buffer: &mut [u64],
) -> Result<()>
where

Describe the solution you'd like

Add some documentation explaining why we have a rehash parameter across the functions. Also look into adding rehash parameter for those hash functions missing them. If this parameter was omitted on purpose for such functions, leave an explanation of why this is the case.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions