Is your feature request related to a problem? Please describe.
cudf::hash_partition is fragile when the number of requested partitions is large enough (num_partitions > 1024) that we don't dispatch to the "optimized" kernels.
It breaks in the following ways:
- The fallback tracks the assignment of rows to partitions in a vector that turns out to be sparse in this case, so it can be much larger than the number of input rows, leading to out-of-memory errors.
- Even if the allocation succeeds, the same sparse vector can end up with more than uint32::max entries, and so we hit the usual 32-bit offset errors from thrust.
It would be great if we could lift these restrictions.
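To give a rough sense of the scale involved, here is a back-of-envelope sketch. The sizing model below (a dense per-(block, partition) scratch buffer, with a made-up `rows_per_block` of 256) is hypothetical and not libcudf's exact formula, but it shows how such a buffer grows with the product of block count and partition count rather than with the row count, and so can blow past both the input size and the 32-bit offset limit:

```python
# Hypothetical sizing model, NOT libcudf's actual formula: a dense
# per-(block, partition) scratch buffer scales with num_blocks * num_partitions.
UINT32_MAX = 2**32 - 1

def scratch_entries(num_rows, num_partitions, rows_per_block=256):
    # Number of thread blocks needed to cover the input (assumed layout).
    num_blocks = (num_rows + rows_per_block - 1) // rows_per_block
    # One slot per (block, partition) pair, regardless of how many rows
    # actually land in each partition -- hence "sparse".
    return num_blocks * num_partitions

rows = 10_000_000
parts = 200_000  # well above the 1024 cutoff for the optimized kernels
entries = scratch_entries(rows, parts)
print(entries > rows)        # buffer dwarfs the number of input rows
print(entries > UINT32_MAX)  # element count no longer fits in 32-bit offsets
```

Under these assumed numbers, both conditions hold: the scratch buffer is hundreds of times larger than the input, and its element count exceeds what 32-bit offsets can address.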