[FEATURE] Add RANGE_BUCKET sort mode for bulk insert to handle large datasets without OOM #18059

@prashantwason

Description

Motivation

When using bulk insert with very large datasets (e.g., billions of records, many TB of data), the existing sort modes can cause OOM issues:

  • GLOBAL_SORT: Requires a global shuffle that can cause OOM when sorting large datasets with many files (e.g., 2800+ files each ~1GB)
  • PARTITION_SORT: Provides suboptimal file sizes and doesn't help with record key ordering across partitions
  • NONE: Fast but provides no data ordering benefits for query predicate pushdown

Users with 5TB+ datasets and 60+ billion records (see #12116) have reported OOM issues with all existing sort modes.

Proposed Solution

Add a new RANGE_BUCKET sort mode that:

  1. Uses quantile-based bucketing: Leverages Spark's approxQuantile() algorithm to compute range boundaries with minimal memory overhead
  2. Assigns records to buckets: Records are assigned to buckets based on value ranges of a specified column (defaulting to record key)
  3. Handles skew: Automatically detects skewed partitions and redistributes to sub-partitions
  4. Optional within-partition sorting: Can sort records within each bucket for better compression
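The steps above can be sketched in plain Python (no Spark dependency). This is a minimal illustration, not the proposed implementation: exact quantiles over a key sample stand in for Spark's `approxQuantile()` (which uses an approximate algorithm with a configurable relative error), and the function names and the skew-multiplier heuristic are assumptions for demonstration only:

```python
import bisect

def compute_boundaries(sample_keys, num_buckets):
    # Exact quantiles over a sample stand in for Spark's approxQuantile().
    # Boundaries split the key space into num_buckets roughly equal ranges.
    keys = sorted(sample_keys)
    return [keys[(i * len(keys)) // num_buckets] for i in range(1, num_buckets)]

def assign_bucket(key, boundaries):
    # bisect_right returns how many boundaries are <= key, i.e. the bucket id.
    return bisect.bisect_right(boundaries, key)

def detect_skewed_buckets(bucket_counts, skew_multiplier=2.0):
    # Hypothetical skew rule: a bucket is skewed when it holds more than
    # skew_multiplier times the average bucket size; such buckets would be
    # redistributed into sub-partitions.
    avg = sum(bucket_counts) / len(bucket_counts)
    return [i for i, c in enumerate(bucket_counts) if c > skew_multiplier * avg]

# Uniform keys land in evenly sized buckets.
keys = [f"key-{i:05d}" for i in range(1000)]
boundaries = compute_boundaries(keys, num_buckets=4)
buckets = [assign_bucket(k, boundaries) for k in keys]
```

Because bucket boundaries are ordered, every bucket covers a contiguous key range, which is what yields the tight min/max value ranges in parquet footers mentioned below.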

Benefits

  • Memory efficient: Avoids expensive global shuffle operations that cause OOM
  • Better query performance: Tight value ranges in parquet footers enable effective predicate pushdown
  • Better compression: Adjacent records (in sorted order) are colocated, enabling ~30% better compression with dictionary/RLE encoding
  • Configurable: Supports custom partitioning columns, skew detection thresholds, and within-partition sorting

Proposed Configuration Options

| Config | Description |
| --- | --- |
| `hoodie.bulkinsert.sort.mode=RANGE_BUCKET` | Enable range bucket mode |
| `hoodie.bulkinsert.range.bucket.partitioning.column` | Column to use for bucketing (defaults to record key) |
| `hoodie.bulkinsert.range.bucket.partitioning.sort.within.partition` | Enable within-partition sorting |
| `hoodie.bulkinsert.range.bucket.partitioning.skew.multiplier` | Threshold multiplier for skew detection |
| `hoodie.bulkinsert.range.bucket.partitioning.quantile.accuracy` | Accuracy for quantile computation |
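If adopted, these options would be set like any other Hudi write config. The values below are purely illustrative (the column name, multiplier, and accuracy are placeholders, not recommended defaults):

```properties
hoodie.bulkinsert.sort.mode=RANGE_BUCKET
hoodie.bulkinsert.range.bucket.partitioning.column=record_key
hoodie.bulkinsert.range.bucket.partitioning.sort.within.partition=true
hoodie.bulkinsert.range.bucket.partitioning.skew.multiplier=2.0
hoodie.bulkinsert.range.bucket.partitioning.quantile.accuracy=0.01
```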

Use Case

This is particularly useful for:

  • Initial bulk loading of very large datasets
  • Tables with high-cardinality primary keys (UUIDs, etc.)
  • Workloads where query patterns benefit from record key ordering
