Motivation
When using bulk insert with very large datasets (e.g., billions of records, many TB of data), the existing sort modes can cause OOM issues:
- GLOBAL_SORT: Requires a global shuffle that can cause OOM when sorting large datasets with many files (e.g., 2800+ files each ~1GB)
- PARTITION_SORT: Provides suboptimal file sizes and doesn't help with record key ordering across partitions
- NONE: Fast but provides no data ordering benefits for query predicate pushdown
Users with 5TB+ datasets and 60+ billion records (see #12116) have reported OOM issues with all existing sort modes.
Proposed Solution
Add a new RANGE_BUCKET sort mode that:
- Uses quantile-based bucketing: Leverages Spark's
approxQuantile()algorithm to compute range boundaries with minimal memory overhead - Assigns records to buckets: Records are assigned to buckets based on value ranges of a specified column (defaulting to record key)
- Handles skew: Automatically detects skewed partitions and redistributes to sub-partitions
- Optional within-partition sorting: Can sort records within each bucket for better compression
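
For illustration, here is a minimal Spark sketch of the idea, assuming a numeric bucketing column (Spark's `approxQuantile()` only operates on numeric columns); the object, method, and column names below are hypothetical and not part of the proposed Hudi API:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Illustrative sketch only: compute approximate quantile boundaries, assign each
// record to a bucket, and optionally sort within each bucket.
object RangeBucketSketch {

  def assignBuckets(df: DataFrame,
                    bucketColumn: String,
                    numBuckets: Int,
                    relativeError: Double = 0.001,
                    sortWithinBuckets: Boolean = true): DataFrame = {
    // 1. Compute (numBuckets - 1) approximate quantile boundaries with Spark's
    //    sketch-based estimator; this avoids a global sort/shuffle.
    val probabilities = (1 until numBuckets).map(_.toDouble / numBuckets).toArray
    val boundaries = df.stat.approxQuantile(bucketColumn, probabilities, relativeError)

    // 2. Map each record to the first bucket whose upper boundary covers its value;
    //    values above the last boundary fall into the final bucket.
    val toBucket = udf((value: Double) => {
      val idx = boundaries.indexWhere(value <= _)
      if (idx < 0) numBuckets - 1 else idx
    })
    val bucketed = df.withColumn("range_bucket", toBucket(col(bucketColumn)))

    // 3. One Spark partition per bucket, with optional within-bucket sorting so
    //    adjacent values are colocated for better compression.
    val repartitioned = bucketed.repartition(numBuckets, col("range_bucket"))
    if (sortWithinBuckets) repartitioned.sortWithinPartitions(bucketColumn)
    else repartitioned
  }
}
```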
Benefits
- Memory efficient: Avoids expensive global shuffle operations that cause OOM
- Better query performance: Tight value ranges in parquet footers enable effective predicate pushdown
- Better compression: Adjacent records (in sorted order) are colocated, enabling ~30% better compression with dictionary/RLE encoding
- Configurable: Supports custom partitioning columns, skew detection thresholds, and within-partition sorting
Proposed Configuration Options
| Config | Description |
|---|---|
| `hoodie.bulkinsert.sort.mode=RANGE_BUCKET` | Enable range bucket mode |
| `hoodie.bulkinsert.range.bucket.partitioning.column` | Column to use for bucketing (defaults to record key) |
| `hoodie.bulkinsert.range.bucket.partitioning.sort.within.partition` | Enable within-partition sorting |
| `hoodie.bulkinsert.range.bucket.partitioning.skew.multiplier` | Threshold for skew detection |
| `hoodie.bulkinsert.range.bucket.partitioning.quantile.accuracy` | Accuracy for quantile computation |
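
As a hedged usage sketch, the proposed options could sit alongside the existing bulk insert configs like this. Note that `hoodie.table.name`, `hoodie.datasource.write.operation`, and `hoodie.bulkinsert.sort.mode` already exist in Hudi; the `range.bucket.*` keys are only proposed here, and the table name, column, values, and path are placeholders:

```scala
df.write.format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  // Proposed new sort mode and its options (not yet available in Hudi):
  .option("hoodie.bulkinsert.sort.mode", "RANGE_BUCKET")
  .option("hoodie.bulkinsert.range.bucket.partitioning.column", "event_ts")
  .option("hoodie.bulkinsert.range.bucket.partitioning.sort.within.partition", "true")
  .option("hoodie.bulkinsert.range.bucket.partitioning.skew.multiplier", "2.0")
  .option("hoodie.bulkinsert.range.bucket.partitioning.quantile.accuracy", "0.001")
  .mode("append")
  .save("s3://bucket/warehouse/events")
```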
Use Case
This is particularly useful for:
- Initial bulk loading of very large datasets
- Tables with high-cardinality primary keys (UUIDs, etc.)
- Workloads where query patterns benefit from record key ordering