[FEATURE] Add RANGE_BUCKET sort mode for bulk insert to handle large datasets without OOM #18059

@prashantwason

Description

Motivation

When using bulk insert with very large datasets (e.g., billions of records, many TB of data), the existing sort modes can cause OOM issues:

  • GLOBAL_SORT: Requires a global shuffle that can cause OOM when sorting large datasets with many files (e.g., 2800+ files each ~1GB)
  • PARTITION_SORT: Provides suboptimal file sizes and doesn't help with record key ordering across partitions
  • NONE: Fast but provides no data ordering benefits for query predicate pushdown

Users with 5TB+ datasets and 60+ billion records (see #12116) have reported OOM issues with all existing sort modes.

Proposed Solution

Add a new RANGE_BUCKET sort mode that:

  1. Uses quantile-based bucketing: Leverages Spark's approxQuantile() algorithm to compute range boundaries with minimal memory overhead
  2. Assigns records to buckets: Records are assigned to buckets based on value ranges of a specified column (defaulting to record key)
  3. Handles skew: Automatically detects skewed partitions and redistributes to sub-partitions
  4. Optional within-partition sorting: Can sort records within each bucket for better compression
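The steps above can be sketched in plain Python (no Spark dependency). This is a minimal illustration, not the proposed implementation: exact quantiles over a key sample stand in for Spark's `approxQuantile()` (which uses an approximate algorithm with a configurable relative error), and the function names and the skew-multiplier heuristic are assumptions for demonstration only:

```python
import bisect

def compute_boundaries(sample_keys, num_buckets):
    # Exact quantiles over a sample stand in for Spark's approxQuantile().
    # Boundaries split the key space into num_buckets roughly equal ranges.
    keys = sorted(sample_keys)
    return [keys[(i * len(keys)) // num_buckets] for i in range(1, num_buckets)]

def assign_bucket(key, boundaries):
    # bisect_right returns how many boundaries are <= key, i.e. the bucket id.
    return bisect.bisect_right(boundaries, key)

def detect_skewed_buckets(bucket_counts, skew_multiplier=2.0):
    # Hypothetical skew rule: a bucket is skewed when it holds more than
    # skew_multiplier times the average bucket size; such buckets would be
    # redistributed into sub-partitions.
    avg = sum(bucket_counts) / len(bucket_counts)
    return [i for i, c in enumerate(bucket_counts) if c > skew_multiplier * avg]

# Uniform keys land in evenly sized buckets.
keys = [f"key-{i:05d}" for i in range(1000)]
boundaries = compute_boundaries(keys, num_buckets=4)
buckets = [assign_bucket(k, boundaries) for k in keys]
```

Because bucket boundaries are ordered, every bucket covers a contiguous key range, which is what yields the tight min/max value ranges in parquet footers mentioned below.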

Benefits

  • Memory efficient: Avoids expensive global shuffle operations that cause OOM
  • Better query performance: Tight value ranges in parquet footers enable effective predicate pushdown
  • Better compression: Adjacent records (in sorted order) are colocated, enabling ~30% better compression with dictionary/RLE encoding
  • Configurable: Supports custom partitioning columns, skew detection thresholds, and within-partition sorting

Proposed Configuration Options

| Config | Description |
| --- | --- |
| `hoodie.bulkinsert.sort.mode=RANGE_BUCKET` | Enable range bucket mode |
| `hoodie.bulkinsert.range.bucket.partitioning.column` | Column to use for bucketing (defaults to record key) |
| `hoodie.bulkinsert.range.bucket.partitioning.sort.within.partition` | Enable within-partition sorting |
| `hoodie.bulkinsert.range.bucket.partitioning.skew.multiplier` | Threshold multiplier for skew detection |
| `hoodie.bulkinsert.range.bucket.partitioning.quantile.accuracy` | Accuracy for quantile computation |
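If adopted, these options would be set like any other Hudi write config. The values below are purely illustrative (the column name, multiplier, and accuracy are placeholders, not recommended defaults):

```properties
hoodie.bulkinsert.sort.mode=RANGE_BUCKET
hoodie.bulkinsert.range.bucket.partitioning.column=record_key
hoodie.bulkinsert.range.bucket.partitioning.sort.within.partition=true
hoodie.bulkinsert.range.bucket.partitioning.skew.multiplier=2.0
hoodie.bulkinsert.range.bucket.partitioning.quantile.accuracy=0.01
```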

Use Case

This is particularly useful for:

  • Initial bulk loading of very large datasets
  • Tables with high-cardinality primary keys (UUIDs, etc.)
  • Workloads where query patterns benefit from record key ordering
