embryo-labs/Chunk-Compaction-in-Vectorized-Execution

Data Chunk Compaction in Vectorized Execution

This is the repository for the paper "Data Chunk Compaction in Vectorized Execution", accepted at SIGMOD'25.

The supplementary material of our paper comprises three repositories:

  1. Problem formalization and simulation
  2. Microbenchmarks comparing various compaction strategies (this repository)
  3. Integration of Learning and Logical Compaction into DuckDB, with an end-to-end performance evaluation

Updates: The implementation of Logical Compaction has been successfully merged into DuckDB!


This repository contains the code used in the microbenchmark section of the paper. It includes a vectorized execution engine that supports hash join and filter operators.

It implements several compaction strategies, including:

  • Logical Compaction
  • Smart Compaction

System Requirements

C++17 or later is required, specifically:

  • The code uses the <filesystem> header, which requires C++17
  • On older systems or compilers, you may need to use <experimental/filesystem> instead
  • If you encounter compilation errors related to filesystem, try:
    • For GCC < 8: Link with -lstdc++fs and use #include <experimental/filesystem>
    • For older systems: Replace std::filesystem with std::experimental::filesystem

Tested on:

  • GCC 8+ with C++17 support
  • Clang 7+ with C++17 support
  • Modern Linux distributions and macOS

Building

We provide a build script that generates the executables for the strategies listed above:

bash ./build_versions.sh

The code for the other compaction strategies is in the other branch.

The generated executable files are placed in the folder compaction.

Usage: [program_name] [options]
Options:
    --join-num [value]        Number of joins
    --chunk-factor [value]    Chunk factor
    --lhs-size [value]        Size of LHS tuples
    --rhs-size [value]        Size of RHS tuples
    --load-factor [value]     Load factor
    --payload-length=[list]   Comma-separated list of payload lengths for RHS   
                                Example: --payload-length=[0,1000,0,0]

Note for macOS/zsh users: zsh treats unquoted square brackets as a glob pattern. If you encounter a "no matches found" error, wrap the payload-length value in quotes:

--payload-length="[0,0,0,0]"
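
Once quoted, the shell passes the bracketed list to the program verbatim. As an illustration only, a minimal sketch of how such a value could be turned into a list of lengths is shown below. The function name ParsePayloadLengths is hypothetical and this is not the parser used in this repository.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from this repository): parses a value such as
// "[0,1000,0,0]" into {0, 1000, 0, 0}.
inline std::vector<std::size_t> ParsePayloadLengths(const std::string &arg) {
    // The value must be wrapped in square brackets.
    if (arg.size() < 2 || arg.front() != '[' || arg.back() != ']') {
        throw std::invalid_argument("expected a list such as [0,1000,0,0]");
    }
    std::vector<std::size_t> lengths;
    // Strip the brackets and split the remainder on commas.
    std::istringstream body(arg.substr(1, arg.size() - 2));
    std::string item;
    while (std::getline(body, item, ',')) {
        // std::stoull throws std::invalid_argument on an empty or
        // non-numeric item, e.g. for "[0,,1]".
        lengths.push_back(static_cast<std::size_t>(std::stoull(item)));
    }
    return lengths;
}
```

In this sketch, an empty list "[]" yields an empty vector, and the number of entries would be expected to match the value given to --join-num.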

Example:

Linux/bash:

./compaction/exe_logical_compaction --join-num 4 --chunk-factor 5 --lhs-size 20000000 --rhs-size 2000000 --load-factor 0.5 --payload-length=[0,0,0,0]

macOS/zsh:

./compaction/exe_logical_compaction --join-num 4 --chunk-factor 5 --lhs-size 20000000 --rhs-size 2000000 --load-factor 0.5 --payload-length="[0,0,0,0]"

Cross-platform (recommended):

./compaction/exe_logical_compaction --join-num 4 --chunk-factor 5 --lhs-size 20000000 --rhs-size 2000000 --load-factor 0.5 --payload-length="[0,0,0,0]"

Expected output:

(base) yiming@golf:~/projects/compaction-project$ ./compaction/exe_logical_compaction --join-num 4 --chunk-factor 5 --lhs-size 20000000 --rhs-size 2000000 --load-factor 0.5 --payload-length="[0,0,0,0]"
------------------ Setting ------------------
Strategy: logical_compaction
Number of Joins: 4
Number of LHS Tuple: 20000000
Number of RHS Tuple: 2000000
Chunk Factor: 5
Load Factor: 0.5
RHS Payload Lengths: [0,0,0,0]
------------------ Statistic ------------------
[Total Time]: 4.07585s
-------
Total: 1.26864 s        Calls: 48830    Avg: 2.59807e-05 s      [Join - Next] 0x93854344831568
Total: 0.615136 s       Calls: 72980    Avg: 8.42883e-06 s      [Join - Next] 0x93858637800000
Total: 0.462437 s       Calls: 82860    Avg: 5.58094e-06 s      [Join - Next] 0x93858827360528
Total: 0.464402 s       Calls: 90620    Avg: 5.12472e-06 s      [Join - Next] 0x93859254106048
Total: 0.578376 s       Calls: 9766     Avg: 5.92234e-05 s      [Join - Probe] 0x93854344831568
Total: 0.281568 s       Calls: 14596    Avg: 1.92908e-05 s      [Join - Probe] 0x93858637800000
Total: 0.172811 s       Calls: 16572    Avg: 1.04279e-05 s      [Join - Probe] 0x93858827360528
Total: 0.129722 s       Calls: 18709    Avg: 6.93368e-06 s      [Join - Probe] 0x93859254106048
-------
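
In each statistics line, Avg is Total divided by Calls; for the first line, 1.26864 s / 48830 calls is roughly 2.598e-05 s. The sketch below shows one way such a per-operator counter could be accumulated. The TimerStat type is hypothetical and is not taken from profiler.h.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical accumulator for one "[Join - Next]"-style line
// (an assumption for illustration, not the repository's profiler).
struct TimerStat {
    double total_seconds = 0.0;  // reported as "Total"
    std::uint64_t calls = 0;     // reported as "Calls"

    // Adds the duration of one measured call.
    void Record(std::chrono::duration<double> elapsed) {
        total_seconds += elapsed.count();
        ++calls;
    }

    // Reported as "Avg"; returns 0 when nothing has been recorded.
    double Average() const {
        return calls == 0 ? 0.0 : total_seconds / static_cast<double>(calls);
    }
};
```

A caller would typically take two std::chrono::steady_clock::now() readings around the measured call and pass their difference to Record.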

Troubleshooting

Filesystem Compilation Issues

If you encounter compilation errors related to <filesystem>, try the following:

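For GCC 8.x, which ships <filesystem> but requires linking the filesystem library explicitly:

# Manually link the filesystem library (place it after the source files)
g++ -std=c++17 your_files.cpp -lstdc++fs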

For older systems that don't support std::filesystem:

  1. Replace #include <filesystem> with #include <experimental/filesystem> in:
    • profiler.h
    • negative_feedback.hpp
  2. Replace std::filesystem with std::experimental::filesystem in the same files
  3. Link with -lstdc++fs for GCC or -lc++fs for Clang

The CMakeLists.txt has been updated to automatically handle this for most cases.
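
If you prefer not to edit the includes by hand, the header swap described above can also be expressed with a preprocessor check. The following is a minimal sketch, assuming a compiler that supports __has_include (GCC 5+ and recent Clang) and compilation with -std=c++17. The fs alias and the FileNameOf helper are hypothetical and not part of this repository.

```cpp
#include <string>

// Pick whichever filesystem header the toolchain provides and expose it
// under a single alias, so the rest of the code only refers to fs::.
#if defined(__has_include)
#  if __has_include(<filesystem>)
#    include <filesystem>
namespace fs = std::filesystem;
#  elif __has_include(<experimental/filesystem>)
#    include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
#  else
#    error "No filesystem header found"
#  endif
#else
// Compilers without __has_include: assume the experimental header.
#  include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
#endif

// Hypothetical usage: returns the file name component of a path.
inline std::string FileNameOf(const std::string &path) {
    return fs::path(path).filename().string();
}
```

With this approach the source files stay identical across compilers, but older toolchains still need the filesystem library on the link line (for example -lstdc++fs with GCC 8 and earlier).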

About

[SIGMOD'25] We show the data chunk compaction problem in vectorized execution, and propose practical compaction solutions.
