This is the repository for the paper "Data Chunk Compaction in Vectorized Execution", accepted by SIGMOD'25.
The Supplementary Material of our paper includes three repositories:
- Problem formalization and simulation
- Microbenchmarks that compare various compaction strategies (current repository)
- Integration of Learning Compaction and Logical Compaction into DuckDB, with an evaluation of end-to-end performance
Updates: The implementation of Logical Compaction has been successfully merged into DuckDB!
This repository contains code that we use in the microbenchmark section of the paper. It includes a vectorized execution engine that supports the hash join and the filter operators.
It implements several compaction strategies, including
- Logical Compaction
- Smart Compaction
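As background for readers new to the topic, the sketch below shows a generic buffering compactor that merges small chunks into full ones. It is not the Logical or Smart Compaction implemented in this repository; `Chunk`, `BufferingCompactor`, `Push`, and `Flush` are hypothetical names, and a plain `std::vector<int>` stands in for a columnar data chunk.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Assumption: a chunk is modelled as a flat vector of ints for illustration only.
using Chunk = std::vector<int>;

// Hypothetical compactor: buffers incoming tuples and emits only full chunks.
class BufferingCompactor {
public:
    // capacity is the target chunk size and is assumed to be greater than zero.
    explicit BufferingCompactor(std::size_t capacity) : capacity_(capacity) {}

    // Copies the input tuples into the buffer and returns every chunk
    // that became full while doing so (possibly none).
    std::vector<Chunk> Push(const Chunk &input) {
        std::vector<Chunk> ready;
        for (int value : input) {
            buffer_.push_back(value);
            if (buffer_.size() == capacity_) {
                ready.push_back(std::move(buffer_));
                buffer_.clear();  // reset the moved-from buffer to a known empty state
            }
        }
        return ready;
    }

    // Returns whatever is still buffered, e.g. at the end of the input.
    Chunk Flush() {
        Chunk rest = std::move(buffer_);
        buffer_.clear();
        return rest;
    }

private:
    std::size_t capacity_;
    Chunk buffer_;
};
```

With a capacity of 4, pushing a 3-tuple chunk emits nothing, and pushing a further 2-tuple chunk emits one full chunk and leaves one tuple buffered. The per-tuple copy in `Push` is the cost that this kind of compaction pays to obtain fuller chunks.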
C++17 or later is required, specifically:
- The code uses the `<filesystem>` header, which requires C++17
- On older systems or compilers, you may need to use `<experimental/filesystem>` instead
- If you encounter compilation errors related to filesystem, try:
  - For GCC < 8: Link with `-lstdc++fs` and use `#include <experimental/filesystem>`
  - For older systems: Replace `std::filesystem` with `std::experimental::filesystem`
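One way to avoid editing the includes by hand is to select the header at compile time. The sketch below uses the standard `__has_include` preprocessor check and a namespace alias; it is not taken from this repository. The alias `fs` and the helper `ExecutablePath` are hypothetical, and the `exe_<strategy>` naming is an assumption based on the `exe_logical_compaction` executable used later in this README.

```cpp
#include <string>

// Pick whichever filesystem header the toolchain provides.
#if __has_include(<filesystem>)
#include <filesystem>
namespace fs = std::filesystem;
#elif __has_include(<experimental/filesystem>)
#include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
#else
#error "No <filesystem> or <experimental/filesystem> header found"
#endif

// Hypothetical helper: builds the path of a generated executable,
// assuming executables are named exe_<strategy> inside the compaction folder.
inline std::string ExecutablePath(const std::string &strategy) {
    return (fs::path("compaction") / ("exe_" + strategy)).string();
}
```

Code written against the `fs` alias compiles unchanged with either header. On older toolchains the link flag (`-lstdc++fs` or `-lc++fs`) is still needed.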
Tested on:
- GCC 8+ with C++17 support
- Clang 7+ with C++17 support
- Modern Linux distributions and macOS
We provide a build script that generates an executable for each of the strategies above:

    bash ./build_versions.sh

You can find the code of other compaction strategies in the other branch.
The generated executables are placed in the `compaction` folder.
Usage: [program_name] [options]
Options:
--join-num [value] Number of joins
--chunk-factor [value] Chunk factor
--lhs-size [value] Size of LHS tuples
--rhs-size [value] Size of RHS tuples
--load-factor [value] Load factor
--payload-length=[list] Comma-separated list of payload lengths for RHS
Example: --payload-length=[0,1000,0,0]
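To make the expected format of the `--payload-length` value concrete, here is a minimal sketch of parsing a bracketed, comma-separated list into integers. It is an illustration under stated assumptions, not the repository's argument parser; `ParsePayloadLengths` is a hypothetical name.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: turns "[0,1000,0,0]" into {0, 1000, 0, 0}.
// Throws std::invalid_argument if the brackets are missing or an item is not a number.
std::vector<std::size_t> ParsePayloadLengths(const std::string &arg) {
    if (arg.size() < 2 || arg.front() != '[' || arg.back() != ']') {
        throw std::invalid_argument("expected a list like [0,1000,0,0]");
    }
    std::vector<std::size_t> lengths;
    // Strip the surrounding brackets, then split on commas.
    std::istringstream items(arg.substr(1, arg.size() - 2));
    std::string item;
    while (std::getline(items, item, ',')) {
        lengths.push_back(static_cast<std::size_t>(std::stoull(item)));
    }
    return lengths;
}
```

The example value `[0,1000,0,0]` from the option description parses to four lengths, presumably one per join when `--join-num` is 4.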
Note for macOS/zsh users: If you encounter a "no matches found" error, wrap the payload-length value in quotes:

    --payload-length="[0,0,0,0]"

Linux/bash:

    ./compaction/exe_logical_compaction --join-num 4 --chunk-factor 5 --lhs-size 20000000 --rhs-size 2000000 --load-factor 0.5 --payload-length=[0,0,0,0]

macOS/zsh:

    ./compaction/exe_logical_compaction --join-num 4 --chunk-factor 5 --lhs-size 20000000 --rhs-size 2000000 --load-factor 0.5 --payload-length="[0,0,0,0]"

Cross-platform (recommended):

    ./compaction/exe_logical_compaction --join-num 4 --chunk-factor 5 --lhs-size 20000000 --rhs-size 2000000 --load-factor 0.5 --payload-length="[0,0,0,0]"

Expected output:
(base) yiming@golf:~/projects/compaction-project$ ./compaction/exe_logical_compaction --join-num 4 --chunk-factor 5 --lhs-size 20000000 --rhs-size 2000000 --load-factor 0.5 --payload-length="[0,0,0,0]"
------------------ Setting ------------------
Strategy: logical_compaction
Number of Joins: 4
Number of LHS Tuple: 20000000
Number of RHS Tuple: 2000000
Chunk Factor: 5
Load Factor: 0.5
RHS Payload Lengths: [0,0,0,0]
------------------ Statistic ------------------
[Total Time]: 4.07585s
-------
Total: 1.26864 s Calls: 48830 Avg: 2.59807e-05 s [Join - Next] 0x93854344831568
Total: 0.615136 s Calls: 72980 Avg: 8.42883e-06 s [Join - Next] 0x93858637800000
Total: 0.462437 s Calls: 82860 Avg: 5.58094e-06 s [Join - Next] 0x93858827360528
Total: 0.464402 s Calls: 90620 Avg: 5.12472e-06 s [Join - Next] 0x93859254106048
Total: 0.578376 s Calls: 9766 Avg: 5.92234e-05 s [Join - Probe] 0x93854344831568
Total: 0.281568 s Calls: 14596 Avg: 1.92908e-05 s [Join - Probe] 0x93858637800000
Total: 0.172811 s Calls: 16572 Avg: 1.04279e-05 s [Join - Probe] 0x93858827360528
Total: 0.129722 s Calls: 18709 Avg: 6.93368e-06 s [Join - Probe] 0x93859254106048
-------
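Each statistics line reports a total time, a call count, and their quotient as the average (for example, 1.26864 s over 48830 calls gives about 2.598e-05 s per call). The sketch below shows one common way to collect such numbers with an RAII timer. It is not the repository's `profiler.h`; `SectionStats`, `Registry`, and `ScopedTimer` are hypothetical names.

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical accumulator for one labelled code section.
struct SectionStats {
    double total_seconds = 0.0;
    std::uint64_t calls = 0;

    // Average time per call; zero if the section was never entered.
    double Average() const {
        return calls == 0 ? 0.0 : total_seconds / static_cast<double>(calls);
    }
};

// Hypothetical global registry keyed by label, e.g. "[Join - Probe]".
inline std::map<std::string, SectionStats> &Registry() {
    static std::map<std::string, SectionStats> registry;
    return registry;
}

// RAII timer: measures from construction to destruction and adds the
// elapsed time and one call to the stats of the given label.
class ScopedTimer {
public:
    explicit ScopedTimer(const std::string &label)
        : stats_(Registry()[label]), start_(std::chrono::steady_clock::now()) {}

    ScopedTimer(const ScopedTimer &) = delete;
    ScopedTimer &operator=(const ScopedTimer &) = delete;

    ~ScopedTimer() {
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start_;
        stats_.total_seconds += elapsed.count();
        ++stats_.calls;
    }

private:
    SectionStats &stats_;  // std::map references stay valid across later insertions
    std::chrono::steady_clock::time_point start_;
};
```

Placing a `ScopedTimer` at the top of a function counts every invocation, which is how a per-operator "Calls" column like the one above could be produced.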
If you encounter compilation errors related to `<filesystem>`, try the following:

For older GCC (< 8.0):

    # Manually link the filesystem library (libraries go after the source files)
    g++ -std=c++17 your_files.cpp -lstdc++fs

For older systems that don't support `std::filesystem`:
- Replace `#include <filesystem>` with `#include <experimental/filesystem>` in:
  - `profiler.h`
  - `negative_feedback.hpp`
- Replace `std::filesystem` with `std::experimental::filesystem` in the same files
- Link with `-lstdc++fs` for GCC or `-lc++fs` for Clang

The CMakeLists.txt has been updated to automatically handle this for most cases.