|
| 1 | +# Vectorized Execution Plan |
| 2 | + |
| 3 | +## Goal |
| 4 | +Implement vectorized execution to improve query performance by processing documents in batches. This improves CPU cache locality and enables the compiler to use SIMD instructions for filtering and projection. |
| 5 | + |
| 6 | +## Design |
| 7 | + |
| 8 | +### 1. Batch Structure |
| 9 | +We introduce a `Batch` struct that holds a chunk of data to be processed together. |
| 10 | + |
| 11 | +```rust |
| 12 | +pub struct Batch { |
| 13 | + // Row-oriented batching of LazyDocuments (ExecutionResult) |
| 14 | + pub items: Vec<ExecutionResult>, |
| 15 | +} |
| 16 | +``` |
| 17 | + |
| 18 | +### 2. Iterator Model |
| 19 | +Operators produce and consume `Batch` objects. |
| 20 | + |
| 21 | +- `BatchIterator` trait (effectively `Iterator<Item = Batch>`). |
| 22 | +- Batch size: Defaults to 4096, but adaptable via hints (e.g. for `LIMIT` queries). |
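A minimal sketch of this iterator model follows. `ExecutionResult` is replaced by a placeholder type, and `batch_size_for` and `total_items` are hypothetical helpers introduced only for illustration; the real trait may carry extra methods.

```rust
/// Placeholder for the plan's `ExecutionResult`; the real type is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub struct ExecutionResult(pub u64);

pub struct Batch {
    pub items: Vec<ExecutionResult>,
}

/// Default batch size stated in the plan.
pub const DEFAULT_BATCH_SIZE: usize = 4096;

/// Anything that yields `Batch` values counts as a `BatchIterator`.
pub trait BatchIterator: Iterator<Item = Batch> {}
impl<T: Iterator<Item = Batch>> BatchIterator for T {}

/// Hypothetical helper: derive a batch size from an optional `LIMIT` hint.
/// The result is kept in the range 1..=DEFAULT_BATCH_SIZE.
pub fn batch_size_for(limit_hint: Option<usize>) -> usize {
    match limit_hint {
        Some(limit) => limit.clamp(1, DEFAULT_BATCH_SIZE),
        None => DEFAULT_BATCH_SIZE,
    }
}

/// Hypothetical consumer: counts the items flowing through a batch pipeline.
pub fn total_items(batches: impl BatchIterator) -> usize {
    batches.map(|batch| batch.items.len()).sum()
}
```

The blanket `impl` means any existing iterator over `Batch` can be passed wherever a `BatchIterator` is expected, without wrapper types.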
| 23 | + |
| 24 | +### 3. Vectorized Operators |
| 25 | + |
| 26 | +#### BatchScanOperator |
| 27 | +- Reads from the underlying `MergedIterator`. |
| 28 | +- Accumulates `N` items into a `Batch`. |
| 29 | +- **Optimization:** Accepts a `batch_size` parameter (derived from `LIMIT` hint) to avoid over-fetching data from storage. |
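The accumulation step could look roughly like the sketch below. It is generic over any source iterator rather than tied to `MergedIterator`, and `Batch` is made generic so the example is self-contained; both are assumptions made for illustration.

```rust
/// Generic stand-in for the plan's `Batch` (which holds `ExecutionResult`s).
pub struct Batch<T> {
    pub items: Vec<T>,
}

/// Pulls at most `batch_size` items from `source` per call to `next`.
pub struct BatchScanOperator<I: Iterator> {
    source: I,
    batch_size: usize,
}

impl<I: Iterator> BatchScanOperator<I> {
    pub fn new(source: I, batch_size: usize) -> Self {
        // A batch size of zero would never make progress, so use at least 1.
        Self { source, batch_size: batch_size.max(1) }
    }
}

impl<I: Iterator> Iterator for BatchScanOperator<I> {
    type Item = Batch<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut items = Vec::with_capacity(self.batch_size);
        // Stop as soon as the batch is full, so the source is never read
        // further than the current batch requires.
        while items.len() < self.batch_size {
            match self.source.next() {
                Some(item) => items.push(item),
                None => break,
            }
        }
        if items.is_empty() {
            None
        } else {
            Some(Batch { items })
        }
    }
}
```

With a small `batch_size` derived from a `LIMIT` hint, the first call to `next` reads only that many items from the source.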
| 30 | + |
| 31 | +#### BatchFilterOperator (SIMD Target) |
| 32 | +- **Input:** `Batch` of `LazyDocument`s. |
| 33 | +- **Process:** |
| 34 | + 1. Check whether the predicate is suitable for vectorization (currently only simple binary numeric expressions). |
| 35 | + 2. **Column Extraction:** Iterate through the batch and extract the specific field for all documents into a reusable typed buffer (`Vec<f64>`). |
| 36 | + - **Optimization:** Uses `evaluate_to_f64_lazy` (in `src/expression.rs`) to extract values directly from raw JSONB bytes without allocating intermediate `Value` enums or `BTreeMap`s. |
| 37 | + 3. **SIMD Evaluation:** Perform the comparison loop over the extracted vector. |
| 38 | + - Relies on Rust/LLVM auto-vectorization for tight loops over primitive arrays. |
| 39 | + 4. **Selection:** Filter the `Batch` in-place using the computed mask. |
| 40 | +- **Fallback:** If the predicate is complex (e.g. `OR`, nested paths, non-numeric operands), the execution planner falls back to the standard row-based execution plan (`execute_row_plan`) so that these queries do not regress in performance. |
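The extraction, comparison, and selection steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual operator: `Doc` stands in for `LazyDocument` with the field already decoded (the real code would call `evaluate_to_f64_lazy` on JSONB bytes), and `CmpOp`, `compare_into`, and `buf_mask` are hypothetical names.

```rust
/// Stand-in for `LazyDocument`; `None` models a missing or non-numeric field.
pub struct Doc {
    pub price: Option<f64>,
}

#[derive(Clone, Copy)]
pub enum CmpOp {
    Lt,
    Le,
    Gt,
    Ge,
}

pub struct BatchFilter {
    op: CmpOp,
    rhs: f64,
    // Buffers are kept across batches so each call reuses their capacity.
    buf_values: Vec<f64>,
    buf_valid: Vec<bool>,
    buf_mask: Vec<bool>,
}

/// Tight loop over primitive slices; a candidate for auto-vectorization.
fn compare_into(values: &[f64], valid: &[bool], mask: &mut Vec<bool>, pred: impl Fn(f64) -> bool) {
    mask.clear();
    mask.extend(values.iter().zip(valid).map(|(&v, &ok)| ok & pred(v)));
}

impl BatchFilter {
    pub fn new(op: CmpOp, rhs: f64) -> Self {
        Self { op, rhs, buf_values: Vec::new(), buf_valid: Vec::new(), buf_mask: Vec::new() }
    }

    pub fn apply(&mut self, items: &mut Vec<Doc>) {
        // Column extraction: one f64 and one validity flag per document.
        self.buf_values.clear();
        self.buf_valid.clear();
        for doc in items.iter() {
            self.buf_values.push(doc.price.unwrap_or(0.0));
            self.buf_valid.push(doc.price.is_some());
        }

        // Evaluation: the operator is matched once per batch, not once per row.
        let rhs = self.rhs;
        let (vals, valid, mask) = (&self.buf_values, &self.buf_valid, &mut self.buf_mask);
        match self.op {
            CmpOp::Lt => compare_into(vals, valid, mask, |v| v < rhs),
            CmpOp::Le => compare_into(vals, valid, mask, |v| v <= rhs),
            CmpOp::Gt => compare_into(vals, valid, mask, |v| v > rhs),
            CmpOp::Ge => compare_into(vals, valid, mask, |v| v >= rhs),
        }

        // Selection: `retain` visits elements in order, so the mask lines up.
        let mut keep = self.buf_mask.iter();
        items.retain(|_| keep.next().copied().unwrap_or(false));
    }
}
```

Whether the comparison loop is actually compiled to SIMD instructions depends on the compiler and target; the sketch only arranges the data so that it can be.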
| 41 | + |
| 42 | +#### BatchProjectOperator |
| 43 | +- Iterate over the `Batch`. |
| 44 | +- Apply the projection to each item, as in a `map`. |
| 45 | +- Currently excluded from automatic vectorization selection for stability, so queries with projections run on the row-based plan. |
| 46 | + |
| 47 | +#### BatchLimitOperator / BatchOffsetOperator |
| 48 | +- Handle `LIMIT` and `OFFSET` directly on `Batch` streams, so the pipeline does not have to switch back to row-at-a-time iteration. |
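One way to apply `OFFSET` and `LIMIT` to a stream of batches is sketched below. For brevity it folds both into a single hypothetical `BatchOffsetLimit` adapter, whereas the plan names two separate operators; `Batch` is again a generic stand-in.

```rust
/// Generic stand-in for the plan's `Batch`.
pub struct Batch<T> {
    pub items: Vec<T>,
}

/// Skips `offset` items, then yields at most `limit` items, batch by batch.
pub struct BatchOffsetLimit<I> {
    input: I,
    to_skip: usize,
    remaining: usize,
}

impl<I> BatchOffsetLimit<I> {
    pub fn new(input: I, offset: usize, limit: usize) -> Self {
        Self { input, to_skip: offset, remaining: limit }
    }
}

impl<T, I: Iterator<Item = Batch<T>>> Iterator for BatchOffsetLimit<I> {
    type Item = Batch<T>;

    fn next(&mut self) -> Option<Batch<T>> {
        // Once the limit is exhausted, stop pulling from upstream.
        while self.remaining > 0 {
            let mut batch = self.input.next()?;

            // OFFSET: drop items from the front until the offset is consumed.
            let skip = self.to_skip.min(batch.items.len());
            drop(batch.items.drain(..skip));
            self.to_skip -= skip;

            // LIMIT: keep no more than the number of items still allowed.
            batch.items.truncate(self.remaining);
            self.remaining -= batch.items.len();

            // Batches emptied by the offset are skipped, not emitted.
            if !batch.items.is_empty() {
                return Some(batch);
            }
        }
        None
    }
}
```

Because the adapter trims whole batches in place, downstream operators keep receiving `Batch` values and never see individual rows.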
| 49 | + |
| 50 | +### 4. Execution Strategy (`execute_plan`) |
| 51 | + |
| 52 | +The `execute_plan` function chooses between vectorized and row-based execution: |
| 53 | + |
| 54 | +1. **Check Vectorizability:** Analyzes the logical plan. If the plan consists of Scan and Simple Numeric Filters (and optionally Limit/Offset), it qualifies for vectorization. |
| 55 | +2. **Vectorized Path (`execute_batch_plan`):** |
| 56 | + - Constructs a pipeline of `Batch*` operators. |
| 57 | + - Propagates `LIMIT` values as hints to `BatchScanOperator`. |
| 58 | + - **Disable Pushdown:** Deliberately avoids pushing the predicate down to the storage engine (`db.scan`) for these simple cases. This forces the data into `BatchFilterOperator` where the efficient SIMD loop and allocation-free extraction can outperform the storage engine's row-by-row check. |
| 59 | +3. **Row-based Path (`execute_row_plan`):** |
| 60 | + - Used for complex queries (logical operators, projections, complex paths). |
| 61 | + - Utilizes standard `ScanOperator`, `FilterOperator`, etc. |
| 62 | + - Leverages full predicate pushdown to `MergedIterator`. |
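The vectorizability check might be sketched as a recursive walk over the logical plan. The `Plan`, `Expr`, and `CmpOp` types and the `is_vectorizable` function below are hypothetical simplifications; the real logical plan is not shown in this document and is likely richer.

```rust
pub enum CmpOp {
    Lt,
    Gt,
    Eq,
}

/// Hypothetical expression tree.
pub enum Expr {
    Field(String),     // top-level field, e.g. `age`
    Path(Vec<String>), // nested path, e.g. `address.zip`
    Number(f64),
    Text(String),
    Compare(Box<Expr>, CmpOp, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

/// Hypothetical logical plan; each node wraps its input.
pub enum Plan {
    Scan,
    Filter(Box<Plan>, Expr),
    Project(Box<Plan>, Vec<String>),
    Limit(Box<Plan>, usize),
    Offset(Box<Plan>, usize),
}

/// True for `field <op> number` or `number <op> field`, and nothing else.
fn is_simple_numeric(expr: &Expr) -> bool {
    match expr {
        Expr::Compare(lhs, _, rhs) => matches!(
            (&**lhs, &**rhs),
            (Expr::Field(_), Expr::Number(_)) | (Expr::Number(_), Expr::Field(_))
        ),
        _ => false,
    }
}

/// A plan qualifies if it contains only Scan, simple numeric Filters,
/// and optionally Limit/Offset. Anything else takes the row-based path.
pub fn is_vectorizable(plan: &Plan) -> bool {
    match plan {
        Plan::Scan => true,
        Plan::Filter(input, pred) => is_simple_numeric(pred) && is_vectorizable(input),
        Plan::Limit(input, _) | Plan::Offset(input, _) => is_vectorizable(input),
        Plan::Project(..) => false,
    }
}
```

Under this sketch, `execute_plan` would call `execute_batch_plan` when `is_vectorizable` returns `true` and `execute_row_plan` otherwise.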
| 63 | + |
| 64 | +## Optimization Strategy |
| 65 | +- **Memory Reuse:** `BatchFilterOperator` reuses `buf_values` and `buf_valid` vectors between batches to avoid allocation churn. |
| 66 | +- **Allocation-Free Extraction:** `evaluate_to_f64_lazy` bypasses `Value` creation. |
| 67 | +- **Adaptive Batching:** `BatchScanOperator` scales batch size based on query limits. |