feat: Add CLP connector#13364

Closed
wraymo wants to merge 13 commits into facebookincubator:main from y-scope:clp_integration

Conversation

wraymo (Contributor) commented May 16, 2025

Overview

The current Presto–CLP connector PR introduces the coordinator-side implementation, along with a placeholder (dummy) worker implementation. Detailed information about the overall design is available in the corresponding RFC. This Velox PR focuses on the worker-side logic.

The Velox-CLP connector enables query execution on CLP archives. The Velox worker receives split information and the associated KQL query from the Presto coordinator. For each split, it executes the KQL query against the relevant CLP archive to find matching messages and stores their indices.

To support lazy evaluation, the implementation creates lazy vectors that wrap a CLP column reader and the list of matching indices. When accessed during query execution, these vectors load and decode only the necessary data on demand.
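The lazy-evaluation idea can be illustrated with a minimal, self-contained sketch. It does not use the Velox or CLP APIs: `ColumnReader` and `LazyColumn` are hypothetical stand-ins for a CLP column reader and a lazy vector, and the column is assumed to hold 64-bit integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CLP column reader: decodes the value at a row.
using ColumnReader = std::function<int64_t(std::size_t)>;

// Hypothetical lazy column: stores the reader and the matching row indices,
// and decodes values only when they are first requested.
class LazyColumn {
 public:
  LazyColumn(ColumnReader reader, std::vector<std::size_t> rows)
      : reader_(std::move(reader)), rows_(std::move(rows)) {
    assert(reader_ && "reader must be callable");
  }

  // Decodes the selected rows on the first call and caches the result.
  const std::vector<int64_t>& values() {
    if (!loaded_.has_value()) {
      std::vector<int64_t> decoded;
      decoded.reserve(rows_.size());
      for (std::size_t row : rows_) {
        decoded.push_back(reader_(row));
      }
      loaded_ = std::move(decoded);
    }
    return *loaded_;
  }

  bool isLoaded() const { return loaded_.has_value(); }

 private:
  ColumnReader reader_;
  std::vector<std::size_t> rows_;
  std::optional<std::vector<int64_t>> loaded_;
};
```

Constructing a `LazyColumn` decodes nothing. The first call to `values()` decodes only the rows listed in `rows_`, and later calls reuse the cached result.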

Core Classes

ClpDataSource

This class extends DataSource and implements the addSplit and next methods. During initialization, it records the KQL query and archive source (S3 or local), then traverses the output type to map Presto fields to CLP projection fields. Only ARRAY(VARCHAR) and primitive leaf fields like BIGINT, DOUBLE, BOOLEAN and VARCHAR are projected.
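The output-type traversal might look roughly like the sketch below. It is an illustration only and does not use Velox's type classes: `Kind`, `Type`, `ProjectedField`, and the dotted-path naming are assumptions. The set of projectable kinds is limited to the ones named in the description.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical type model; kOther stands for any kind that is not projected.
enum class Kind { kBigint, kDouble, kBoolean, kVarchar, kArray, kRow, kOther };

struct Type {
  Kind kind;
  std::vector<std::string> names;               // field names (kRow only)
  std::vector<std::shared_ptr<Type>> children;  // kRow fields or kArray element
};

struct ProjectedField {
  std::string path;  // dotted path such as "user.id" (assumed naming scheme)
  Kind kind;
};

// True for primitive leaves and ARRAY(VARCHAR), per the description above.
bool isProjectable(const Type& type) {
  switch (type.kind) {
    case Kind::kBigint:
    case Kind::kDouble:
    case Kind::kBoolean:
    case Kind::kVarchar:
      return true;
    case Kind::kArray:
      return type.children.size() == 1 &&
          type.children[0]->kind == Kind::kVarchar;
    default:
      return false;
  }
}

// Walks nested rows depth-first and records every projectable leaf.
void collectProjections(
    const Type& type,
    const std::string& prefix,
    std::vector<ProjectedField>& out) {
  if (type.kind == Kind::kRow) {
    assert(type.names.size() == type.children.size());
    for (std::size_t i = 0; i < type.children.size(); ++i) {
      const std::string path =
          prefix.empty() ? type.names[i] : prefix + "." + type.names[i];
      collectProjections(*type.children[i], path, out);
    }
    return;
  }
  if (isProjectable(type)) {
    out.push_back({prefix, type.kind});
  }
}
```

Calling `collectProjections(rootType, "", fields)` on a row type yields one entry per supported leaf, in declaration order. Unsupported leaves are silently skipped in this sketch.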

When a split is added, a ClpCursor is created with the archive path and input source. The query is parsed and simplified into an AST. On next, the cursor finds matching row indices and, if any exist, returns a row vector composed of lazy vectors, which load data as needed during execution.

ClpCursor

This class manages the execution of a query over a CLP-S archive. It handles parsing and validation, loading schemas and archives, setting up projection fields, and filtering results. In CLP-S, records are partitioned by schema. ClpCursor uses ClpQueryRunner to initialize the execution context for each schema and evaluate the filters. It skips archives in which dictionary lookups for string filters return no matches, and within an archive it scans only the schemas relevant to the query. For example, consider a log dataset with the following records.

{"a": "Hello", "b": 2}
{"a": "World", "b": 0, "c": false}
{"a": "World", "c": true}

The three log messages have different schemas. For the KQL query a: World AND b: 0, the cursor skips loading the third message because its schema does not match the query (it has no b field). For the query a: random AND b: 0, it even skips scanning the first two records, because random cannot be found in the dictionary.
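The two pruning steps in this example can be sketched as follows. This is a simplified model, not CLP's implementation: `Term`, `Schema`, `Archive`, and `schemasToScan` are hypothetical, the query is assumed to be a pure conjunction of exact-match terms, and the dictionary is modeled as a set of strings.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One "field: value" term of a conjunctive query (hypothetical model).
struct Term {
  std::string field;
  std::string value;
  bool isString;  // true when the value must be looked up in the dictionary
};

struct Schema {
  std::set<std::string> fields;
};

struct Archive {
  std::set<std::string> dictionary;  // string values stored in the archive
  std::vector<Schema> schemas;
};

// Returns the indices of schemas that still need scanning for an AND query.
// An empty result means the whole archive can be skipped.
std::vector<std::size_t> schemasToScan(
    const Archive& archive,
    const std::vector<Term>& andTerms) {
  // Step 1: a string term that is missing from the dictionary cannot match.
  for (const Term& term : andTerms) {
    assert(!term.field.empty() && "every term needs a field name");
    if (term.isString && archive.dictionary.count(term.value) == 0) {
      return {};
    }
  }
  // Step 2: keep only schemas that contain every queried field.
  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < archive.schemas.size(); ++i) {
    bool hasAllFields = true;
    for (const Term& term : andTerms) {
      if (archive.schemas[i].fields.count(term.field) == 0) {
        hasAllFields = false;
        break;
      }
    }
    if (hasAllFields) {
      result.push_back(i);
    }
  }
  return result;
}
```

With the three records above modeled as schemas {a, b}, {a, b, c}, and {a, c}, the query a: World AND b: 0 leaves only the first two schemas to scan. The query a: random AND b: 0 returns an empty list because random is not in the dictionary.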

ClpQueryRunner

This class extends the generic CLP QueryRunner to support ordered projection and row filtering. It initializes projected column readers and returns filtered row indices for each batch.
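As a rough illustration of returning filtered row indices per batch, the sketch below evaluates a predicate over one batch of rows. `filterBatch` and its parameters are hypothetical and not part of CLP's QueryRunner. Whether the real implementation returns absolute or batch-relative indices is not stated here, so absolute indices are an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Evaluates `matches` over rows [offset, offset + batchSize), clamped to
// numRows, and returns the absolute indices of the rows that pass.
std::vector<std::size_t> filterBatch(
    std::size_t numRows,
    std::size_t offset,
    std::size_t batchSize,
    const std::function<bool(std::size_t)>& matches) {
  assert(batchSize > 0);
  std::vector<std::size_t> selected;
  const std::size_t end = std::min(numRows, offset + batchSize);
  for (std::size_t row = offset; row < end; ++row) {
    if (matches(row)) {
      selected.push_back(row);
    }
  }
  return selected;
}
```

A caller would advance `offset` by `batchSize` after each call and hand the returned indices to whatever loads the projected columns for that batch.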

ClpVectorLoader

In CLP, values are decoded and read from a BaseColumnReader. ClpVectorLoader is a custom Velox VectorLoader that loads vectors from CLP column readers. It supports integers, floats, booleans, strings, and arrays of strings. Lazy vectors use it to load data on demand from the previously stored row indices.

facebook-github-bot added the CLA Signed label May 16, 2025
netlify bot commented May 16, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 83ff95f
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/688111b2ae613100082cd042

wraymo force-pushed the clp_integration branch from 3dba21c to 72e96db May 16, 2025 14:39
wraymo force-pushed the clp_integration branch from 72e96db to afb071c May 16, 2025 14:50
Comment on lines 15 to 37
set(CLP_EXTERNAL_BINARY_DIR ${CMAKE_BINARY_DIR}/external/clp)
add_subdirectory(${clp_SOURCE_DIR}/components/core/src/clp/string_utils
                 ${CLP_EXTERNAL_BINARY_DIR}/string_utils)
set(YSTDLIB_CPP_BUILD_TESTING OFF)
add_subdirectory(${clp_SOURCE_DIR}/components/core/submodules/ystdlib-cpp
                 ${CLP_EXTERNAL_BINARY_DIR}/ystdlib-cpp EXCLUDE_FROM_ALL)

string(LENGTH "${CMAKE_SOURCE_DIR}/" SOURCE_PATH_SIZE)

antlr_target(
  KqlParser
  ${CLP_SRC_DIR}/clp_s/search/kql/Kql.g4
  LEXER
  PARSER
  VISITOR
  PACKAGE
  kql)

set(CLP_SRC_FILES
    ${ANTLR_KqlParser_CXX_OUTPUTS}
    ${CLP_SRC_DIR}/clp_s/ArchiveReader.cpp
    ${CLP_SRC_DIR}/clp_s/ArchiveReaderAdaptor.cpp
    ${CLP_SRC_DIR}/clp_s/ColumnReader.cpp
assignUser (Collaborator) commented May 16, 2025

Can you explain what's happening here? This seems like vendoring with extra steps. (steps that add a bunch of dependencies...)

To be specific: Why do we have to generate the kql parser with ANTLR and take on that dependency and then make the clp sources part of our targets? Is there no normal clp target that provides the functionality including the parser?

Contributor Author

Sorry about that. We are doing some refactoring and fixing CI failures. I'll get back to you when this PR is ready.

@wraymo wraymo marked this pull request as draft May 16, 2025 18:36
stale bot commented Oct 22, 2025

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Oct 22, 2025
@stale stale bot closed this Nov 5, 2025


5 participants