    set(CLP_EXTERNAL_BINARY_DIR ${CMAKE_BINARY_DIR}/external/clp)
    add_subdirectory(${clp_SOURCE_DIR}/components/core/src/clp/string_utils
                     ${CLP_EXTERNAL_BINARY_DIR}/string_utils)
    set(YSTDLIB_CPP_BUILD_TESTING OFF)
    add_subdirectory(${clp_SOURCE_DIR}/components/core/submodules/ystdlib-cpp
                     ${CLP_EXTERNAL_BINARY_DIR}/ystdlib-cpp EXCLUDE_FROM_ALL)

    string(LENGTH "${CMAKE_SOURCE_DIR}/" SOURCE_PATH_SIZE)

    antlr_target(
      KqlParser
      ${CLP_SRC_DIR}/clp_s/search/kql/Kql.g4
      LEXER
      PARSER
      VISITOR
      PACKAGE
      kql)

    set(CLP_SRC_FILES
      ${ANTLR_KqlParser_CXX_OUTPUTS}
      ${CLP_SRC_DIR}/clp_s/ArchiveReader.cpp
      ${CLP_SRC_DIR}/clp_s/ArchiveReaderAdaptor.cpp
      ${CLP_SRC_DIR}/clp_s/ColumnReader.cpp
Can you explain what's happening here? This seems like vendoring with extra steps. (steps that add a bunch of dependencies...)
To be specific: why do we have to generate the KQL parser with ANTLR, take on that dependency, and then make the CLP sources part of our targets? Is there no regular CLP target that provides this functionality, including the parser?
Sorry about that. We are doing some refactoring work and fixing CI failures. I'll get back to you when this PR is ready.
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!
Overview
The current Presto–CLP connector PR introduces the coordinator-side implementation, along with a placeholder (dummy) worker implementation. Detailed information about the overall design is available in the corresponding RFC. This Velox PR focuses on the worker-side logic.
The Velox-CLP connector enables query execution on CLP archives. The Velox worker receives split information and the associated KQL query from the Presto coordinator. For each split, it executes the KQL query against the relevant CLP archive to find matching messages and stores their indices.
To support lazy evaluation, the implementation creates lazy vectors that wrap a CLP column reader and the list of matching indices. When accessed during query execution, these vectors load and decode only the necessary data on demand.
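The lazy-loading idea above can be sketched in plain C++ without any Velox or CLP headers. This is only an illustration of the pattern: `LazyColumn` and `DecodeFn` are hypothetical names, not types from the PR, and the decode callback stands in for a CLP column reader.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CLP column reader: decodes one value by row index.
using DecodeFn = std::function<std::string(std::size_t)>;

// Hypothetical lazy column: stores the reader and the matching row indices,
// and decodes values only when they are first requested.
class LazyColumn {
 public:
  LazyColumn(DecodeFn decode, std::vector<std::size_t> rows)
      : decode_(std::move(decode)), rows_(std::move(rows)) {}

  // True once the values have been decoded.
  bool isLoaded() const {
    return values_.has_value();
  }

  // Decodes the matching rows on first access and caches the result.
  const std::vector<std::string>& values() {
    if (!values_) {
      std::vector<std::string> decoded;
      decoded.reserve(rows_.size());
      for (std::size_t row : rows_) {
        decoded.push_back(decode_(row));
      }
      values_ = std::move(decoded);
    }
    return *values_;
  }

 private:
  DecodeFn decode_;
  std::vector<std::size_t> rows_;
  std::optional<std::vector<std::string>> values_;
};
```

A column that no operator ever reads is never decoded in this sketch, and only the rows that matched the query are decoded when it is read.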
Core Classes
ClpDataSource
This class extends `DataSource` and implements the `addSplit` and `next` methods. During initialization, it records the KQL query and the archive source (S3 or local), then traverses the output type to map Presto fields to CLP projection fields. Only `ARRAY(VARCHAR)` and primitive leaf fields such as `BIGINT`, `DOUBLE`, `BOOLEAN`, and `VARCHAR` are projected.
When a split is added, a `ClpCursor` is created with the archive path and input source. The query is parsed and simplified into an AST. On `next`, the cursor finds matching row indices and, if any exist, returns a row vector composed of lazy vectors, which load data as needed during execution.
ClpCursor
This class manages the execution of a query over a CLP-S archive. It handles parsing and validation, loading schemas and archives, setting up projection fields, and filtering results. In CLP-S, records are partitioned by schema. `ClpCursor` uses `ClpQueryRunner` to initialize the execution context for each schema and evaluate the filters. It skips archives where dictionary lookups for string filters return no matches, and within an archive it scans only the relevant schemas. For example, consider a log dataset with the following records.
The three log messages have varying schemas. If we run the KQL query `a: World AND b: 0`, the cursor skips loading the third message because its schema does not match the query (there is no `b` field). If the query is `a: random AND b: 0`, it also skips scanning the first two records, because `random` cannot be found in the dictionary.
ClpQueryRunner
This class extends the generic CLP `QueryRunner` to support ordered projection and row filtering. It initializes projected column readers and returns filtered row indices for each batch.
ClpVectorLoader
In CLP, values are decoded and read from a `BaseColumnReader`. The `ClpVectorLoader` is a custom Velox `VectorLoader` that loads vectors from CLP column readers. It supports integers, floats, booleans, strings, and arrays of strings. Lazy vectors use it to load data on demand from the previously stored row indices.
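The archive and schema pruning described for `ClpCursor` can be sketched as follows. This is a simplified model for purely conjunctive (AND) queries, not the PR's code: `Schema`, `Query`, and `schemasToScan` are hypothetical names, the dictionary is modeled as a set of strings, and a schema is modeled as the set of field names its records share.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a CLP-S schema: an id plus the field names it contains.
struct Schema {
  int id;
  std::set<std::string> fields;
};

// Hypothetical model of an AND-only query.
struct Query {
  std::vector<std::string> requiredFields; // every field the query references
  std::vector<std::string> stringTerms;    // string literals checked against the dictionary
};

// Returns the ids of the schemas worth scanning. An empty result means the
// whole archive can be skipped.
std::vector<int> schemasToScan(
    const std::set<std::string>& dictionary,
    const std::vector<Schema>& schemas,
    const Query& query) {
  // Dictionary pruning: a string term missing from the dictionary cannot match
  // any record, so no schema needs to be scanned.
  for (const auto& term : query.stringTerms) {
    if (dictionary.count(term) == 0) {
      return {};
    }
  }
  // Schema pruning: keep only schemas that contain every referenced field.
  std::vector<int> result;
  for (const auto& schema : schemas) {
    const bool hasAllFields = std::all_of(
        query.requiredFields.begin(),
        query.requiredFields.end(),
        [&schema](const std::string& field) {
          return schema.fields.count(field) > 0;
        });
    if (hasAllFields) {
      result.push_back(schema.id);
    }
  }
  return result;
}
```

With made-up data in which one schema lacks field `b`, a query on `a` and `b` would drop that schema, and a query whose string term is absent from the dictionary would return no schemas at all. Queries with OR, negation, or wildcards would need more careful handling than this sketch provides.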