feat: [ExternalTable Part4] Support data mapping for external collections#47730
weiliu1031 wants to merge 5 commits into milvus-io:master from …
After Part3 is merged to master, please rebase this PR:
git checkout part4_enable_data_mapping
git fetch upstream master
git rebase upstream/master
git push --force-with-lease
This will clean up the commit history so only Part4-specific changes remain.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: weiliu1031
@weiliu1031 This is a feature PR ( How to resolve: Design documents location: https://github.com/milvus-io/milvus-design-docs/tree/main/design_docs
@weiliu1031 Please associate the related issue to the body of your Pull Request. (eg. "issue: #")
…ctions

issue: milvus-io#45881

This change introduces manual refresh capability for external collections, allowing users to trigger on-demand data synchronization from external sources. It replaces the legacy update mechanism with a more robust job-task hierarchy and persistent state management.

Key changes:
- Add RefreshExternalCollection, GetRefreshExternalCollectionProgress, and ListRefreshExternalCollectionJobs APIs across Client, Proxy, and DataCoord
- Implement ExternalCollectionRefreshManager to manage refresh jobs with a 1:N Job-Task hierarchy
- Add ExternalCollectionRefreshMeta for persistent storage of jobs and tasks in the metastore
- Add ExternalCollectionRefreshChecker for task state management and worker assignment
- Implement ExternalCollectionRefreshInspector for periodic job cleanup
- Use WAL Broadcast mechanism for distributed consistency and idempotency
- Replace legacy external_collection_inspector and update tasks with the new refresh-based implementation
- Add comprehensive unit tests for refresh job lifecycle and state transitions

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
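The 1:N Job-Task hierarchy mentioned in this commit could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the type names `RefreshJob`, `RefreshTask`, `TaskState` and the `Progress` method are hypothetical, and the real state set and progress calculation may differ.

```go
package main

import "fmt"

// TaskState is a hypothetical state enum; the PR's real states are not shown on this page.
type TaskState int

const (
	TaskPending TaskState = iota
	TaskInProgress
	TaskFinished
	TaskFailed
)

// RefreshTask is one unit of work that belongs to exactly one job (hypothetical shape).
type RefreshTask struct {
	TaskID int64
	JobID  int64
	State  TaskState
}

// RefreshJob owns N tasks; its progress is derived from their states.
type RefreshJob struct {
	JobID        int64
	CollectionID int64
	Tasks        []*RefreshTask
}

// Progress returns the percentage of finished tasks (0 to 100) and
// whether at least one task has failed. A job without tasks reports 0.
func (j *RefreshJob) Progress() (percent int, failed bool) {
	if len(j.Tasks) == 0 {
		return 0, false
	}
	finished := 0
	for _, t := range j.Tasks {
		switch t.State {
		case TaskFinished:
			finished++
		case TaskFailed:
			failed = true
		}
	}
	return finished * 100 / len(j.Tasks), failed
}

func main() {
	job := &RefreshJob{JobID: 1, CollectionID: 100, Tasks: []*RefreshTask{
		{TaskID: 1, JobID: 1, State: TaskFinished},
		{TaskID: 2, JobID: 1, State: TaskInProgress},
	}}
	percent, failed := job.Progress()
	fmt.Printf("job %d: %d%% done, failed=%v\n", job.JobID, percent, failed)
}
```

Deriving job progress from task states keeps the tasks as the single source of truth, so only tasks need state transitions persisted; whether the PR stores job state separately is not visible here.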
…nality

- Add RefreshExternalCollectionOption tests in client
- Add util key building tests for external refresh jobs/tasks
- Add KV catalog tests for external collection refresh operations
- Add MixCoord tests for refresh methods (refresh, progress, list)
- Add distributed service tests for refresh RPC endpoints
- Add distributed client tests for refresh operations

All tests validate correct behavior of external collection manual refresh feature introduced in part3 enhancement.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
…n cache miss fix

- Add RefreshExternalCollection, GetRefreshExternalCollectionProgress, ListRefreshExternalCollectionJobs RPC forwarding to mixcoord client, service and proxy service layers
- Register ExternalCollection task type in AppendType/GetTaskType
- Replace direct meta lookup with collectionGetter callback in refresh manager to handle collection cache miss race condition
- Replace CreateExternalCollection implementation with no-op stub
- Regenerate mock_mixcoord and mock_mixcoord_client via make generate-mockery
- Add unit tests for properties, proxy forwarding and no-op stub

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
…ions

Implement the data mapping pipeline that converts external data files (Parquet) into Milvus segments with proper column mapping.

Key changes:
- Pre-allocate segment IDs in DataCoord, pass to DataNode for direct final-path manifest writes (eliminating two-phase ID workflow)
- Add FFI bridges for file exploration (ExploreFiles, GetFileInfo) and manifest creation (CreateManifestForSegment, ReadFragmentsFromManifest)
- Implement fragment-to-segment balancing with configurable target rows
- Add ExternalSpec parser for external data format configuration
- Extend UpdateExternalCollectionRequest proto with schema, storage config, and pre-allocated segment ID fields
- Add E2E test for external collection refresh with data verification

Signed-off-by: Jiquan Long <jiquan.long@zilliz.com>
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
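To make the fragment-to-segment balancing and pre-allocated segment IDs concrete, here is one possible sketch. It uses a simple in-order greedy packing, which is an assumption: the PR's actual balancing strategy is not shown on this page. The names `Fragment`, `SegmentPlan`, `groupFragments` and `assignSegmentIDs` are hypothetical.

```go
package main

import "fmt"

// Fragment is a hypothetical description of one chunk of an external file.
type Fragment struct {
	FilePath string
	RowCount int64
}

// SegmentPlan pairs a pre-allocated segment ID with the fragments mapped to it.
type SegmentPlan struct {
	SegmentID int64
	Fragments []Fragment
	Rows      int64
}

// groupFragments packs fragments, in order, into groups of at most targetRows
// rows. A single fragment larger than targetRows gets a group of its own.
func groupFragments(fragments []Fragment, targetRows int64) [][]Fragment {
	var groups [][]Fragment
	var current []Fragment
	var rows int64
	for _, f := range fragments {
		// Close the current group if adding this fragment would exceed the target.
		if len(current) > 0 && rows+f.RowCount > targetRows {
			groups = append(groups, current)
			current, rows = nil, 0
		}
		current = append(current, f)
		rows += f.RowCount
	}
	if len(current) > 0 {
		groups = append(groups, current)
	}
	return groups
}

// assignSegmentIDs attaches IDs that were allocated up front (in the PR, by
// DataCoord) to the groups, failing if too few IDs were provided.
func assignSegmentIDs(groups [][]Fragment, ids []int64) ([]SegmentPlan, error) {
	if len(ids) < len(groups) {
		return nil, fmt.Errorf("need %d segment IDs, got %d", len(groups), len(ids))
	}
	plans := make([]SegmentPlan, 0, len(groups))
	for i, g := range groups {
		var rows int64
		for _, f := range g {
			rows += f.RowCount
		}
		plans = append(plans, SegmentPlan{SegmentID: ids[i], Fragments: g, Rows: rows})
	}
	return plans, nil
}

func main() {
	fragments := []Fragment{
		{"a.parquet", 400}, {"b.parquet", 500}, {"c.parquet", 300},
	}
	plans, err := assignSegmentIDs(groupFragments(fragments, 1000), []int64{9001, 9002})
	if err != nil {
		panic(err)
	}
	for _, p := range plans {
		fmt.Printf("segment %d: %d fragments, %d rows\n", p.SegmentID, len(p.Fragments), p.Rows)
	}
}
```

Because each plan already carries its final segment ID, a worker could write the manifest directly to the segment's final path, which is the motivation the commit gives for dropping the two-phase ID workflow.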
Force-pushed from 301efc0 to 88014cf.
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@             Coverage Diff              @@
##           master   #47730       +/-   ##
===========================================
+ Coverage   74.12%   82.49%    +8.37%
===========================================
  Files        1452      614      -838
  Lines      238922    98356   -140566
===========================================
- Hits       177091    81139    -95952
+ Misses      53686    17165    -36521
+ Partials     8145       52     -8093
design doc: https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260105-external_table.md
issue: #45881
Summary
- Add FFI bridges for file exploration (ExploreFiles, GetFileInfo) and manifest creation (CreateManifestForSegment, ReadFragmentsFromManifest)
- Add ExternalSpec parser for external data format configuration
- Extend UpdateExternalCollectionRequest proto with schema, storage config, and pre-allocated segment ID fields

Test plan
- … task_refresh_external_collection.go (28 tests)
- … task_update.go and fragment utilities (40 tests)
- … (exttable_test.go, 9 tests)
- … ExternalSpec parser
- make lint-fix passes