feat: [ExternalTable Part4] Support data mapping for external collections #47730

Open
weiliu1031 wants to merge 5 commits into milvus-io:master from weiliu1031:part4_enable_data_mapping

Conversation

@weiliu1031 (Contributor) commented Feb 10, 2026

design doc: https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260105-external_table.md

issue: #45881

Summary

  • Pre-allocate segment IDs in DataCoord, pass to DataNode for direct final-path manifest writes (eliminating two-phase ID workflow)
  • Add FFI bridges for file exploration (ExploreFiles, GetFileInfo) and manifest creation (CreateManifestForSegment, ReadFragmentsFromManifest)
  • Implement fragment-to-segment balancing with configurable target rows per segment
  • Add ExternalSpec parser for external data format configuration
  • Extend UpdateExternalCollectionRequest proto with schema, storage config, and pre-allocated segment ID fields
  • Add E2E test for external collection refresh with data verification
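
The fragment-to-segment balancing in the summary can be pictured as a greedy packing step. The sketch below is only an illustration under stated assumptions: the `Fragment` type, the `balanceFragments` function, and the in-order greedy strategy are hypothetical and are not taken from this PR's code.

```go
package main

import "fmt"

// Fragment is a hypothetical stand-in for one unit of external data
// (for example, a slice of a Parquet file) with a known row count.
type Fragment struct {
	ID   int64
	Rows int64
}

// balanceFragments packs fragments, in order, into groups whose total row
// count stays at or below targetRows. A fragment that alone exceeds
// targetRows gets a group of its own. Each group would map to one segment.
func balanceFragments(frags []Fragment, targetRows int64) [][]Fragment {
	var groups [][]Fragment
	var current []Fragment
	var currentRows int64
	for _, f := range frags {
		// Close the current group if adding f would exceed the target.
		if len(current) > 0 && currentRows+f.Rows > targetRows {
			groups = append(groups, current)
			current, currentRows = nil, 0
		}
		current = append(current, f)
		currentRows += f.Rows
	}
	if len(current) > 0 {
		groups = append(groups, current)
	}
	return groups
}

func main() {
	frags := []Fragment{{1, 400}, {2, 500}, {3, 300}, {4, 1200}, {5, 100}}
	for i, g := range balanceFragments(frags, 1000) {
		fmt.Printf("segment %d: %v\n", i, g)
	}
}
```

A greedy pass keeps fragments in their original order, which is simple to reason about; the actual implementation may balance differently.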

Note: This PR includes Part3 changes (PR #47303). After Part3 is merged, this PR will be rebased to only contain Part4-specific changes.

Test plan

  • Unit tests for task_refresh_external_collection.go (28 tests)
  • Unit tests for task_update.go and fragment utilities (40 tests)
  • Unit tests for FFI bridges (exttable_test.go, 9 tests)
  • Unit tests for ExternalSpec parser
  • Unit tests for paramtable config
  • Integration test with real Parquet files
  • make lint-fix passes
  • E2E test with MinIO backend

@weiliu1031
Contributor Author

⚠️ Dependency Note: This PR depends on Part3 (#47303). The current branch includes Part3 commits.

After Part3 is merged to master, please rebase this PR:

git checkout part4_enable_data_mapping
git fetch upstream master
git rebase upstream/master
git push --force-with-lease

This will clean up the commit history so only Part4-specific changes remain.

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weiliu1031
To complete the pull request process, please assign liliu-z after the PR has been reviewed.
You can assign the PR to them by writing /assign @liliu-z in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines.) Feb 10, 2026
@sre-ci-robot added the area/dependency (Pull requests that update a dependency file), area/internal-api, area/test, and sig/testing labels Feb 10, 2026
@mergify bot added the dco-passed label (DCO check passed.) Feb 10, 2026
@mergify
Contributor

mergify bot commented Feb 10, 2026

@weiliu1031 This is a feature PR (feat:). Please provide a design document.

How to resolve:
Link a design doc in the PR description:

design doc: https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/your_design.md

Design documents location: https://github.com/milvus-io/milvus-design-docs/tree/main/design_docs

@mergify bot added the do-not-merge/missing-design-doc and kind/feature (Issues related to feature request from users) labels Feb 10, 2026
@sre-ci-robot
Contributor

[ci-v2-notice]
Notice: New ci-v2 system is enabled for this PR.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-build-all // for ci-v2/build-all (multi-arch builds)
  • /ci-rerun-ut-integration // for ci-v2/ut-integration, will rerun ci-v2/build
  • /ci-rerun-ut-go // for ci-v2/ut-go, will rerun ci-v2/build
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp, will rerun ci-v2/build
  • /ci-rerun-e2e-arm // for ci-v2/e2e-arm
  • /ci-rerun-e2e-default // for ci-v2/e2e-default

If you have any questions or requests, please contact @zhikunyao.

@mergify
Contributor

mergify bot commented Feb 10, 2026

@weiliu1031 Please associate the related issue in the body of your Pull Request (e.g., "issue: #").

…ctions

issue: milvus-io#45881

This change introduces manual refresh capability for external
collections, allowing users to trigger on-demand data synchronization
from external sources. It replaces the legacy update mechanism with a
more robust job-task hierarchy and persistent state management.

Key changes:
- Add RefreshExternalCollection, GetRefreshExternalCollectionProgress,
  and ListRefreshExternalCollectionJobs APIs across Client, Proxy,
  and DataCoord
- Implement ExternalCollectionRefreshManager to manage refresh jobs
  with a 1:N Job-Task hierarchy
- Add ExternalCollectionRefreshMeta for persistent storage of jobs and
  tasks in the metastore
- Add ExternalCollectionRefreshChecker for task state management and
  worker assignment
- Implement ExternalCollectionRefreshInspector for periodic job
  cleanup
- Use WAL Broadcast mechanism for distributed consistency and
  idempotency
- Replace legacy external_collection_inspector and update tasks with
  the new refresh-based implementation
- Add comprehensive unit tests for refresh job lifecycle and state
  transitions

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
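
The 1:N Job-Task hierarchy described in this commit suggests that a job's progress is derived from the states of its tasks. The following is a minimal sketch of that idea, assuming hypothetical `RefreshJob`, `RefreshTask`, and `TaskState` types and a simple "finished tasks over total tasks" rule; the real state machine in the PR may differ.

```go
package main

import "fmt"

// TaskState is a hypothetical task lifecycle, not the PR's actual enum.
type TaskState int

const (
	TaskPending TaskState = iota
	TaskInProgress
	TaskFinished
	TaskFailed
)

// RefreshTask is one unit of work belonging to a job.
type RefreshTask struct {
	TaskID int64
	State  TaskState
}

// RefreshJob owns N tasks (the 1:N hierarchy).
type RefreshJob struct {
	JobID int64
	Tasks []RefreshTask
}

// Progress returns the percentage of finished tasks (0 to 100) and whether
// any task has failed. A job without tasks reports 0 and no failure.
func (j RefreshJob) Progress() (percent int, failed bool) {
	if len(j.Tasks) == 0 {
		return 0, false
	}
	finished := 0
	for _, t := range j.Tasks {
		switch t.State {
		case TaskFinished:
			finished++
		case TaskFailed:
			failed = true
		}
	}
	return finished * 100 / len(j.Tasks), failed
}

func main() {
	job := RefreshJob{JobID: 1, Tasks: []RefreshTask{
		{1, TaskFinished}, {2, TaskFinished}, {3, TaskInProgress}, {4, TaskPending},
	}}
	percent, failed := job.Progress()
	fmt.Printf("job %d: %d%% done, failed=%v\n", job.JobID, percent, failed)
}
```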
…nality

- Add RefreshExternalCollectionOption tests in client
- Add util key building tests for external refresh jobs/tasks
- Add KV catalog tests for external collection refresh operations
- Add MixCoord tests for refresh methods (refresh, progress, list)
- Add distributed service tests for refresh RPC endpoints
- Add distributed client tests for refresh operations

All tests validate correct behavior of external collection manual
refresh feature introduced in part3 enhancement.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
…n cache miss fix

- Add RefreshExternalCollection, GetRefreshExternalCollectionProgress,
  ListRefreshExternalCollectionJobs RPC forwarding to mixcoord client,
  service and proxy service layers
- Register ExternalCollection task type in AppendType/GetTaskType
- Replace direct meta lookup with collectionGetter callback in refresh
  manager to handle collection cache miss race condition
- Replace CreateExternalCollection implementation with no-op stub
- Regenerate mock_mixcoord and mock_mixcoord_client via make
  generate-mockery
- Add unit tests for properties, proxy forwarding and no-op stub

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
…ions

Implement the data mapping pipeline that converts external data files
(Parquet) into Milvus segments with proper column mapping.

Key changes:
- Pre-allocate segment IDs in DataCoord, pass to DataNode for direct
  final-path manifest writes (eliminating two-phase ID workflow)
- Add FFI bridges for file exploration (ExploreFiles, GetFileInfo) and
  manifest creation (CreateManifestForSegment, ReadFragmentsFromManifest)
- Implement fragment-to-segment balancing with configurable target rows
- Add ExternalSpec parser for external data format configuration
- Extend UpdateExternalCollectionRequest proto with schema, storage
  config, and pre-allocated segment ID fields
- Add E2E test for external collection refresh with data verification

Signed-off-by: Jiquan Long <jiquan.long@zilliz.com>
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
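
Pre-allocating segment IDs means the coordinator reserves IDs up front so the worker can write under its final identifiers without a later rename step. The sketch below illustrates only that hand-off, assuming a hypothetical contiguous-range allocator (`idAllocator`) and a hypothetical `assignSegmentIDs` helper; how DataCoord actually sizes and passes the range is not shown in this conversation.

```go
package main

import "fmt"

// idAllocator is a hypothetical monotonic ID source.
type idAllocator struct {
	next int64
}

// allocN reserves n consecutive IDs and returns the half-open range
// [start, end).
func (a *idAllocator) allocN(n int64) (start, end int64) {
	start = a.next
	a.next += n
	return start, a.next
}

// assignSegmentIDs hands one pre-allocated ID to each of groupCount
// segment groups. It fails if the reserved range is too small, so the
// caller can request a larger range instead of inventing IDs.
func assignSegmentIDs(groupCount int, start, end int64) ([]int64, error) {
	if int64(groupCount) > end-start {
		return nil, fmt.Errorf("need %d segment IDs, only %d pre-allocated", groupCount, end-start)
	}
	ids := make([]int64, groupCount)
	for i := range ids {
		ids[i] = start + int64(i)
	}
	return ids, nil
}

func main() {
	alloc := &idAllocator{next: 1000}
	start, end := alloc.allocN(4) // coordinator side: reserve a range
	ids, err := assignSegmentIDs(3, start, end) // worker side: consume it
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("segment IDs:", ids)
}
```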
@weiliu1031 force-pushed the part4_enable_data_mapping branch from 301efc0 to 88014cf on February 10, 2026 12:49
@github-actions
Contributor

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@codecov

codecov bot commented Feb 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.49%. Comparing base (0eb971b) to head (88014cf).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #47730       +/-   ##
===========================================
+ Coverage   74.12%   82.49%    +8.37%     
===========================================
  Files        1452      614      -838     
  Lines      238922    98356   -140566     
===========================================
- Hits       177091    81139    -95952     
+ Misses      53686    17165    -36521     
+ Partials     8145       52     -8093     

Components   Coverage Δ
Client       ∅ <ø> (∅)
Core         83.29% <ø> (∅)
Go           ∅ <ø> (∅)

see 2039 files with indirect coverage changes

Labels

  • area/dependency (Pull requests that update a dependency file)
  • area/internal-api
  • area/test
  • dco-passed (DCO check passed.)
  • kind/feature (Issues related to feature request from users)
  • sig/testing
  • size/XXL (Denotes a PR that changes 1000+ lines.)
