Skip to content

feat(gateway/data_plane): support gateway bound to multiple data_planes#2350

Draft
wklken wants to merge 3 commits intoTencentBlueKing:masterfrom
wklken:ft_support_multiple_data_plane
Draft

feat(gateway/data_plane): support gateway bound to multiple data_planes#2350
wklken wants to merge 3 commits intoTencentBlueKing:masterfrom
wklken:ft_support_multiple_data_plane

Conversation

@wklken
Copy link
Collaborator

@wklken wklken commented Jan 10, 2026

Description

this PR is processed by cursor(write the code), claude code(review), and me(write the details, review)

the task:

context:
- working dir is: `~/blueking-apigateway/src/dashboard`
- the django project root is: `~/blueking-apigateway/src/dashboard/apigateway`
- into the working dir: `cd ~/blueking-apigateway/src/dashboard && source .venv/bin/activate`

the target:
- a gateway can be bond to multiple data_plane, at least on data_plane
- each release/revoke, should applied to all bonded data_planes

plan:
1. python manage.py startapp data_plane
2. the data_plane models:
   - data_plane, id(auto pk), name(32 chars), description(512 chars), etcd_configs(json_field), bk_api_url_tmpl(512 chars), status(int, which DataPlaneStatusEnum), is_recommend(bool)
   - gateway_data_plane_binding, gateway(foreign key to Gateway in core/models.py), data_plane(foreign key to data_plane), one gateway can be bond to multiple data_planes; delete the gateway or data_plane should delete the binding record to (CASCAD?)
   - consider to make the etcd_configs encrypted, read the private_key, [todo]
   - add migrations files for thie app
3. bin/on_migrate: add or update default data_plane init after database migrated
   - name is `default`
   - description is `the default data_plane`
   - etcd_configs init from settings.ETCD_CONFIG, please read apigateway/conf/default.py
   - bk_api_url_tmpl is settings.BK_API_URL_TMPL, please read apigateway/conf/default.py
   - status is active
   - is_recommend is true [if create, set to true, if update, don't update this field]
4. add a django command to binding all the gateways to default data_plane
   - loop all the gateways
   - if gateway is not bond any data_plane, bind it to `default` data_plane
   - if gateway is already bond, do nothing
   - add this command into `bin/on_migrate`
5. apigateway/utils/etcd.py refactor
   - the get_etcd_client change to `new_etcd_client`? with the detail settings
6. apigateway/controller/distributor.py
   - create etcd client for each data_plane
   - read the etcd_config of data_plane and new_etcd_client
7. the Release should add data_plane(please think about it)
8. release_gateway_by_registry and other function in controller/tasks/release.py
   - create multiple release for each data_plane of gateway, if the data_plane status is active
   - for loop to do the release or revoke for each data_plane
   - each release has it's own release history, so the biz/gateway/releaser.py should make history and release for each data_plane of gateway?
9. check the api of release/release history, it should return the data_plane, and fill the `data_plane` object with (id,name,description) for display
   - api/web
   - api/v2/open
10. the api/open sync gateways and the api/v2/open sync gateways
   - if the gateway already exists, do nothing to data_plane
   - else, use the recommend data_plane as the gateway's dataplane
11. when revoke the gateway, revoke all data_plane
12. when delete the gateway, delete the binding data
13. don't add unittests now!
14. add the django admin for data_plane, could show and filter by the data_plane and gateway

Checklist

  • 填写 PR 描述及相关 issue (write PR description and related issue)
  • 代码风格检查通过 (code style check passed)
  • PR 中包含单元测试 (include unit test)
  • 单元测试通过 (unit test passed)
  • 本地开发联调环境验证通过 (local development environment verification passed)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements support for gateways to be bound to multiple data planes, enabling better multi-environment and multi-cluster support for the API Gateway system.

Changes:

  • Introduces new data_plane app with DataPlane and GatewayDataPlaneBinding models for managing data plane configurations
  • Updates release and publishing logic to distribute gateway configurations to multiple data planes simultaneously
  • Adds management commands to initialize default data plane and bind existing gateways

Reviewed changes

Copilot reviewed 28 out of 33 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
src/dashboard/apigateway/apigateway/apps/data_plane/* New app implementing data plane models, managers, admin, and management commands
src/dashboard/apigateway/apigateway/core/models.py Added data_plane foreign key to ReleaseHistory model
src/dashboard/apigateway/apigateway/utils/crypto.py Extracted BkCrypto class for reuse
src/dashboard/apigateway/apigateway/utils/etcd.py Added new_etcd_client() for per-data-plane ETCD configurations
src/dashboard/apigateway/apigateway/biz/gateway/saver.py Updated to bind new gateways to data planes
src/dashboard/apigateway/apigateway/biz/gateway/releaser.py Updated to release to all active data planes
src/dashboard/apigateway/apigateway/controller/tasks/* Updated task signatures to accept data_plane_id parameter
src/dashboard/apigateway/apigateway/controller/publisher/publish.py Updated to iterate over data planes for publishing
src/dashboard/apigateway/apigateway/controller/distributor/etcd.py Updated to use per-data-plane ETCD clients
src/dashboard/apigateway/apigateway/apis/*/serializers.py Added data_planes field to gateway sync APIs
src/dashboard/bin/on_migrate Added commands to initialize default data plane

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


def get_queryset(self):
return ReleaseHistory.objects.filter(gateway=self.request.gateway)
return ReleaseHistory.objects.filter(gateway=self.request.gateway).select_related("data_plane")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we don't need the data_plane?

first_history = history
if deploy_history:
deploy_history.publish_id = history.id
deploy_history.save()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: the programmable gateway should be only bound to one data_plane?
here only save the deploy_history at the first_history


# record audit log only for the first data plane to avoid duplicates
if history == first_history:
Auditor.record_release_op_success(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: only one audit record?

)

def release(self):
def release(self) -> ReleaseHistory:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXME: the function should return multiple ReleaseHistory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO

@wklken
Copy link
Collaborator Author

wklken commented Jan 11, 2026

Code review

No critical issues found (confidence threshold: 80+). Checked for bugs and CLAUDE.md compliance.

Review Summary:

  • Analyzed 32 changed files with 5 specialized review agents
  • Identified 28 potential issues across different categories
  • 5 issues scored 75 (high confidence, see local review file for details)
  • 0 issues scored 80+ (threshold for automated reporting)

A detailed review report with all findings has been saved locally to 2350.md for reference.

Highest-confidence findings (75):

  • Missing error handling in Celery tasks for DataPlane.objects.get()
  • Missing atomic transactions for gateway-data plane binding
  • No rollback mechanism for partial multi-data-plane releases
  • Breaking API change in GatewayResourceDistributor constructor
  • Missing test coverage for multi-data-plane scenarios

These issues may warrant attention before moving out of draft status, but they didn't reach the certainty threshold (80+) for automated reporting.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@wklken
Copy link
Collaborator Author

wklken commented Jan 11, 2026

Code Review - Detailed Findings

Comprehensive review of PR #2350 analyzed 32 changed files (1100 additions, 138 deletions).


Critical Issues (Confidence: 75)

1. Missing Error Handling in Celery Tasks

Issue: Celery tasks call DataPlane.objects.get(id=data_plane_id) without exception handling. If a data plane is deleted between task queueing and execution, the task will crash with DoesNotExist exception.

Files:

Recommendation: Use .filter().first() with fallback or add explicit exception handling.


2. Missing Atomic Transaction for Gateway-Data Plane Binding

Issue: _bind_to_data_planes() is called after gateway.save() without an atomic transaction wrapper. If binding fails, the gateway will exist without any data plane binding, causing releases to fail later.

File:

def _get_gateway(self, gateway_id: Optional[int]) -> Optional[Gateway]:
if gateway_id:
return Gateway.objects.get(id=gateway_id)
return None
def save(self) -> Gateway:
# 网关为 None,则新建网关;非 None,则更新网关
if not self._gateway:
self._create_gateway()
else:
self._update_gateway()
assert self._gateway
return self._gateway
def _create_gateway(self):
# 1. save gateway
self._gateway = gateway = Gateway(
name=self._gateway_data.name,
description=self._gateway_data.description,
description_en=self._gateway_data.description_en,
maintainers=self._gateway_data.maintainers,
status=self._gateway_data.status,
is_public=self._gateway_data.is_public,
tenant_mode=self._gateway_data.tenant_mode,
tenant_id=self._gateway_data.tenant_id,
created_by=self.username,
updated_by=self.username,
)
gateway.save()
# 2. save related data
GatewayHandler.save_related_data(
gateway=gateway,
user_auth_type=settings.DEFAULT_USER_AUTH_TYPE,
username=self.username,
related_app_code=self.bk_app_code,
user_config=self._gateway_data.user_config,
unfiltered_sensitive_keys=self._get_gateway_unfiltered_sensitive_keys(gateway.name),
api_type=self._gateway_data.gateway_type,
allow_auth_from_params=self._gateway_data.allow_auth_from_params,
allow_delete_sensitive_params=self._gateway_data.allow_delete_sensitive_params,
source=self._source,
)
# 3. bind to data plane(s)
self._bind_to_data_planes(gateway)
def _bind_to_data_planes(self, gateway: Gateway):
"""Bind newly created gateway to data plane(s)"""
data_planes_to_bind: List[DataPlane] = []
# Priority 1: data_plane_names provided (from open API sync)
if self._data_plane_names:
for name in self._data_plane_names:
data_plane = DataPlane.objects.filter(name=name).first()
if data_plane:
data_planes_to_bind.append(data_plane)
else:
logger.warning("Data plane with name '%s' not found, skipping", name)
# Priority 2: data_plane_id provided (from web API)
elif self._data_plane_id:
data_plane = DataPlane.objects.filter(id=self._data_plane_id).first()
if data_plane:
data_planes_to_bind.append(data_plane)
else:
logger.warning("Data plane with id '%s' not found", self._data_plane_id)
# Fallback: use the 'default' data plane
if not data_planes_to_bind:
default_data_plane = DataPlane.objects.get_default()
if default_data_plane:
data_planes_to_bind.append(default_data_plane)
else:
logger.error(
"No data planes to bind for gateway '%s' and no '%s' data plane found",
gateway.name,

Recommendation: Wrap gateway creation and data plane binding in @transaction.atomic() decorator.


3. No Rollback Mechanism for Partial Data Plane Releases

Issue: The release() method creates ReleaseHistory and triggers releases for each data plane in a loop without transaction wrapping. If release fails for data_plane 2 of 3, the first data plane will have successfully released but there's no rollback, leaving the system in an inconsistent state.

File: https://github.com/TencentBlueKing/blueking-apigateway/blob/436cfe98f25a4fe9266caa70a7190906e52fabdd/src/dashboard/apigateway/apigateway/biz/gateway/releaser.py#L962-L1004

Recommendation: Implement a rollback mechanism or compensating transactions for failed releases.


4. Breaking API Change in GatewayResourceDistributor

Issue: The constructor signature changed from __init__(self, release: Release) to __init__(self, release: Release, data_plane: DataPlane), making data_plane a required parameter. Any existing subclasses or external code calling GatewayResourceDistributor will break.

File:

Recommendation: Consider making data_plane optional with a default value for backward compatibility.


5. Missing Test Coverage for Multi-Data-Plane Scenarios

Issue: The PR introduces multi-data-plane support but doesn't appear to include tests for edge cases like:

  • Releasing to multiple data planes when one fails
  • Handling data plane deletion during release
  • Gateway with no data plane bindings
  • Mixed active/inactive data planes

Recommendation: Add comprehensive test coverage for multi-data-plane scenarios.


Moderate Issues (Confidence: 50-70)

6. Audit Log Only Recorded for First Data Plane (Confidence: 50)

Only the first release is audited while others are invisible to audit logs, breaking audit trail completeness.

https://github.com/TencentBlueKing/blueking-apigateway/blob/436cfe98f25a4fe9266caa70a7190906e52fabdd/src/dashboard/apigateway/apigateway/biz/gateway/releaser.py#L988-L997


7. Bare Exception Handling in Crypto Operations (Confidence: 50)

The etcd_configs property catches all exceptions and returns an empty dict, which could silently hide critical errors.

https://github.com/TencentBlueKing/blueking-apigateway/blob/436cfe98f25a4fe9266caa70a7190906e52fabdd/src/dashboard/apigateway/apigateway/apps/data_plane/models.py#L821-L823


8. Missing Error History for Validation Failures (Confidence: 70)

If validation fails and there are no active data planes, no ReleaseHistory is created for error tracking.

https://github.com/TencentBlueKing/blueking-apigateway/blob/436cfe98f25a4fe9266caa70a7190906e52fabdd/src/dashboard/apigateway/apigateway/biz/gateway/releaser.py#L1016-L1018


9. Incomplete Error Logging for Pre-Release Validation (Confidence: 60)

If validation fails and there are no data planes, error handling skips creating release history.

https://github.com/TencentBlueKing/blueking-apigateway/blob/436cfe98f25a4fe9266caa70a7190906e52fabdd/src/dashboard/apigateway/apigateway/biz/gateway/releaser.py#L964


10. N+1 Query in Data Plane Binding Loop (Confidence: 50)

Individual queries for each data_plane_name instead of bulk fetch.

https://github.com/TencentBlueKing/blueking-apigateway/blob/436cfe98f25a4fe9266caa70a7190906e52fabdd/src/dashboard/apigateway/apigateway/biz/gateway/saver.py#L1125-L1130


Summary

Total Issues: 28 analyzed

  • 5 high-confidence issues (75) - Should be addressed before production
  • 13 moderate-confidence issues (50-74) - Consider reviewing
  • 10 low-confidence issues (0-49) - Likely false positives

Full detailed report with all 28 issues available in local file 2350.md.

🤖 Generated with Claude Code

@wklken
Copy link
Collaborator Author

wklken commented Jan 26, 2026

Important

  1. Silent ignore of unknown data plane names
    In apis/open/gateway/views.py and apis/v2/sync/views.py, data_planes names are translated to IDs by .filter(...).first() and silently skipped when missing. The serializer accepts arbitrary names without validation.
    Question: Should the API return a validation error when a provided name doesn’t exist to avoid accidentally binding to the default data plane?
  2. Gateway can end up unbound after create
    In biz/gateway/saver.py, if data_plane_ids are provided but none resolve, it falls back to the default. If the default doesn’t exist, it logs and returns without binding. Later, publish flows now raise when no active data plane exists.
    Question: Should gateway creation fail fast (or rollback) when no data plane can be resolved, instead of creating an unbound gateway?
  3. ETCD config decrypt failure leads to empty client config
    apps/data_plane/models.py returns {} on decrypt error; utils/etcd.new_etcd_client() then calls client(**etcd_config). If required keys are missing, this may raise or connect unexpectedly.
    Question: Would it be safer to raise a clear error (or mark the data plane invalid) rather than returning {} and trying to build a client?
  4. Task signature compatibility / in-flight tasks
    controller/tasks/release.py and controller/tasks/syncing.py now require data_plane_id.
    Question: Are there any existing scheduled/queued Celery tasks that still call the old signature? If so, should there be a backward-compatible default or guard?
  5. Data plane mismatch not validated in tasks
    release_gateway_by_registry() fetches ReleaseHistory by publish_id and independently loads data_plane_id; it doesn’t verify they match.
    Question: Should the task verify release_history.data_plane_id == data_plane_id to prevent mismatches or bad input?

Issues by Priority
🔴 Critical: Must Fix Before Merging

  1. Silent failure when no data planes are bound (publish.py)
    In _trigger_rolling_update, _trigger_revoke_disable, and _trigger_revoke_deleting, if no data planes are found, the code logs an error and continues with continue:
    src__dashboard__apigateway__apigateway__controller__publisher__publish.py.diff
    Lines 38-41
  •     if not data_planes:+             logger.error(+                 "Gateway (id=%s) has no active data planes, cannot trigger rolling update",+                 release.gateway_id,
    

Could this lead to silent failures where users think a release succeeded but nothing was actually published? Would it be better to fail explicitly or at least return a failure status that surfaces to the API response?
🟡 Important: Should Address
2. BkCrypto exception handling swallows errors silently (models.py)
src__dashboard__apigateway__apigateway__apps__data_plane__models.py.diff
Lines 86-88

  •    except Exception:+            logger.exception("Failed to decrypt etcd_configs for data_plane %s", self.name)+            return {}
    

When decryption fails, an empty dict is returned. Could this cause downstream ETCD connection failures that are hard to debug? Would it be clearer to either raise an exception or return a sentinel value that explicitly indicates a decryption failure?
3. Multiple database queries in loop (saver.py)
src__dashboard__apigateway__apigateway__biz__gateway__saver.py.diff
Lines 58-63

  •     if self._data_plane_ids:+             found_ids = set()+             for data_plane_id in self._data_plane_ids:+                 data_plane = DataPlane.objects.filter(id=data_plane_id).first()+                 if data_plane:+                     data_planes_to_bind.append(data_plane)
    

This issues one query per data_plane_id. Would it make sense to batch this into a single query: DataPlane.objects.filter(id__in=self._data_plane_ids)?
4. Duplicated data_plane_names to IDs conversion logic (views.py)
The same conversion logic appears in both apis/open/gateway/views.py and apis/v2/sync/views.py:
src__dashboard__apigateway__apigateway__apis__open__gateway__views.py.diff
Lines 12-19

  •     data_plane_ids = None+         data_plane_names = data.get("data_planes")+         if data_plane_names:+             data_plane_ids = []+             for name in data_plane_names:+                 data_plane = DataPlane.objects.filter(name=name).first()+                 if data_plane:+                     data_plane_ids.append(data_plane.id)
    

Could this be extracted to a helper function to reduce duplication and ensure consistent behavior?
5. Potential race condition on first_history assignment (releaser.py)
src__dashboard__apigateway__apigateway__biz__gateway__releaser.py.diff
Lines 75-95

  •     first_history = None+         for data_plane in data_planes:+             history = self._save_release_history(data_plane=data_plane)+ +             if first_history is None:+                 first_history = history+                 if deploy_history:+                     deploy_history.publish_id = history.id+                     deploy_history.save()
    

The logic assumes data_planes is non-empty (checked earlier), but the redundant check at line 94-96 suggests uncertainty. Is this defensive coding, or does it indicate a possible race condition where data_planes could become empty between the check and the loop?
6. No transaction wrapper for multi-data-plane release (releaser.py)
When releasing to multiple data planes, if one fails mid-way, some data planes will have release history records while others won't. Would wrapping the entire operation in a database transaction be appropriate, or is partial success acceptable here?
🟢 Suggestions: Nice to Have
7. Consider adding indexes on frequently queried fields
The GatewayDataPlaneBinding model filters by gateway_id frequently. While Django adds indexes for foreign keys automatically, would a composite index on (gateway_id, data_plane__status) improve query performance for get_gateway_active_data_planes?
8. ETCD client lifecycle management (etcd.py)
src__dashboard__apigateway__apigateway__utils__etcd.py.diff
Lines 31-32
+def new_etcd_client(etcd_config: Dict[str, Any]) -> etcd3.Etcd3Client:...+ return client(**etcd_config)
The new new_etcd_client function creates a fresh client each time. Is there any connection pooling or reuse strategy that should be considered for performance?
9. Return value changes in trigger functions (publish.py)
The _trigger_rolling_update, _trigger_revoke_disable, and _trigger_revoke_deleting functions now return None instead of True/result, with the callers returning True. Would it be cleaner to have consistent return values across these functions?
10. Logging improvement for multi-data-plane operations
When a gateway releases to multiple data planes, the logs could be clearer about which operations succeeded vs failed. Would structured logging with aggregated results help with debugging production issues?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant