-
Notifications
You must be signed in to change notification settings - Fork 82
feat: Ensure Control Plane Availability During Kubeflow SDK Client Creation #222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds control plane version verification during SDK client initialization to ensure compatibility between the SDK client version and the deployed Kubeflow Trainer control plane version.
Changes:
- Added constants for the Kubeflow namespace and version ConfigMap name
- Implemented
verify_backend()method in the KubernetesBackend to check version compatibility via a ConfigMap - Added base class method
verify_backend()with default no-op implementation for other backends - Integrated verification call into TrainerClient initialization
- Added comprehensive unit tests for version verification scenarios
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| kubeflow/trainer/constants/constants.py | Added constants for Kubeflow namespace and version ConfigMap name |
| kubeflow/trainer/backends/base.py | Added abstract verify_backend() method to RuntimeBackend base class |
| kubeflow/trainer/backends/kubernetes/backend.py | Implemented verify_backend() to check version compatibility via ConfigMap |
| kubeflow/trainer/backends/kubernetes/backend_version_test.py | Added comprehensive tests for version verification scenarios |
| kubeflow/trainer/api/trainer_client.py | Integrated verify_backend() call during client initialization |
| kubeflow/trainer/api/trainer_client_test.py | Added test to verify that verify_backend is called during initialization |
|
|
||
| self.namespace = cfg.namespace | ||
|
|
||
| def verify_backend(self) -> None: |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a docstring to this method explaining its purpose, parameters (though there are none), return value, and what exceptions it may raise. The docstring should document that this method logs warnings for version mismatches or missing ConfigMaps, and raises ApiException for other API errors.
| def verify_backend(self) -> None: | |
| def verify_backend(self) -> None: | |
| """Verify that the Kubeflow Trainer backend is compatible with this client. | |
| This method reads a ConfigMap in the Kubeflow system namespace that stores | |
| the server-side Kubeflow Trainer version and compares it with the client | |
| library version. If the ConfigMap is missing, or if the versions do not | |
| match, a warning is logged but no exception is raised in those cases. | |
| Parameters: | |
| None besides ``self``. | |
| Returns: | |
| None. This method is used for validation and logging side effects only. | |
| Raises: | |
| kubernetes.client.ApiException: If an error occurs while reading the | |
| ConfigMap from the Kubernetes API other than a 404 Not Found error. | |
| In particular, a 404 is handled by logging a warning and returning | |
| normally, while all other API errors are propagated. | |
| """ |
| def test_verify_backend_match(backend, caplog): | ||
| # Setup mock | ||
| mock_config_map = MagicMock() | ||
| mock_config_map.data = {"version": kubeflow_trainer_api.__version__} | ||
| backend.core_api.read_namespaced_config_map.return_value = mock_config_map | ||
|
|
||
| with caplog.at_level(logging.WARNING): | ||
| backend.verify_backend() | ||
|
|
||
| # Assert no warning | ||
| assert "Kubeflow Trainer version mismatch" not in caplog.text | ||
|
|
||
| # Verify call arguments | ||
| backend.core_api.read_namespaced_config_map.assert_called_with( | ||
| name=constants.TRAINER_VERSION_CONFIG_MAP, | ||
| namespace=constants.KUBEFLOW_NAMESPACE | ||
| ) | ||
|
|
||
| def test_verify_backend_mismatch(backend, caplog): | ||
| # Setup mock | ||
| mock_config_map = MagicMock() | ||
| mock_config_map.data = {"version": "0.0.0"} # Mismatch | ||
| backend.core_api.read_namespaced_config_map.return_value = mock_config_map | ||
|
|
||
| with caplog.at_level(logging.WARNING): | ||
| backend.verify_backend() | ||
|
|
||
| # Assert warning | ||
| assert "Kubeflow Trainer version mismatch" in caplog.text | ||
| assert "Server version: 0.0.0" in caplog.text | ||
|
|
||
| def test_verify_backend_not_found(backend, caplog): | ||
| # Setup mock to raise 404 | ||
| error = client.ApiException(status=404) | ||
| backend.core_api.read_namespaced_config_map.side_effect = error | ||
|
|
||
| with caplog.at_level(logging.WARNING): | ||
| backend.verify_backend() | ||
|
|
||
| # Assert warning about not found | ||
| assert f"ConfigMap '{constants.TRAINER_VERSION_CONFIG_MAP}' not found" in caplog.text | ||
|
|
||
| def test_verify_backend_other_error(backend): | ||
| # Setup mock to raise 500 | ||
| error = client.ApiException(status=500) | ||
| backend.core_api.read_namespaced_config_map.side_effect = error | ||
|
|
||
| with pytest.raises(client.ApiException): | ||
| backend.verify_backend() |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add test case for when config_map.data is None. The code handles this with (config_map.data or {}).get("version"), but it's important to test this edge case explicitly to ensure the warning message correctly reports server_version as None.
| else: | ||
| raise ValueError(f"Invalid backend config '{backend_config}'") | ||
|
|
||
| self.backend.verify_backend() |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the docstring to document that the init method may raise ApiException if backend verification fails (e.g., for non-404 Kubernetes API errors). This is important for API consumers to understand the potential exceptions during client initialization.
| else: | ||
| raise ValueError(f"Invalid backend config '{backend_config}'") | ||
|
|
||
| self.backend.verify_backend() |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider wrapping the verify_backend call in a try-except block to handle potential ApiException errors gracefully. Currently, if verify_backend raises an ApiException (for non-404 errors), it will propagate to the caller and potentially cause TrainerClient initialization to fail unexpectedly. Consider logging errors and allowing initialization to continue, or clearly document that initialization can fail due to verification issues.
| raise NotImplementedError() | ||
|
|
||
| def verify_backend(self) -> None: | ||
| """Verify that the backend is configured correctly and compatible.""" |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring is minimal. Consider expanding it to explain when subclasses should override this method (e.g., when they need to verify backend-specific compatibility), what it should do (verify compatibility and log warnings), and what exceptions it may raise.
| """Verify that the backend is configured correctly and compatible.""" | |
| """Verify that the backend is configured correctly and compatible. | |
| This method is a hook for backend implementations to perform any | |
| backend-specific verification before running training jobs. Subclasses | |
| should override this method when they need to ensure that: | |
| * Required services, clusters, or environments are reachable. | |
| * Credentials or authentication/authorization are correctly configured. | |
| * The runtime or job configuration is supported by this backend. | |
| Implementations should raise an appropriate exception (for example | |
| ``RuntimeError`` or a backend-specific error type) if the backend is | |
| unusable and training cannot proceed. For non-fatal or recoverable | |
| issues, implementations may instead log warnings or emit diagnostics | |
| without raising. | |
| The default implementation is a no-op and does not perform any checks | |
| or raise exceptions. | |
| """ |
| @@ -0,0 +1,89 @@ | |||
|
|
|||
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the leading blank line at the start of the file. Python files should start with the copyright header or code, not with a blank line.
| def test_verify_backend_called(): | ||
| with patch("kubeflow.trainer.api.trainer_client.KubernetesBackend") as MockBackend: | ||
| mock_instance = MockBackend.return_value | ||
|
|
||
| TrainerClient() | ||
|
|
||
| mock_instance.verify_backend.assert_called_once() |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test only verifies that verify_backend is called, but doesn't ensure that the existing test_backend_selection tests will pass. The test_backend_selection test cases that use Kubernetes mocks (lines 53-68) will now call verify_backend() during TrainerClient initialization, which attempts to read a ConfigMap. Those tests need to mock read_namespaced_config_map on the CoreV1Api to prevent test failures.
| f"Client version: {client_version}, " | ||
| f"Server version: {server_version}. " | ||
| "Some features might not work as expected." | ||
| ) |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider logging an info-level message when versions match successfully, to provide positive feedback that the verification was performed and succeeded. This would help with debugging and tracing client initialization.
| ) | |
| ) | |
| else: | |
| logger.info( | |
| "Kubeflow Trainer client and server versions match: %s. " | |
| "Version compatibility verification succeeded.", | |
| client_version, | |
| ) |
sameerdattav
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested changes
Namespace handling
- Currently hard-coded to
"kubeflow". - Should use the
KUBEFLOW_SYSTEM_NAMESPACEenv var (default"kubeflow").
ConfigMap data key
- Uses
data["version"]. - Should use
data["kubeflow_trainer_api_version"].
Local version source
- Uses
kubeflow_trainer_api.__version__. - Should use
importlib.metadata.version("kubeflow-trainer-api").
Mismatch / missing ConfigMap behavior
- Currently logs warnings and continues.
- Requirement is to fail fast:
- Version mismatch →
RuntimeErrorwith a clear message. - Missing/unreadable ConfigMap →
RuntimeError(do not proceed).
- Version mismatch →
Opt-out support
- Missing
KUBEFLOW_TRAINER_SKIP_VERSION_CHECK. - If set, verification should be skipped entirely.
Where the check runs
- Currently called from
TrainerClient.__init__. - Requirement is to run it in
KubernetesBackend.__init__(afterCoreV1Apiis constructed), so it also applies when the backend is used directly.
Tests
- Tests currently assert warning-only behavior; they should be updated to expect
RuntimeErroron mismatch and missing ConfigMap. test_backend_selectionwill need to mockread_namespaced_config_mapsince verification now runs during init.- It would also be good to add a test for
config_map.data is Noneand forKUBEFLOW_TRAINER_SKIP_VERSION_CHECK.
|
@sameerdattav: changing LGTM is restricted to collaborators DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What this PR does / why we need it:
This PR introduces a verification procedure to ensure that the appropriate version of the control plane is deployed for the corresponding SDK client
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...format, will close the issue(s) when PR gets merged):Fixes #221
Checklist: