Skip to content

Conversation

@RavinduWeerakoon
Copy link

What this PR does / why we need it:
This PR introduces a verification procedure to ensure that the appropriate version of the control plane is deployed for the corresponding SDK client

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #221

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings January 10, 2026 05:15
@github-actions
Copy link

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds control plane version verification during SDK client initialization to ensure compatibility between the SDK client version and the deployed Kubeflow Trainer control plane version.

Changes:

  • Added constants for the Kubeflow namespace and version ConfigMap name
  • Implemented verify_backend() method in the KubernetesBackend to check version compatibility via a ConfigMap
  • Added base class method verify_backend() with default no-op implementation for other backends
  • Integrated verification call into TrainerClient initialization
  • Added comprehensive unit tests for version verification scenarios

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
kubeflow/trainer/constants/constants.py Added constants for Kubeflow namespace and version ConfigMap name
kubeflow/trainer/backends/base.py Added abstract verify_backend() method to RuntimeBackend base class
kubeflow/trainer/backends/kubernetes/backend.py Implemented verify_backend() to check version compatibility via ConfigMap
kubeflow/trainer/backends/kubernetes/backend_version_test.py Added comprehensive tests for version verification scenarios
kubeflow/trainer/api/trainer_client.py Integrated verify_backend() call during client initialization
kubeflow/trainer/api/trainer_client_test.py Added test to verify that verify_backend is called during initialization


self.namespace = cfg.namespace

def verify_backend(self) -> None:
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a docstring to this method explaining its purpose, parameters (though there are none), return value, and what exceptions it may raise. The docstring should document that this method logs warnings for version mismatches or missing ConfigMaps, and raises ApiException for other API errors.

Suggested change
def verify_backend(self) -> None:
def verify_backend(self) -> None:
"""Verify that the Kubeflow Trainer backend is compatible with this client.
This method reads a ConfigMap in the Kubeflow system namespace that stores
the server-side Kubeflow Trainer version and compares it with the client
library version. If the ConfigMap is missing, or if the versions do not
match, a warning is logged but no exception is raised in those cases.
Parameters:
None besides ``self``.
Returns:
None. This method is used for validation and logging side effects only.
Raises:
kubernetes.client.ApiException: If an error occurs while reading the
ConfigMap from the Kubernetes API other than a 404 Not Found error.
In particular, a 404 is handled by logging a warning and returning
normally, while all other API errors are propagated.
"""

Copilot uses AI. Check for mistakes.
Comment on lines +41 to +89
def test_verify_backend_match(backend, caplog):
# Setup mock
mock_config_map = MagicMock()
mock_config_map.data = {"version": kubeflow_trainer_api.__version__}
backend.core_api.read_namespaced_config_map.return_value = mock_config_map

with caplog.at_level(logging.WARNING):
backend.verify_backend()

# Assert no warning
assert "Kubeflow Trainer version mismatch" not in caplog.text

# Verify call arguments
backend.core_api.read_namespaced_config_map.assert_called_with(
name=constants.TRAINER_VERSION_CONFIG_MAP,
namespace=constants.KUBEFLOW_NAMESPACE
)

def test_verify_backend_mismatch(backend, caplog):
# Setup mock
mock_config_map = MagicMock()
mock_config_map.data = {"version": "0.0.0"} # Mismatch
backend.core_api.read_namespaced_config_map.return_value = mock_config_map

with caplog.at_level(logging.WARNING):
backend.verify_backend()

# Assert warning
assert "Kubeflow Trainer version mismatch" in caplog.text
assert "Server version: 0.0.0" in caplog.text

def test_verify_backend_not_found(backend, caplog):
# Setup mock to raise 404
error = client.ApiException(status=404)
backend.core_api.read_namespaced_config_map.side_effect = error

with caplog.at_level(logging.WARNING):
backend.verify_backend()

# Assert warning about not found
assert f"ConfigMap '{constants.TRAINER_VERSION_CONFIG_MAP}' not found" in caplog.text

def test_verify_backend_other_error(backend):
# Setup mock to raise 500
error = client.ApiException(status=500)
backend.core_api.read_namespaced_config_map.side_effect = error

with pytest.raises(client.ApiException):
backend.verify_backend()
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test case for when config_map.data is None. The code handles this with (config_map.data or {}).get("version"), but it's important to test this edge case explicitly to ensure the warning message correctly reports server_version as None.

Copilot uses AI. Check for mistakes.
else:
raise ValueError(f"Invalid backend config '{backend_config}'")

self.backend.verify_backend()
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the docstring to document that the init method may raise ApiException if backend verification fails (e.g., for non-404 Kubernetes API errors). This is important for API consumers to understand the potential exceptions during client initialization.

Copilot uses AI. Check for mistakes.
else:
raise ValueError(f"Invalid backend config '{backend_config}'")

self.backend.verify_backend()
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider wrapping the verify_backend call in a try-except block to handle potential ApiException errors gracefully. Currently, if verify_backend raises an ApiException (for non-404 errors), it will propagate to the caller and potentially cause TrainerClient initialization to fail unexpectedly. Consider logging errors and allowing initialization to continue, or clearly document that initialization can fail due to verification issues.

Copilot uses AI. Check for mistakes.
raise NotImplementedError()

def verify_backend(self) -> None:
"""Verify that the backend is configured correctly and compatible."""
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring is minimal. Consider expanding it to explain when subclasses should override this method (e.g., when they need to verify backend-specific compatibility), what it should do (verify compatibility and log warnings), and what exceptions it may raise.

Suggested change
"""Verify that the backend is configured correctly and compatible."""
"""Verify that the backend is configured correctly and compatible.
This method is a hook for backend implementations to perform any
backend-specific verification before running training jobs. Subclasses
should override this method when they need to ensure that:
* Required services, clusters, or environments are reachable.
* Credentials or authentication/authorization are correctly configured.
* The runtime or job configuration is supported by this backend.
Implementations should raise an appropriate exception (for example
``RuntimeError`` or a backend-specific error type) if the backend is
unusable and training cannot proceed. For non-fatal or recoverable
issues, implementations may instead log warnings or emit diagnostics
without raising.
The default implementation is a no-op and does not perform any checks
or raise exceptions.
"""

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,89 @@

Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the leading blank line at the start of the file. Python files should start with the copyright header or code, not with a blank line.

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines +75 to +81
def test_verify_backend_called():
with patch("kubeflow.trainer.api.trainer_client.KubernetesBackend") as MockBackend:
mock_instance = MockBackend.return_value

TrainerClient()

mock_instance.verify_backend.assert_called_once()
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test only verifies that verify_backend is called, but doesn't ensure that the existing test_backend_selection tests will pass. The test_backend_selection test cases that use Kubernetes mocks (lines 53-68) will now call verify_backend() during TrainerClient initialization, which attempts to read a ConfigMap. Those tests need to mock read_namespaced_config_map on the CoreV1Api to prevent test failures.

Copilot uses AI. Check for mistakes.
f"Client version: {client_version}, "
f"Server version: {server_version}. "
"Some features might not work as expected."
)
Copy link

Copilot AI Jan 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider logging an info-level message when versions match successfully, to provide positive feedback that the verification was performed and succeeded. This would help with debugging and tracing client initialization.

Suggested change
)
)
else:
logger.info(
"Kubeflow Trainer client and server versions match: %s. "
"Version compatibility verification succeeded.",
client_version,
)

Copilot uses AI. Check for mistakes.
Copy link

@sameerdattav sameerdattav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested changes

Namespace handling

  • Currently hard-coded to "kubeflow".
  • Should use the KUBEFLOW_SYSTEM_NAMESPACE env var (default "kubeflow").

ConfigMap data key

  • Uses data["version"].
  • Should use data["kubeflow_trainer_api_version"].

Local version source

  • Uses kubeflow_trainer_api.__version__.
  • Should use importlib.metadata.version("kubeflow-trainer-api").

Mismatch / missing ConfigMap behavior

  • Currently logs warnings and continues.
  • Requirement is to fail fast:
    • Version mismatch → RuntimeError with a clear message.
    • Missing/unreadable ConfigMap → RuntimeError (do not proceed).

Opt-out support

  • Missing KUBEFLOW_TRAINER_SKIP_VERSION_CHECK.
  • If set, verification should be skipped entirely.

Where the check runs

  • Currently called from TrainerClient.__init__.
  • Requirement is to run it in KubernetesBackend.__init__ (after CoreV1Api is constructed), so it also applies when the backend is used directly.

Tests

  • Tests currently assert warning-only behavior; they should be updated to expect RuntimeError on mismatch and missing ConfigMap.
  • test_backend_selection will need to mock read_namespaced_config_map since verification now runs during init.
  • It would also be good to add a test for config_map.data is None and for KUBEFLOW_TRAINER_SKIP_VERSION_CHECK.

@google-oss-prow
Copy link

@sameerdattav: changing LGTM is restricted to collaborators

Details

In response to this:

Requested changes

Namespace handling

  • Currently hard-coded to "kubeflow".
  • Should use the KUBEFLOW_SYSTEM_NAMESPACE env var (default "kubeflow").

ConfigMap data key

  • Uses data["version"].
  • Should use data["kubeflow_trainer_api_version"].

Local version source

  • Uses kubeflow_trainer_api.__version__.
  • Should use importlib.metadata.version("kubeflow-trainer-api").

Mismatch / missing ConfigMap behavior

  • Currently logs warnings and continues.
  • Requirement is to fail fast:
  • Version mismatch → RuntimeError with a clear message.
  • Missing/unreadable ConfigMap → RuntimeError (do not proceed).

Opt-out support

  • Missing KUBEFLOW_TRAINER_SKIP_VERSION_CHECK.
  • If set, verification should be skipped entirely.

Where the check runs

  • Currently called from TrainerClient.__init__.
  • Requirement is to run it in KubernetesBackend.__init__ (after CoreV1Api is constructed), so it also applies when the backend is used directly.

Tests

  • Tests currently assert warning-only behavior; they should be updated to expect RuntimeError on mismatch and missing ConfigMap.
  • test_backend_selection will need to mock read_namespaced_config_map since verification now runs during init.
  • It would also be good to add a test for config_map.data is None and for KUBEFLOW_TRAINER_SKIP_VERSION_CHECK.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensure Control Plane Availability During Kubeflow SDK Client Creation

2 participants