Skip to content

Conversation

@sameerdattav
Copy link

This PR adds a non-blocking control-plane version metadata check to the Kubernetes backend.

The backend now verifies that the public Trainer ConfigMap exists and contains a kubeflow_trainer_api_version field, logging warnings when metadata is unavailable but never failing client initialization. Focused unit tests cover the expected behaviors.

Supersedes #223

Fixes #221

Copilot AI review requested due to automatic review settings January 30, 2026 12:19
@github-actions
Copy link

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a non-blocking control-plane version metadata check for the Kubeflow Trainer Kubernetes backend. The check verifies that the kubeflow-trainer-public ConfigMap exists and contains a kubeflow_trainer_api_version field, logging warnings when metadata is unavailable but never failing client initialization. This represents a more lenient approach compared to the superseded PR #223, which enforced strict version compatibility.

Changes:

  • Added verify_backend() method to check for control-plane version metadata availability
  • Added comprehensive unit tests covering success and failure scenarios
  • Imported os module for environment variable access

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
kubeflow/trainer/backends/kubernetes/backend.py Implements non-blocking version check via verify_backend() method called during initialization; adds os import for namespace configuration
kubeflow/trainer/backends/kubernetes/backend_test.py Adds test helper function and five test cases covering success, missing ConfigMap, missing data key, None data, and dev version scenarios; includes minor docstring formatting improvement

@astefanutti
Copy link
Contributor

/ok-to-test

@andreyvelich
Copy link
Member

/retest

"""Verify that the Trainer control plane exposes version metadata.

This check only ensures that the public control-plane ConfigMap exists
and contains a ``kubeflow_trainer_api_version`` field. It does not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enforce version compatibility and never raises.
"""

system_namespace = os.getenv("KUBEFLOW_SYSTEM_NAMESPACE", "kubeflow")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default system namespace is:

Suggested change
system_namespace = os.getenv("KUBEFLOW_SYSTEM_NAMESPACE", "kubeflow")
system_namespace = os.getenv("KUBEFLOW_SYSTEM_NAMESPACE", "kubeflow-system")

Comment on lines 79 to 106
except Exception as e:
logger.warning(
"Trainer control-plane version info is not available: "
f"failed to read ConfigMap '{config_map_name}' in namespace "
f"'{system_namespace}', error: {e!r}."
)
return

data = getattr(config_map, "data", None)

if not isinstance(data, dict):
logger.warning(
"Trainer control-plane version info is not available: "
f"ConfigMap '{config_map_name}' in namespace "
f"'{system_namespace}' has no data dictionary."
)
return

server_version = data.get("kubeflow_trainer_api_version")

if not server_version:
logger.warning(
"Trainer control-plane version info is not available: "
f"ConfigMap '{config_map_name}' in namespace "
f"'{system_namespace}' is missing the "
"'kubeflow_trainer_api_version' data key."
)
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just put everything under the same try: condition, and log warning if something fails.
Additionally, you can log the Exeception in the message.
You should expect that configMap has the correct data there.

Comment on lines 110 to 112
if server_version == "dev":
return

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need this

Suggested change
if server_version == "dev":
return

)


def test_version_check_success(caplog):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refactor this tests to use TestCases like described here: https://github.com/kubeflow/sdk/blob/aab449ac829ac02c804758e6d72f0b2652a2d783/AGENTS.md

@coveralls
Copy link

coveralls commented Feb 3, 2026

Pull Request Test Coverage Report for Build 21873621460

Details

  • 53 of 53 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.4%) to 67.958%

Totals Coverage Status
Change from base Build 21826703132: 0.4%
Covered Lines: 2825
Relevant Lines: 4157

💛 - Coveralls

@sameerdattav
Copy link
Author

@andreyvelich , Thanks for the review, addressed all your comments
Kindly check again.


return [record.getMessage() for record in handler.records]

def test_version_check_success(self) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor these tests with a single test_verify_backend() function and TestCases, like described here: https://github.com/kubeflow/sdk/blob/main/AGENTS.md#3-testing-requirements

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich, updated backend_test.py file, This matches the AGENTS.md pattern: a single pytest function, parametrized with the shared TestCase dataclass to cover multiple scenarios.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

},
),
TestCase(
name="missing data key logs warning",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update your test cases according to your updated implementation?
E.g. we don't check data key for now.

@sameerdattav
Copy link
Author

@andreyvelich,
Removed checks for specific ConfigMap data keys and adjusted the tests so we only warn when the ConfigMap cannot be read. Empty or missing data is now treated as a valid, non-blocking case.

@astefanutti
Copy link
Contributor

/retest

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/assign @kramaranya @astefanutti @kubeflow/kubeflow-sdk-team @Fiona-Waters

Copy link
Contributor

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sameerdattav!

/lgtm

Not sure why the docs workflow fails, it seems like a transient issue.

@andreyvelich
Copy link
Member

@sameerdattav Can you try to rebase your PR to fix the CI?

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
sameerdattav and others added 8 commits February 10, 2026 22:01
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
@sameerdattav sameerdattav force-pushed the trainer-backend-version-clean branch from 0db01b1 to a17d830 Compare February 10, 2026 16:32
@google-oss-prow google-oss-prow bot removed the lgtm label Feb 10, 2026
@google-oss-prow
Copy link

New changes are detected. LGTM label has been removed.

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
@sameerdattav
Copy link
Author

@andreyvelich @astefanutti
i rebased and ran pre-commit locally to fix formatting issues.
One new file CLAUDE.md , had CRLF line ending, hence the pre-commit changed it to LF.
The ci checks seem fine now.

@sameerdattav
Copy link
Author

@andreyvelich,
To prevent such line ending issues, there is a gitattributes fix, which is already in manifests.
I was planning to fix them across repos.
Related Slack thread : https://cloud-native.slack.com/archives/C0742LBR5BM/p1770067104325089
could you please provide your input on this?

@andreyvelich
Copy link
Member

Sure, that sounds good, k8s also has the same setting: https://github.com/kubernetes/kubernetes/blob/master/.gitattributes#L2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensure Control Plane Availability During Kubeflow SDK Client Creation

8 participants