Add KEP-936: Introduce Kubeflow-MCP proposal for AI-Powered Training … by abhijeet-dhumal · Pull Request #937 · kubeflow/community

abhijeet-dhumal · 2026-01-28T09:16:11Z

Resolves #936

…Interface Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

google-oss-prow · 2026-01-28T09:16:20Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

astefanutti · 2026-01-29T10:10:10Z

proposals/936-kubeflow-mcp-server/README.md

+
+### Trainer Selection Logic
+
+![Trainer Selection](trainer-selection.png)


I think it would be useful to link to kubeflow/trainer#2839 so that the selection logic is extensible and supports the trainers that will be added in the future.

Thanks a lot @astefanutti for reviewing 🙌
Awesome suggestion, I have added references to KEP-2839 accordingly ✅

astefanutti · 2026-01-29T10:12:02Z

proposals/936-kubeflow-mcp-server/README.md

+| **Unauthorized Access** | Policy layer enforces RBAC at tool level |
+| **Scope Creep** | Clear delegation to `kubernetes-mcp-server` for generic K8s ops |
+
+## Design Details


A section that covers the security aspects would be very valuable.

Added a dedicated "Security Considerations" section ✅

astefanutti · 2026-01-29T10:14:07Z

proposals/936-kubeflow-mcp-server/README.md

+2. **Maintenance Overhead**: MCP layer must track SDK changes
+3. **Abstraction Layer**: Natural language may hide complexity users need to understand
+
+## Alternatives


As skills are becoming popular, a dedicated alternative section could provide context on how they compare with the proposal.

Great catch 🙌
I have now added "Alternative 4: Hugging Face Skills" with a detailed comparative analysis covering architecture, protocol, integration, and execution model differences. I also noted how Skills and MCP can be complementary. Please review!

astefanutti · 2026-01-29T10:23:03Z

proposals/936-kubeflow-mcp-server/README.md

+- [KEP-2170: Kubeflow Trainer V2 API](https://github.com/kubeflow/trainer/blob/master/docs/proposals/2170-kubeflow-trainer-v2/README.md)
+
+### Related Issues
+- [#936: KEP Tracking Issue](https://github.com/kubeflow/community/issues/936) - This proposal


kubeflow/model-registry#2029 could be added here.

astefanutti · 2026-01-29T10:24:09Z

proposals/936-kubeflow-mcp-server/README.md

+
+| Tool | Purpose | Returns |
+|------|---------|---------|
+| `estimate_resources(model, peft_method)` | GPU/memory requirements | `{gpu_memory, recommended_gpus, feasible}` |


Resource estimation may need to call different tools depending on the trainer.

Added "Note on Resource Estimation" with a table showing trainer-specific estimation methods (BuiltinTrainer uses model params, CustomTrainer uses heuristics, CustomTrainerContainer defers to user). Also noted extensibility for KEP-2839 backends. ✅
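The trainer-specific estimation described in this reply could be sketched as a small dispatch table. This is only an illustration under stated assumptions: the function names, the `params_billion` and `batch_size` hints, and every constant are hypothetical and do not come from the proposal or the Kubeflow SDK. Only the three trainer names and the `{gpu_memory, recommended_gpus, feasible}` return shape are taken from the thread.

```python
import math
from typing import Callable, Dict

# Illustrative constants only; none of these values come from the proposal.
BYTES_PER_PARAM = 2.0    # assumes 16-bit weights
OVERHEAD_FACTOR = 1.2    # hypothetical allowance for activations and optimizer state
GPU_CAPACITY_GB = 80.0   # hypothetical memory per GPU
MAX_GPUS = 8             # hypothetical cluster limit used for the feasibility flag


def _from_memory(gpu_memory_gb: float) -> dict:
    """Build the {gpu_memory, recommended_gpus, feasible} shape from the tool table."""
    gpus = max(1, math.ceil(gpu_memory_gb / GPU_CAPACITY_GB))
    return {
        "gpu_memory": gpu_memory_gb,
        "recommended_gpus": gpus,
        "feasible": gpus <= MAX_GPUS,
    }


def estimate_builtin(params_billion: float = 0.0, **_: object) -> dict:
    # BuiltinTrainer: derive memory from the model's parameter count.
    # One billion parameters at 2 bytes each is roughly 2 GB.
    return _from_memory(params_billion * BYTES_PER_PARAM * OVERHEAD_FACTOR)


def estimate_custom(batch_size: int = 1, **_: object) -> dict:
    # CustomTrainer: placeholder heuristic, not a real sizing rule.
    return _from_memory(8.0 + 0.5 * batch_size)


def estimate_container(**_: object) -> dict:
    # CustomTrainerContainer: nothing can be inferred, so defer to the user.
    return {"gpu_memory": None, "recommended_gpus": None, "feasible": None}


# New trainer backends would register an estimator here.
ESTIMATORS: Dict[str, Callable[..., dict]] = {
    "BuiltinTrainer": estimate_builtin,
    "CustomTrainer": estimate_custom,
    "CustomTrainerContainer": estimate_container,
}


def estimate_resources(trainer: str, **hints: object) -> dict:
    """Dispatch to the estimator registered for the given trainer type."""
    try:
        estimator = ESTIMATORS[trainer]
    except KeyError:
        raise ValueError(f"no estimator registered for trainer {trainer!r}") from None
    return estimator(**hints)
```

Keeping the estimators in a registry is one way to leave room for the additional backends mentioned for KEP-2839: a new trainer would add one entry rather than change the dispatch function.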


astefanutti · 2026-01-30T07:48:05Z

proposals/936-kubeflow-mcp-server/README.md

+| Method | Description | Use Case |
+|--------|-------------|----------|
+| **Kubeconfig** | Uses `~/.kube/config` or `KUBECONFIG` env var | Local development, CI/CD |
+| **ServiceAccount Token** | Mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token` | In-cluster deployment |


For in-cluster deployment, should the MCP server impersonate users instead?

Ah good point - I hadn't fully thought through the multi-user in-cluster scenario.

Please correct me if I'm wrong, but the flow would be:

1. The MCP server runs with a ServiceAccount that has impersonate permissions.
2. The AI agent authenticates the user and passes the identity to the MCP server.
3. The MCP server uses K8s impersonation (`--as` / `Impersonate-User` header) for API calls.

This way RBAC is enforced per user, even with a shared MCP deployment.

Does this align with how you'd expect it to work? I'm curious if there's a standard pattern for this - perhaps similar to how the Notebooks controller handles user identity? 🤔

Thinking about it more: user identity could be extracted from the MCP client request headers, but the MCP protocol doesn't have a standard way to pass user identity. So how will the MCP server know WHO to impersonate? 🤔

@astefanutti @andreyvelich
Please correct me if I'm wrong, but what if we use Istio/OAuth2Proxy in this case to inject the user identity? Kubeflow already uses Istio + OIDC, right?

So if the MCP server is deployed behind Kubeflow's Istio:

1. The user is already authenticated via the Kubeflow dashboard.
2. Istio adds `x-forwarded-user: user@example.com` to requests.
3. The MCP server reads the header and impersonates that user for K8s API calls.
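The header-to-impersonation step in this flow could look roughly like the sketch below. It assumes the identity arrives in `x-forwarded-user`, as suggested in this comment, and maps it to the Kubernetes `Impersonate-User` request header. The function name and the error handling are hypothetical, and no Kubernetes client is involved: the sketch only builds the headers that a client would attach to its API calls.

```python
from typing import Dict, Mapping

# Header name taken from this thread; the actual name depends on the proxy setup.
IDENTITY_HEADER = "x-forwarded-user"


def impersonation_headers(request_headers: Mapping[str, str]) -> Dict[str, str]:
    """Translate the proxy-injected identity into Kubernetes impersonation headers."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value.strip() for name, value in request_headers.items()}
    user = normalized.get(IDENTITY_HEADER, "")
    if not user:
        # Fail closed: never fall back silently to the ServiceAccount's own identity.
        raise PermissionError("request carries no authenticated user identity")
    if "\r" in user or "\n" in user:
        # Reject values that could smuggle extra headers into the outgoing request.
        raise PermissionError("invalid characters in user identity")
    return {"Impersonate-User": user}
```

A sketch like this is only safe if the proxy strips any client-supplied copy of the identity header before injecting its own. The MCP server's ServiceAccount would also need RBAC permission for the `impersonate` verb on users, which is the "impersonate permissions" prerequisite mentioned earlier in the thread.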

astefanutti · 2026-01-30T07:48:33Z

proposals/936-kubeflow-mcp-server/README.md

+
+![Multi-MCP Ecosystem](multi-mcp.png)
+
+**Design Principle:** No overlap. `kubeflow-mcp` should handle training; `kubernetes-mcp-server` should handle generic K8s operations.


kubeflow-mcp would eventually handle more than training.

Ah right - I was thinking the principle should be: kubeflow-mcp owns Kubeflow-specific CRDs (TrainJob, Experiment, ModelVersion, etc.), while kubernetes-mcp-server handles generic K8s resources (PVCs, ConfigMaps, Secrets).

So the row should probably say "Kubeflow CRDs" rather than "TrainJob CRDs"?

Or do you think there's value in having separate MCP servers per Kubeflow component (trainer-mcp, katib-mcp, model-registry-mcp)?
I see Model Registry is building its own MCP server (model-registry#2029), which raises this design question 🤔

I leaned toward a unified server since it mirrors the unified SDK structure, but I'm curious what you think. Separate servers might give component teams more autonomy.
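The ownership split proposed in this reply could be expressed as a simple lookup. This is a minimal sketch: the kind names are the examples given in the comment, the function name is hypothetical, and routing unknown kinds to the generic server is an assumption rather than something the proposal states.

```python
# Example kinds from the comment above; a real list would come from the installed CRDs.
KUBEFLOW_KINDS = frozenset({"TrainJob", "Experiment", "ModelVersion"})


def owning_server(kind: str) -> str:
    """Return the MCP server that should handle a given resource kind."""
    if kind in KUBEFLOW_KINDS:
        return "kubeflow-mcp"
    # Assumption: anything that is not Kubeflow-specific is delegated,
    # including generic resources such as PVCs, ConfigMaps, and Secrets.
    return "kubernetes-mcp-server"
```

With per-component servers instead of a unified one, the same table would map each kind to its own component server, so the lookup itself would not need to change.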

astefanutti · 2026-01-30T07:49:53Z

proposals/936-kubeflow-mcp-server/README.md

+
+## Design Details
+
+### MCP Tool Inventory


Given the large number of tools, do you have an estimate of how much context they take up, based on your prototype?

Is there any concern with respect to scaling the number of tools across all the Kubeflow components?

From the prototype, ~24 tools with docstrings come to roughly 8-10K tokens.

My intuition is that this is manageable for current MCP clients, but I'm curious about your perspective. A few questions:

1. Do you know if there's a recommended upper bound for MCP tool count? I haven't found guidance on this in the spec.
2. For full Kubeflow coverage (Trainer + Katib + Model Registry + Pipelines), we might hit ~40-50 tools. Would lazy loading (only exposing tools for installed components) be a reasonable mitigation?
3. The other approach is policy-based filtering: a readonly user sees only ~7 discovery tools, not all 50. Does this feel like the right direction?
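Both mitigations raised in these questions, lazy loading by installed component and policy-based filtering for readonly users, can be sketched as one filter over a tool registry. Everything here is hypothetical except `estimate_resources`, which appears in the proposal's tool table: the other tool names, the `Tool` fields, and the linear token extrapolation from the prototype figure (~24 tools at 8-10K tokens) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Tool:
    name: str
    component: str   # Kubeflow component that provides the tool
    read_only: bool  # True for discovery-style tools


# Hypothetical registry; apart from estimate_resources, the names are made up.
TOOLS = [
    Tool("list_train_jobs", "trainer", True),
    Tool("estimate_resources", "trainer", True),
    Tool("create_train_job", "trainer", False),
    Tool("list_experiments", "katib", True),
    Tool("create_experiment", "katib", False),
    Tool("list_models", "model-registry", True),
]


def visible_tools(tools: Iterable[Tool], installed: Set[str], readonly_user: bool) -> List[Tool]:
    """Expose only tools for installed components, and only read-only ones to readonly users."""
    return [
        tool
        for tool in tools
        if tool.component in installed and (tool.read_only or not readonly_user)
    ]


def context_tokens(tool_count: int) -> Tuple[int, int]:
    """Linearly extrapolate the prototype figure of 24 tools at 8-10K tokens."""
    return (8000 * tool_count // 24, 10000 * tool_count // 24)
```

Under the linear assumption, 50 tools would land at roughly 16-21K tokens and the ~7 discovery tools at roughly 2-3K, which gives a rough sense of what each filter would save. Real docstring lengths vary, so these numbers are only indicative.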

As for the discussion in the thread above: should we plan separate MCP servers for each Kubeflow component? I think it's better to keep everything under one roof as a unified kubeflow-mcp, similar to kubeflow-sdk, which can wrap all SDK clients in a single install. I would be happy to get more discussion or input on this 💭

Happy to add a "Scalability Considerations" section if you think it's worth calling out explicitly.

cc: @andreyvelich @jaiakash @dhanishaphadate

Add KEP-936: Introduce Kubeflow-MCP proposal for AI-Powered Training …

09f52a6

…Interface Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

google-oss-prow bot requested a review from franciscojavierarceo January 28, 2026 09:16

google-oss-prow bot requested a review from juliusvonkohout January 28, 2026 09:16

google-oss-prow bot added the size/XL label Jan 28, 2026

abhijeet-dhumal mentioned this pull request Jan 28, 2026

Built an MCP Server for Kubeflow SDK kubeflow/sdk#238

Open

astefanutti reviewed Jan 29, 2026


abhijeet-dhumal added 2 commits January 29, 2026 23:04

fix: add security section, HF Skills comparison, KEP-2839 links, and …

4e60427

…trainer-specific estimation Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

fix: adjust diagrams

64a3832

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal requested a review from astefanutti January 29, 2026 17:49

astefanutti reviewed Jan 30, 2026


andreyvelich mentioned this pull request Feb 5, 2026

feat: add MCP server foundation with discovery tools kubeflow/sdk#265

Closed

1 task


