
Add KEP-936: Introduce Kubeflow-MCP proposal for AI-Powered Training …#937

Open
abhijeet-dhumal wants to merge 3 commits into kubeflow:master from abhijeet-dhumal:kep-kubeflow-mcp

Conversation

@abhijeet-dhumal
Member

Resolves #936

…Interface

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


### Trainer Selection Logic

![Trainer Selection](trainer-selection.png)


I think it would be useful to link to kubeflow/trainer#2839 so that the selection logic is extensible and supports the trainers that will be added in the future.

Member Author

Thanks a lot @astefanutti for reviewing 🙌
Awesome suggestion, I have added references to KEP-2839 accordingly ✅

| **Unauthorized Access** | Policy layer enforces RBAC at tool level |
| **Scope Creep** | Clear delegation to `kubernetes-mcp-server` for generic K8s ops |

## Design Details


A section that covers the security aspects would be very valuable.

Member Author

Added a dedicated "Security Considerations" section ✅

2. **Maintenance Overhead**: MCP layer must track SDK changes
3. **Abstraction Layer**: Natural language may hide complexity users need to understand

## Alternatives


As skills are becoming popular, a dedicated alternative section could provide context on how they compare with this proposal.

Member Author

Great catch 🙌
I have now added "Alternative 4: Hugging Face Skills" with a detailed comparative analysis covering architecture, protocol, integration, and execution model differences. I also noted how Skills and MCP can be complementary. Please review!

- [KEP-2170: Kubeflow Trainer V2 API](https://github.com/kubeflow/trainer/blob/master/docs/proposals/2170-kubeflow-trainer-v2/README.md)

### Related Issues
- [#936: KEP Tracking Issue](https://github.com/kubeflow/community/issues/936) - This proposal



| Tool | Purpose | Returns |
|------|---------|---------|
| `estimate_resources(model, peft_method)` | GPU/memory requirements | `{gpu_memory, recommended_gpus, feasible}` |


Resource estimation may need to call different tools depending on the trainer in use.

Member Author

Added "Note on Resource Estimation" with a table showing trainer-specific estimation methods (BuiltinTrainer uses model params, CustomTrainer uses heuristics, CustomTrainerContainer defers to user). Also noted extensibility for KEP-2839 backends. ✅
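To make the trainer-specific dispatch concrete, here is a minimal sketch, not the KEP's implementation. The trainer names and the return shape come from this thread; the registry, the `register_estimator` helper, the keyword parameters, and the sizing formula (fp16 weights only, 2 bytes per parameter) are illustrative assumptions. The signature also differs from the `estimate_resources(model, peft_method)` tool in the table. The CustomTrainer heuristics are not specified here, so that trainer is left unregistered by default.

```python
import math
from typing import Callable, Dict, Optional

# Return shape mirrors the tool table: {gpu_memory, recommended_gpus, feasible}.
Estimate = Dict[str, object]
Estimator = Callable[..., Estimate]


def estimate_builtin(params_billion: float, gpu_memory_gb: float = 80.0,
                     max_gpus: int = 8) -> Estimate:
    """Placeholder estimate from model size (assumption, not the KEP formula)."""
    needed_gb = params_billion * 2.0  # 1e9 params * 2 bytes = 2 GB per billion
    gpus = max(1, math.ceil(needed_gb / gpu_memory_gb))
    return {
        "gpu_memory": f"{needed_gb:.0f}GB",
        "recommended_gpus": gpus,
        "feasible": gpus <= max_gpus,
    }


# None means "defer to the user", as described for CustomTrainerContainer.
ESTIMATORS: Dict[str, Optional[Estimator]] = {
    "BuiltinTrainer": estimate_builtin,
    "CustomTrainerContainer": None,
}


def register_estimator(trainer_type: str, estimator: Optional[Estimator]) -> None:
    """Hypothetical extension hook, e.g. for future KEP-2839 backends."""
    ESTIMATORS[trainer_type] = estimator


def estimate_resources(trainer_type: str, **kwargs) -> Optional[Estimate]:
    """Dispatch to the estimator registered for the given trainer type."""
    if trainer_type not in ESTIMATORS:
        raise ValueError(f"no estimator registered for {trainer_type!r}")
    estimator = ESTIMATORS[trainer_type]
    return None if estimator is None else estimator(**kwargs)
```

Keeping the estimators in a registry means a new trainer backend only has to register its own function; the tool entry point does not change.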

…trainer-specific estimation

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
| Method | Description | Use Case |
|--------|-------------|----------|
| **Kubeconfig** | Uses `~/.kube/config` or `KUBECONFIG` env var | Local development, CI/CD |
| **ServiceAccount Token** | Mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token` | In-cluster deployment |
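As an illustration of how a server might choose between the two methods in this table, here is a minimal standard-library sketch. The function name `detect_auth_method` and the precedence order (in-cluster token first, then `KUBECONFIG`, then `~/.kube/config`) are assumptions, not something the KEP specifies.

```python
import os
from pathlib import Path

# Standard in-cluster token path, as listed in the table above.
SA_TOKEN_PATH = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


def detect_auth_method(env=None, token_path=SA_TOKEN_PATH, home=None):
    """Return (method, source) for the first credential source that exists."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # In-cluster: the ServiceAccount token is mounted into the pod.
    if Path(token_path).is_file():
        return ("serviceaccount", str(token_path))

    # KUBECONFIG may list several paths separated by os.pathsep;
    # use the first one that exists.
    for candidate in filter(None, env.get("KUBECONFIG", "").split(os.pathsep)):
        if Path(candidate).is_file():
            return ("kubeconfig", candidate)

    default = home / ".kube" / "config"
    if default.is_file():
        return ("kubeconfig", str(default))

    raise RuntimeError("no Kubernetes credentials found")
```

With the official Kubernetes Python client, the two branches would typically map to `config.load_incluster_config()` and `config.load_kube_config()`.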


For in-cluster deployments, should the MCP server impersonate users instead?

Member Author

Ah good point - I hadn't fully thought through the multi-user in-cluster scenario.

Please correct me if I'm wrong, but I think the flow would be:

  1. MCP server runs with a ServiceAccount that has impersonate permissions
  2. AI agent authenticates user and passes identity to MCP server
  3. MCP server can use K8s impersonation (--as / Impersonate-User header) for API calls

This way RBAC is enforced per-user even with a shared MCP deployment.

Does this align with how you'd expect it to work? I'm curious if there's a standard pattern for this - perhaps similar to how the Notebooks controller handles user identity? 🤔
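A minimal sketch of step 3 above, assuming the identity has already been established by the agent or a proxy. `Impersonate-User` and `Impersonate-Group` are the standard Kubernetes impersonation headers; the `UserIdentity` type and the `impersonation_headers` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UserIdentity:
    """Identity handed to the MCP server (hypothetical type)."""
    username: str
    groups: Tuple[str, ...] = ()


def impersonation_headers(identity: UserIdentity) -> List[Tuple[str, str]]:
    """Build Kubernetes impersonation headers as (name, value) pairs.

    Impersonate-Group can be repeated, so a list of pairs is returned
    rather than a dict.
    """
    if not identity.username.strip():
        raise ValueError("refusing to impersonate an empty username")
    headers = [("Impersonate-User", identity.username)]
    headers += [("Impersonate-Group", group) for group in identity.groups]
    return headers
```

For this to work, the MCP server's own ServiceAccount needs RBAC permission for the `impersonate` verb on the users and groups it is allowed to act as.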

Member Author

Thinking about this more: the user identity could be extracted from the MCP client request headers, but the MCP protocol doesn't have a standard way to pass user identity. So how will the MCP server know WHO to impersonate? 🤔

Member Author

@astefanutti @andreyvelich
Please correct me if I'm wrong, but what if we use Istio/OAuth2Proxy to inject the user identity in this case? Kubeflow already uses Istio + OIDC, right?

Member Author

So if the MCP server is deployed behind Kubeflow's Istio:

  • User already authenticated via Kubeflow dashboard
  • Istio adds x-forwarded-user: user@example.com to requests
  • MCP server reads header, impersonates that user for K8s API calls
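The header-reading step could look roughly like the sketch below. The header name `x-forwarded-user` is taken from the bullet above; the function name `user_from_headers` and the choice of `PermissionError` are assumptions for illustration.

```python
def user_from_headers(headers, header_name="x-forwarded-user"):
    """Return the proxy-authenticated user from request headers, or raise."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {key.lower(): value for key, value in headers.items()}
    user = normalised.get(header_name.lower(), "").strip()
    if not user:
        # Never fall back to the server's own identity when the header is absent.
        raise PermissionError(f"missing {header_name} header; request is unauthenticated")
    return user
```

This is only safe if requests can reach the MCP server exclusively through the mesh; otherwise any client could set the header itself and impersonate another user.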


![Multi-MCP Ecosystem](multi-mcp.png)

**Design Principle:** No overlap. `kubeflow-mcp` should handle training; `kubernetes-mcp-server` should handle generic K8s operations.


kubeflow-mcp would eventually handle more than training.

Member Author

Ah right - I was thinking the principle should be: kubeflow-mcp owns Kubeflow-specific CRDs (TrainJob, Experiment, ModelVersion, etc.), while kubernetes-mcp-server handles generic K8s resources (PVCs, ConfigMaps, Secrets).

So the row should probably say "Kubeflow CRDs" rather than "TrainJob CRDs"?

Or do you think there's value in having separate MCP servers per Kubeflow component (trainer-mcp, katib-mcp, model-registry-mcp)?
I see Model Registry is building its own MCP server (model-registry#2029). This raises a design question 🤔

Member Author

I leaned toward a unified server since it mirrors the unified SDK structure, but I'm curious what you think. Separate servers might give component teams more autonomy.


## Design Details

### MCP Tool Inventory


Given the large number of tools, do you have an estimate, based on your prototype, of how much of the context window they take up?

Is there any concern about scaling the number of tools to cover all the Kubeflow components?

Member Author

@abhijeet-dhumal, Feb 3, 2026

From the prototype, ~24 tools with docstrings comes to roughly 8-10K tokens.
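As a rough back-of-the-envelope check, the prototype figure works out to about 330-420 tokens per tool. The sketch below extrapolates that to the 40-50 tool range discussed in this thread, under the assumption (not measured) that token cost scales linearly with tool count.

```python
# Prototype figures quoted above: ~24 tools, roughly 8-10K tokens in total.
PROTOTYPE_TOOLS = 24
PROTOTYPE_TOKENS = (8_000, 10_000)


def projected_tokens(tool_count):
    """Linear extrapolation of (low, high) context cost for a given tool count."""
    low, high = PROTOTYPE_TOKENS
    return (
        round(low * tool_count / PROTOTYPE_TOOLS),
        round(high * tool_count / PROTOTYPE_TOOLS),
    )


for count in (24, 40, 50):
    print(count, projected_tokens(count))
```

Under that assumption, 50 tools would cost roughly 17-21K tokens before any filtering.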

My intuition is that this is manageable for current MCP clients, but I'm curious about your perspective. A few questions:

  1. Do you know if there's a recommended upper bound for MCP tool count? I haven't found guidance on this in the spec.
  2. For full Kubeflow coverage (Trainer + Katib + Model Registry + Pipelines), we might hit ~40-50 tools. Would lazy loading (only expose tools for installed components) be a reasonable mitigation?
  3. The other approach is policy-based filtering - a readonly user sees only ~7 discovery tools, not all 50. Does this feel like the right direction?
  4. Following on from the thread above: should we plan separate MCP servers for each Kubeflow component? I think it's better to keep everything under one roof as a unified kubeflow-mcp, similar to kubeflow-sdk, which can wrap all SDK clients in a single install. I would be happy to get more discussion or input on this 💭
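Options 2 and 3 above can be combined in a single filter applied before tools are advertised to the client. This is a minimal sketch: the `Tool` type, the `visible_tools` function, the `readonly` role name, and every tool name except `estimate_resources` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Tool:
    name: str
    component: str   # Kubeflow component that owns the tool, e.g. "trainer"
    read_only: bool  # True for discovery-style tools that do not mutate state


def visible_tools(tools: Iterable[Tool], installed: Set[str], role: str) -> List[Tool]:
    """Return only the tools this user should see in the tool listing."""
    visible = []
    for tool in tools:
        # Lazy loading: skip tools for components that are not installed.
        if tool.component not in installed:
            continue
        # Policy filter: a read-only role only sees non-mutating tools.
        if role == "readonly" and not tool.read_only:
            continue
        visible.append(tool)
    return visible


CATALOG = [
    Tool("estimate_resources", "trainer", read_only=True),
    Tool("submit_training_job", "trainer", read_only=False),  # hypothetical name
    Tool("list_experiments", "katib", read_only=True),        # hypothetical name
]
```

For example, `visible_tools(CATALOG, {"trainer"}, "readonly")` would expose only `estimate_resources`, which keeps both the context cost and the attack surface smaller for that user.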

Happy to add a "Scalability Considerations" section if you think it's worth calling out explicitly.

cc: @andreyvelich @jaiakash @dhanishaphadate


Development

Successfully merging this pull request may close these issues.

KEP: Kubeflow MCP Server - AI-Powered Training Interface

2 participants