Add KEP-936: Introduce Kubeflow-MCP proposal for AI-Powered Training … #937
abhijeet-dhumal wants to merge 3 commits into kubeflow:master from
Conversation
…Interface Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
> ### Trainer Selection Logic
I think it would be useful to link to kubeflow/trainer#2839 so it's extensible and supports the future trainers that will be added.
Thanks a lot @astefanutti for reviewing 🙌
Awesome suggestion, I have added references to KEP-2839 accordingly ✅
> | **Unauthorized Access** | Policy layer enforces RBAC at tool level |
> | **Scope Creep** | Clear delegation to `kubernetes-mcp-server` for generic K8s ops |
>
> ## Design Details
A section that covers the security aspects would be very valuable.
Added a dedicated "Security Considerations" section ✅
> 2. **Maintenance Overhead**: MCP layer must track SDK changes
> 3. **Abstraction Layer**: Natural language may hide complexity users need to understand
>
> ## Alternatives
As skills are becoming popular, a dedicated alternative section could provide context on how it compares with the proposal.
Great catch 🙌
I have now added "Alternative 4: Hugging Face Skills" with a detailed comparative analysis covering architecture, protocol, integration, and execution model differences. Also noted how Skills and MCP can be complementary, please review!
> - [KEP-2170: Kubeflow Trainer V2 API](https://github.com/kubeflow/trainer/blob/master/docs/proposals/2170-kubeflow-trainer-v2/README.md)
>
> ### Related Issues
> - [#936: KEP Tracking Issue](https://github.com/kubeflow/community/issues/936) - This proposal
kubeflow/model-registry#2029 could be added here.
> | Tool | Purpose | Returns |
> |------|---------|---------|
> | `estimate_resources(model, peft_method)` | GPU/memory requirements | `{gpu_memory, recommended_gpus, feasible}` |
Resource estimation may need to call different tools depending on the trainers.
Added "Note on Resource Estimation" with a table showing trainer-specific estimation methods (BuiltinTrainer uses model params, CustomTrainer uses heuristics, CustomTrainerContainer defers to user). Also noted extensibility for KEP-2839 backends. ✅
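To make the per-trainer dispatch concrete, here is a minimal sketch of how the tool could route estimation by trainer kind. All function names, return values, and the registry shape are illustrative assumptions, not part of the KEP:

```python
# Hypothetical dispatch sketch: the estimation strategy varies by trainer
# kind, mirroring the "Note on Resource Estimation" table. The estimator
# bodies are stubs; real implementations would compute actual figures.

def estimate_builtin(model, peft_method):
    # BuiltinTrainer: derive requirements from known model parameter counts.
    return {"method": "model-params", "feasible": True}

def estimate_custom(model, peft_method):
    # CustomTrainer: fall back to heuristics over the training function.
    return {"method": "heuristic", "feasible": True}

def estimate_container(model, peft_method):
    # CustomTrainerContainer: defer to the user-supplied resource spec.
    return {"method": "user-specified", "feasible": None}

# New backends (e.g. trainers added via KEP-2839) would register here.
ESTIMATORS = {
    "BuiltinTrainer": estimate_builtin,
    "CustomTrainer": estimate_custom,
    "CustomTrainerContainer": estimate_container,
}

def estimate_resources(model, trainer_kind, peft_method="lora"):
    try:
        return ESTIMATORS[trainer_kind](model, peft_method)
    except KeyError:
        raise ValueError(f"no estimator registered for trainer {trainer_kind!r}")
```

A registry keeps the tool surface stable while letting new trainer backends plug in without changing the MCP-facing signature.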
…trainer-specific estimation Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
> | Method | Description | Use Case |
> |--------|-------------|----------|
> | **Kubeconfig** | Uses `~/.kube/config` or `KUBECONFIG` env var | Local development, CI/CD |
> | **ServiceAccount Token** | Mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token` | In-cluster deployment |
In the case of in-cluster deployment, should the MCP server rather impersonate users?
Ah good point - I hadn't fully thought through the multi-user in-cluster scenario.
Please correct me here, but the flow would be:
- MCP server runs with a ServiceAccount that has `impersonate` permissions
- AI agent authenticates the user and passes the identity to the MCP server
- MCP server uses K8s impersonation (`--as` / the `Impersonate-User` header) for API calls

This way RBAC is enforced per-user even with a shared MCP deployment.
Does this align with how you'd expect it to work? I'm curious if there's a standard pattern for this - perhaps similar to how the Notebooks controller handles user identity? 🤔
Thinking.. user identity can be extracted from the MCP client request headers, but the MCP protocol doesn't have a standard way to pass user identity. So how will the MCP server know WHO to impersonate? 🤔
@astefanutti @andreyvelich
Please correct me here, but what if we use Istio/OAuth2Proxy for this case to inject user identity? Kubeflow already uses Istio + OIDC, right?
So if the MCP server is deployed behind Kubeflow's Istio:
- User is already authenticated via the Kubeflow dashboard
- Istio adds `x-forwarded-user: user@example.com` to requests
- MCP server reads the header and impersonates that user for K8s API calls
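As a minimal sketch of that header-to-impersonation step (assuming the Istio/OAuth2Proxy setup injects `x-forwarded-user`; the helper name is made up for illustration):

```python
# Hypothetical sketch: map the identity injected by a trusted proxy onto
# the Kubernetes `Impersonate-User` header for downstream API calls.
# The header name depends on the actual Istio/OAuth2Proxy configuration.
TRUSTED_USER_HEADER = "x-forwarded-user"

def impersonation_headers(request_headers):
    """Return the impersonation headers for a single MCP request."""
    user = request_headers.get(TRUSTED_USER_HEADER)
    if not user:
        # Fail closed: never fall back to the ServiceAccount's own RBAC,
        # otherwise an unauthenticated request inherits server privileges.
        raise PermissionError("no authenticated user identity in request")
    return {"Impersonate-User": user}
```

The key property is per-request scoping: each Kubernetes API call carries the caller's identity, so RBAC decisions happen against the user, not the shared ServiceAccount.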
> **Design Principle:** No overlap. `kubeflow-mcp` should handle training; `kubernetes-mcp-server` should handle generic K8s operations.
kubeflow-mcp would eventually handle more than training.
Ah right - I was thinking the principle should be: kubeflow-mcp owns Kubeflow-specific CRDs (TrainJob, Experiment, ModelVersion, etc.), while kubernetes-mcp-server handles generic K8s resources (PVCs, ConfigMaps, Secrets).
So the row should probably say "Kubeflow CRDs" rather than "TrainJob CRDs"?
Or do you think there's value in having separate MCP servers per Kubeflow component (trainer-mcp, katib-mcp, model-registry-mcp)?
I see Model Registry is building its own MCP server (model-registry#2029). This raises a design question 🤔
I leaned toward unified since it mirrors the unified SDK structure, but curious what you think? Separate servers might give component teams more autonomy.
> ## Design Details
>
> ### MCP Tool Inventory
Given the large number of tools, do you have an estimate of the size that it takes in the context based on your prototype?
Is there any concern w.r.t. scaling the number of tools for all the Kubeflow components?
From the prototype, ~24 tools with docstrings comes to roughly 8-10K tokens.
My intuition is that this is manageable for current MCP clients, but I'm curious about your perspective. A few questions:
- Do you know if there's a recommended upper bound for MCP tool count? I haven't found guidance on this in the spec.
- For full Kubeflow coverage (Trainer + Katib + Model Registry + Pipelines), we might hit ~40-50 tools. Would lazy loading (only expose tools for installed components) be a reasonable mitigation?
- The other approach is policy-based filtering - a `readonly` user sees only ~7 discovery tools, not all 50. Does this feel like the right direction?
- And from the discussion in the thread above, should we plan separate MCP servers for each Kubeflow component? I think it's good to keep everything under one roof as a unified kubeflow-mcp, similar to kubeflow-sdk, which can wrap all SDK clients with a single install.. but I would be happy to get more discussion or inputs on this aspect 💭
Happy to add a "Scalability Considerations" section if you think it's worth calling out explicitly.
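To illustrate the policy-based filtering idea, a rough sketch of how tool visibility could be computed per role. The tool names and the single-flag policy model here are assumptions for illustration, not the proposed inventory:

```python
# Illustrative sketch: a readonly role only sees non-mutating discovery
# tools, shrinking both the exposed surface and the context footprint
# that tool schemas occupy in the MCP client.
TOOLS = {
    "list_trainjobs": {"mutating": False},
    "get_trainjob": {"mutating": False},
    "estimate_resources": {"mutating": False},
    "create_trainjob": {"mutating": True},
    "delete_trainjob": {"mutating": True},
}

def visible_tools(role):
    """Return the tool names a given role is allowed to see."""
    if role == "readonly":
        return [name for name, meta in TOOLS.items() if not meta["mutating"]]
    return list(TOOLS)
```

Filtering at registration time (rather than rejecting calls at invocation time) also means a 50-tool inventory never has to be loaded into the context of a client that can only use a fraction of it.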
Resolves #936