KEP-4671: Decoupled PodGroup and Workload API in alphav2 #5893
Conversation
mm4tt commented Feb 5, 2026
- One-line PR description: Decouple PodGroup and Workload API in alphav2
- Issue link: Gang Scheduling Support in Kubernetes #4671
- Other comments:
Skipping CI for Draft Pull Request.
| internal structure. Those workloads include builtins like `Job` and `StatefulSet`, and custom workloads, like
| `JobSet`, `LeaderWorkerSet`, `MPIJob` and `TrainJob`. All of these workload types are used for AI training and inference use cases.
| The 1.36 revision, as detailed in the KEP-5832 and [^4], addresses feedback regarding the ambiguity of the original
This paragraph is a valid addition, but I don't think it belongs to Motivation - I think it rather belongs to the Proposal section.
| all existing scheduling features.
| - Provide full backward compatibility for all existing scheduling features
| In 1.36:
The goals should reflect the goals for the effort as a whole. So I wouldn't distinguish the 1.36 part specifically.
I think that we should just replace the existing second goal with the following:
- Implement the first version of `Workload` API necessary as a mechanism for defining scheduling policies
- Introduce a concept of a `PodGroup` positioned as runtime counterparts for the Workload
- Ensure that decoupled model of `Workload` and `PodGroup` provide clear responsibility split, improved scalability and simplified lifecycle management
| beta: "v1.37"
| stable: "v1.38"
Ah, we haven't updated beta in the previous update. Stable also needs to move to keep the two-release distance.
Done.
Force-pushed from 7304b87 to 08d8c16
wojtek-t left a comment
I'm overall ok with this proposal.
Force-pushed from 08d8c16 to 1abd8f6
| kind: Pod
| spec:
|   ...
|   workloadRef:
Let's talk about names. In general, a somethingRef field has information about which Something is being referenced. The previous version of this KEP had a name of a Workload object and a name of a sub-stanza (podgroup) within it. Reasonable.
Now it is actually TWO references, IIUC. That's...a little unusual.
A few questions:

- Do we need both references directly in pod? Or is the Pod->Workload relationship implied by the Pod->PodGroup->Workload chain? What would happen if Pod.workloadRef.name = "foo" while Pod.workloadRef.podGroupName = "pg" and PodGroup pg.workloadRef = "bar" (different workloadRef values)?
- If we really need both in Pod, would it be awful to do something like:

      Pod:
        workload:
          workloadName: "foo-wl"
          podGroupName: "foo-pg"

  or even:

      Pod:
        workloadName: "foo-wl"
        podGroupName: "foo-pg"

- Is it allowed to set `Pod.spec.workloadRef.podGroupName` without `Pod.spec.workloadRef.name`? How about vice-versa?
- Is it likely that we will grow other workload-related fields under Pod?
- Technically Pod -> PodGroup is enough and necessary for the system to function. It provides the immediate runtime context, and the scheduler explicitly needs information about the PodGroup when considering a pod for scheduling. The direct Pod -> Workload reference was included primarily for debuggability/UX, allowing users to identify the logical workload directly without traversing the runtime hierarchy. Users conceptually operate on the Workload API, while PodGroup serves as an intermediate implementation detail.
- Your suggestions (flattening/renaming) are cleaner. Overall, we could opt to not have the direct Workload reference in 1.36 and only keep the PodGroup reference. We can then revisit this in 1.37 with a concrete proposal addressing user stories (e.g. improving debuggability) and solving data inconsistency issues (e.g. via validation). OTOH, WorkloadRef is a field that already exists in the Pod API, which is an argument to just keep it (pointing to Workload) and move the PodGroup reference outside of it.
- In the decoupled model, podGroupName is the essential link. So setting workloadRef.name without workloadRef.podGroupName would be an invalid configuration, as the scheduler needs the specific runtime context.
- We do expect the API to grow. However, it’s not yet decided how. We are considering future extensions (see this) where we might introduce PodSubGroup. In that scenario, the “podgroup” reference would likely evolve into a oneOf pattern (pointing to either PodGroup or PodSubGroup) to strictly enforce the hierarchy level.
Given these trade-offs, we are considering the following alternatives. Do you have a preference, or perhaps a different suggestion?
- Option A

      // PodSpec
      workloadRef:
        name: "my-workload-policy"
      podGroupRef:
        name: "worker-0"

  Open question: Is podGroupRef the right name if we may extend it to support PodSubGroup via oneOf later?

- Option B

      // PodSpec
      // workloadRef: <tombstoned>
      podGroupRef:
        name: "worker-0"
      // workloadName: "my-workload-policy" // Potential addition in 1.37+

  Same question about the podGroupRef name as above.
I'd go with the following updated version of option A for 1.36 and revisit this decision in 1.37. For now, it would be cleaner to get the workload name from the Pod spec to check its existence during scheduling. I expect it will also be needed during integration with controllers.
// PodSpec
podGroupRef:
  name: "worker-0"
workloadName: "wl-0"
The above has an extensibility drawback - I can imagine a future when we will have "standalone" pod groups not necessarily attached to workloads. And that would not work for it.
In general, the whole purpose of PodGroup is that it's supposed to be self-contained. So the link to podGroup from pod is critical (it provides the grouping), but the link to workload serves a very different purpose (and e.g. scheduler doesn't need it at all).
My concern is both mechanical and about extensibility.
Mechanical: If we have two paths from a given Pod to a Workload:

- Pod->PodGroup->Workload
- Pod->Workload

...then those can be inconsistent (ending at 2 different Workloads). Denormalizing is OK if it makes some important UX easier (e.g. some user-story that involves pods and workload but never PodGroups), but we have to define what happens if we hit that inconsistency (or how to avoid it).
- Is the link required in spec or can it be assigned via status? E.g. `Pod.spec.podGroupName = "foo"` produces `Pod.status.workloadName = "bar"`. The data can still be present but it is less likely to be inconsistent.
- Or should we validate it at admission control (check that both paths arrive at the same Workload)? A sketch of such a check follows after this list.
- Or do we just omit it and see if it is ACTUALLY needed?
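To make the admission-control option concrete, here is a minimal sketch of such a consistency check. All type and field names are assumptions for illustration, since the API shape is still under discussion:

```go
package admission

import "fmt"

// Illustrative stand-ins for the API types under discussion (names assumed).
type WorkloadReference struct {
	Name         string // direct Pod -> Workload path
	PodGroupName string // Pod -> PodGroup -> Workload path
}

type Pod struct{ WorkloadRef *WorkloadReference }

type PodGroup struct{ WorkloadName string }

// validateWorkloadPaths rejects pods whose two paths to a Workload disagree.
func validateWorkloadPaths(pod *Pod, getPodGroup func(name string) (*PodGroup, error)) error {
	ref := pod.WorkloadRef
	if ref == nil || ref.Name == "" {
		return nil // no direct Workload reference, nothing to cross-check
	}
	pg, err := getPodGroup(ref.PodGroupName)
	if err != nil {
		return fmt.Errorf("resolving PodGroup %q: %w", ref.PodGroupName, err)
	}
	if pg.WorkloadName != ref.Name {
		return fmt.Errorf("pod references Workload %q but PodGroup %q references Workload %q",
			ref.Name, ref.PodGroupName, pg.WorkloadName)
	}
	return nil
}
```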
Extensibility: It sounds like there's a possibility that PodGroup is just one possible form of grouping, and that others may emerge (PodSubGroup is not a great name, but it illustrates the point). If this is true, then we should leave room for that expansion -- we don't want to mess around in Pod.
Is there a more general term for the relationship between a Pod and PodGroup / PodSubGroup / WhateverElseMightComeUp? It's not "parent" or "owner". "Workgroup"? E.g. something like:
type PodSpec struct {
    // ...
    // +optional
    Workgroup *PodWorkgroup
}

type PodWorkgroup struct {
    // One of the following must be set.
    // +optional
    PodGroupName *string
    // +optional
    PodSubGroupName *string
}
Perhaps struct instead of strings, but you get the point. It makes the API a little verbose, but it leaves room for growth if we think there's even a slight possibility.
- Let's remove the Pod->Workload reference (at least for now). This isn't strictly needed for correctness. The UX/debuggability isn't that important and I agree we can punt on it for now.
I really like the idea of storing that in status but let's decouple it and maybe add that as future ideas.
The above would address the first concern.
- The struct is conceptually what we already have now and for me it boils down to naming. I like that (modulo naming, but naming we can hopefully figure out during code review).
| }
| ```
|
| Note: In 1.36, all fields in WorkloadReference are made optional, and the validation logic for the Name and
What does it mean to have a totally empty workloadReference?
In the current proposal (setting aside your "let's talk about names" comment for a moment), it is acceptable to have workloadRef unset (meaning the pod doesn't use workload-aware scheduling), but it is illegal to have it set but empty.
As noted here, validation will still be enforced, but implemented in code. We are adopting this approach to ensure the API can be extended in a backward-compatible way in the future. A concrete scenario is extending the structure with PodSubGroup in 1.37+, for which we need PodGroupName to be +optional to support a oneOf relationship (between PodGroupName and PodSubGroupName).
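For illustration, the in-code validation described above might look roughly like this. This is a sketch under the assumption that PodSubGroupName is added later; none of the names are final:

```go
package validation

import "errors"

// Sketch of a WorkloadReference where every field is +optional in the schema
// and the one-of constraint is enforced in code (names are assumptions).
type WorkloadReference struct {
	PodGroupName    *string
	PodSubGroupName *string // hypothetical 1.37+ extension
}

// validateWorkloadRef allows an unset reference but rejects an empty one,
// and requires exactly one grouping field once the struct is set.
func validateWorkloadRef(ref *WorkloadReference) error {
	if ref == nil {
		return nil // unset: the pod opts out of workload-aware scheduling
	}
	set := 0
	if ref.PodGroupName != nil {
		set++
	}
	if ref.PodSubGroupName != nil {
		set++
	}
	switch {
	case set == 0:
		return errors.New("workloadRef must not be empty: one grouping field is required")
	case set > 1:
		return errors.New("workloadRef: exactly one grouping field may be set")
	}
	return nil
}
```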
What I meant is that you have both name and podGroupName as optional, which means that this is valid:

    Pod:
      spec:
        workloadRef: {}

Which is a different thing than not specifying workloadRef at all. If you are asserting that there's a one-of situation with only one valid option (for now), then it needs to say that.
Rewritten. PTAL.
|   // +required
|   Policy PodGroupPolicy
| // PodGroupTemplate represents a template for a set of pods with a common policy.
| type PodGroupTemplate struct {
This doesn't show type PodGroup -- will it use the exact same schema as this? Will it use the same types?
I would caution -- before using the same types (e.g. PodGroupSchedulingPolicy) think whether these will ALWAYS be the exact same schema or if there might be things in PodGroup that are not in the Workload template (or the reverse). Using the same type becomes embedded in the client.
Ah, thanks for catching the missing PodGroup definition—I’ve added it to the KEP.
It will be a distinct top-level resource with its own schema. However, we intentionally plan to reuse PodGroupSchedulingPolicy and include it in PodGroupSpec. The reasoning is:
- Template->PodGroup consistency: The PodGroup instance is the direct realization of the Workload template. Any policy parameter defined in the template must exist in the instance to be enforced. Conversely, it would be illogical to include a parameter in the template's policy that has no representation in the instance.
- Runtime extension: We envision PodGroupSpec may eventually contain more information than the template (e.g., desiredCount, the actual number of pods, populated by a real-workload controller and known only at runtime). By embedding the shared PodGroupSchedulingPolicy struct within the larger PodGroupSpec, we maintain a consistent policy definition while allowing the runtime resource to carry additional, non-template state.
Does it make sense?
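A minimal sketch of that layout, for illustration only (field names like MinCount and DesiredCount are assumptions, shown to convey the shared-struct idea, not the final API):

```go
// PodGroupSchedulingPolicy is the single policy struct shared by the
// Workload template and the runtime PodGroup, keeping the two consistent
// by construction.
type PodGroupSchedulingPolicy struct {
	MinCount int32 // e.g. a gang-scheduling minimum (assumed field)
}

// PodGroupTemplate lives inside the Workload and carries only the policy.
type PodGroupTemplate struct {
	Policy PodGroupSchedulingPolicy
}

// PodGroupSpec embeds the same policy but can grow runtime-only fields
// that have no counterpart in the template.
type PodGroupSpec struct {
	Policy PodGroupSchedulingPolicy

	// DesiredCount is a hypothetical runtime-only field, populated by the
	// workload controller once the actual pod count is known.
	DesiredCount *int32
}
```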
> Any policy parameter defined in the template must exist in the instance to be enforced

Can there EVER be a field in the instance's PodGroupSchedulingPolicy which you DO NOT want to be configurable in the template? We see that in PodTemplate all the time (e.g. Deployment.spec.template.restartPolicy is present but not actually configurable).
The current arrangement SEEMS ok, just think about it as you move into impl.
> Any policy parameter defined in the template must exist in the instance to be enforced

I'm not sure if this is really a must. We may need a field in the template that doesn't need to be in the PodGroup runtime object. For example (just my imagination), the SchedulingPolicy gets complex to the extent that we would create another runtime option to do X function. We never know.
I'd go with creating two separate structs for each.
We definitely can imagine use cases where we need another runtime option that we don't want to expose. But this can be achieved by extending the PodGroupSpec with new fields (as opposed to extending SchedulingPolicy).
We were discussing it with @mm4tt and @dom4ha and didn't come up with any ideas where we actually need to distinguish SchedulingPolicy in Workload vs PodGroup.
Don't need to resolve for now, but will need a hard commitment by actual code review time.
dom4ha left a comment
GitHub messed up diffs in my comments and I could not fix that, sorry about it
| The `Workload` core resource will be introduced. A `Workload` does not create any pods. It just describes what pods the scheduler should expect to see, and how to treat them.
| The `Workload` resource is a new core resource that provides scheduling policy definitions. It does not manage pod
| lifecycles or interfere with the pod creation logic of controllers like `Job`, `JobSet`, or `StatefulSet`. Instead, it
| serves as a policy blueprint, describing the PodGroupTemplates with their corresponding scheduling policies
It's a bit unclear what exactly the difference is between Policy and Template, since to me these represent somewhat different approaches.
In the template-instance model, PodGroups are self-contained and do not need any additional Policy reference other than an optional pointer to the template they were built from. In that model they may not need a workloadRef at all, just a relation to their parent group if it exists. Pods could become the only objects that reference the Workload.
In the policy model, the responsibilities and relations are less clear to me and maybe they should be defined more explicitly.
You're right. I will update the text to make this clear: Workload is the template, PodGroup is self-contained (and completely sufficient for the scheduler).
| The kube-scheduler will be watching for `PodGroup` objects (using informers) and will use them to map pods
| to and from their `PodGroup` objects.
| In the initial implementation, we expect users to create the `PodGroup` objects. In the next steps, controllers
| will be updated to create the appropriate `PodGroup` objects themselves whenever they can appropriately infer
| the intention from the desired state.
| Note that given scheduling options are stored in the `PodGroup` object, pods linked to the `PodGroup`
| object will not be scheduled until this `PodGroup` object is created and observed by the kube-scheduler.
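A rough sketch of how that gating could look on the scheduler side, e.g. as a PreEnqueue-style check. Only the PreEnqueue extension point is taken from the scheduler framework; the PodGroup lister and the way the pod names its group are assumptions, not the final design:

```go
package podgroupgate

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// podGroupLister is a hypothetical informer-backed lister for PodGroup objects.
type podGroupLister interface {
	Exists(namespace, name string) bool
}

// PodGroupGate keeps pods out of the active queue until their PodGroup
// has been observed, matching the behavior described in the KEP text above.
type PodGroupGate struct {
	lister podGroupLister
}

func (g *PodGroupGate) Name() string { return "PodGroupGate" }

func (g *PodGroupGate) PreEnqueue(ctx context.Context, pod *v1.Pod) *framework.Status {
	name := podGroupName(pod)
	if name == "" {
		return nil // not part of a PodGroup: schedule normally
	}
	if !g.lister.Exists(pod.Namespace, name) {
		// PodGroup not yet created/observed: keep the pod unschedulable for now.
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, "PodGroup not observed yet")
	}
	return nil
}

// podGroupName is a placeholder: the real field (pod.spec.schedulingGroup.podGroupName)
// does not exist in the Pod API yet.
func podGroupName(pod *v1.Pod) string {
	return pod.Annotations["example.invalid/pod-group"]
}
```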
…ined and enough for the scheduler to act, i.e. the scheduler doesn't need to watch Workload objects
wojtek-t left a comment
Added a few more (mostly cosmetic) comments.
| separate runtime object, `PodGroup`, represents the actual runtime instance of grouped pods. `Pods` reference their
| execution context (PodGroup) via the `workloadRef` field.
|
| To ensure long-term API quality, the Workload API remains in Alpha for the 1.36 cycle. This allows for the
nit: this is implementation/graduation detail.
When we promote the feature to beta, this will no longer be relevant to the readers.
I wouldn't put that in the summary - instead let's put it into the detailed proposal section.
Sounds good.
| metadata:
|   name: job-instance-worker-0
| spec:
|   workloadRef:
We changed the structure of referencing as described here:
#5833 (comment)
Can you please update to reflect it?
Done.
| of a single leader and `N` workers and that forms a scheduling (and runtime) unit, but workload as a whole
| may consist of a number of such replicas.
|
| The `PodGroup` resource will be introduced with the following structure:
Let's link to the PodGroup KEP here, and describe that this represents a current snapshot of the API, but the other KEP should be treated as a source of truth for the PodGroup API.
Done. PTAL
wojtek-t left a comment
Let's wait for final conclusions on 5832, but this overall LGTM now.
Force-pushed from 0f55654 to 9de2d2b
Force-pushed from 9de2d2b to b706bbc
wojtek-t left a comment
This LGTM now - thanks!
Force-pushed from b706bbc to 5dc23a3
/lgtm

To wait for others
dom4ha left a comment
/approve
The schedulingGroup name is a good one, and I'm glad we came to a version in which we have a fully self-contained scheduling tree structure (which will evolve further).
thockin left a comment
I am satisfied with the shape of this, modulo one question about one-of-ness for the KEP. I'm going to approve, and when it comes to actual code review we can pick the smallest nits. Let's try to front-load that process?
| workloadRef:
|   name: job-1
|   podGroup: pg1
| schedulingGroup:
is this ONLY about scheduling? I think it has as much to do with lifecycle, interruptions, etc as it does with scheduling, eventually?
Fair point. I agree that these group associations will likely impact the broader lifecycle (preemption, resource attachment, etc.), but I'm still a bit hesitant about the name workgroup. To me, it feels a bit too generic – it doesn't immediately signal the purpose of the field to the user.
To me, 'scheduling' remains the primary functional anchor and the clearest entry point for users. It's the 'why' behind creating these groups in the first place.
I'm leaving it like this in KEP and I'm happy to continue the discussion during the API review. Implementation PRs will be submitted promptly to front-load the review cycle.
So I think there are two angles for looking at it:
- if we treat "scheduling" in its strict meaning, I agree it's not an ideal name
- however, if we start talking about scheduling in its wider context (e.g. cluster-autoscaling is also kind-of scheduling, just on not-yet-existent resources; preemption is also scheduling, just for different workloads; etc.), the things that you mentioned (and a bunch of others) kind-of fit into that category - this is my mental model

That being said - I'm not a naming expert and I'm absolutely open to other names, but I think that, as Matt wrote above, we can continue this discussion during code review.
|   PodGroupReplicaKey string
| // PodSchedulingGroup identifies the runtime scheduling group instance that a Pod belongs to.
| // The scheduler uses this information to apply workload-aware scheduling semantics.
| type PodSchedulingGroup struct {
I think this should be described as a one-of which only has one option. What I mean is that it currently allows 3 states to exist.

This has clear meaning:

    Pod:
      spec:
        schedulingGroup: null # <--- not specified

This has clear meaning:

    Pod:
      spec:
        schedulingGroup:
          podGroupName: "my-podgroup"

What does this mean? Is it actually allowed?

    Pod:
      spec:
        schedulingGroup: {}

We need to be super careful that we don't make "nothing" mean something, or else we can't ever evolve it (because we cannot tell the difference between "said nothing" and "said something I don't know how to decode").
So I think I would say that the inside of PodSchedulingGroup is a one-of. Each field is optional but ONE OF THEM must be specified. There just happens to only be one option. For now.
> So I think I would say that the inside of PodSchedulingGroup is a one-of. Each field is optional but ONE OF THEM must be specified. There just happens to only be one option. For now.
Right - that's exactly what we need.
So you're saying that we need to mark it from the beginning as "oneOf" and over time just add options to it, right?
Added +oneOf tag. Our intent is to not allow empty schedulingGroup, which I tried to express in the note below (line 356).
> Note: All fields in PodSchedulingGroup are intentionally made optional. The validation logic for the PodGroupName field presence will be implemented in the code to allow for extending this structure (via oneOf) with more scheduling grouping concepts (e.g. PodSubGroupName) in the future.
My understanding is that with oneOf, this extra in-code validation is not needed (the oneOf validation will enforce that the only option is set). Given that, I'm removing the note.
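For reference, a sketch of the resulting one-of shape. The exact marker syntax may differ in the final code, and PodSubGroupName is hypothetical, included only to illustrate how the union would grow:

```go
// PodSchedulingGroup identifies the runtime scheduling group a Pod belongs to.
// +oneOf
type PodSchedulingGroup struct {
	// Exactly one of the following must be set.

	// PodGroupName references the PodGroup this pod belongs to.
	// +optional
	PodGroupName *string

	// PodSubGroupName is a possible future member, shown here only to
	// illustrate growth; it is not part of the 1.36 proposal.
	// +optional
	PodSubGroupName *string
}
```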
I think we're still doing extra validation, but given the "oneOf" nature is now clearly reflected in the type definition, we can decide that during code review.
/approve for API
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dom4ha, mm4tt, thockin
/lgtm