
Conversation

@helayoty
Member

  • One-line PR description: Decouple PodGroup API as a runtime object from Workload API

  • Issue link: WAS: Decouple PodGroup API #5832

  • Other comments: This PR is part of the workload-aware scheduling workstream to implement gang scheduling. It introduces a new, separate runtime object, PodGroup, that helps keep the Workload API simple and user-friendly.

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 23, 2026
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Jan 23, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Jan 23, 2026
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 23, 2026
@helayoty helayoty mentioned this pull request Jan 23, 2026
@helayoty helayoty force-pushed the helayoty/5832-podgroup-api branch 2 times, most recently from be2191b to 0111cb0 (January 23, 2026 18:33)
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 23, 2026
@helayoty helayoty force-pushed the helayoty/5832-podgroup-api branch from 0111cb0 to bf21536 (January 23, 2026 18:36)
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 23, 2026
@helayoty helayoty closed this Jan 23, 2026
@helayoty helayoty force-pushed the helayoty/5832-podgroup-api branch from bf21536 to 52db292 (January 23, 2026 18:37)
@github-project-automation github-project-automation bot moved this from Needs Triage to Closed in SIG Apps Jan 23, 2026
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 23, 2026
@helayoty helayoty reopened this Jan 23, 2026
@github-project-automation github-project-automation bot moved this from Closed to In Progress in SIG Scheduling Jan 23, 2026
@github-project-automation github-project-automation bot moved this from Closed to In Progress in SIG Apps Jan 23, 2026
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 23, 2026

### Risks and Mitigations

- Increased API call volume: More objects mean more API calls for creation, updates, and watches. The mitigation is to split the responsibility: `Workload` is rarely updated (as a policy object) while `PodGroup` handles runtime state. In addition, `PodGroups` allow per-replica sharding of status updates.
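
A minimal sketch of the intended split, using hypothetical, simplified Go shapes (not the KEP's actual API): the `Workload` holds slow-changing policy, while each `PodGroup` absorbs the high-frequency status writes.

```go
package gang

// Illustrative only: field names and shapes are simplified stand-ins
// for the proposed API, not the KEP's actual types.
type Workload struct {
	Name string
	// Policy fields: written once at creation, rarely updated.
	MinCount int32
}

type PodGroup struct {
	Name         string
	WorkloadName string // back-reference to the owning Workload
	// Runtime fields: updated by the scheduler as pods bind. Sharding
	// these writes across one PodGroup per replica avoids making the
	// single Workload object a status hot spot.
	ScheduledCount int32
}
```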
Contributor

NIT: ... The mitigation is to split the responsibility. The Workload object is rarely updated (as a template object), while the PodGroup handles runtime state:

```go
// PodGroupTemplateReference references the PodGroupTemplate object that
// defines the template used to create the PodGroup.
type PodGroupTemplateReference struct {
	// WorkloadName defines the name of the Workload object (Scheduling Policy) this Pod belongs to.
```
Contributor

NIT: I'd remove the "(Scheduling Policy)" or rename to "(Scheduling Policy Template)".

In the decoupled model, the actual policy is inside the PodGroup.


- Read `pod.spec.podGroupRef.name` to identify the `PodGroup`
- Lookup the `PodGroup` object to check its existence and to get the scheduling policy
- Read `pod.spec.podGroupRef.workloadName` to identify the Workload and check its existence
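
A minimal sketch of this lookup order, assuming hypothetical getter interfaces backed by PodGroup/Workload listers (in the real scheduler the two names would come from `pod.spec.podGroupRef`):

```go
package gang

import "fmt"

// PodGroup is a stand-in for the proposed API object.
type PodGroup struct{ Policy string }

// Hypothetical getters; the actual scheduler would use generated listers.
type podGroupGetter interface {
	Get(namespace, name string) (*PodGroup, error)
}
type workloadGetter interface {
	Exists(namespace, name string) (bool, error)
}

// resolveGang follows the steps above: fetch the PodGroup (which carries
// the scheduling policy), then confirm the referenced Workload exists.
func resolveGang(ns, podGroupName, workloadName string,
	podGroups podGroupGetter, workloads workloadGetter) (*PodGroup, error) {
	pg, err := podGroups.Get(ns, podGroupName)
	if err != nil {
		return nil, err // not-found: keep the pod gated until the PodGroup appears
	}
	ok, err := workloads.Exists(ns, workloadName)
	if err != nil {
		return nil, err
	}
	if !ok {
		return nil, fmt.Errorf("workload %q not found", workloadName)
	}
	return pg, nil
}
```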
Contributor

+1. The TL;DR change for the scheduler is: instead of reading the Workload, read the PodGroup. It contains all the information needed by the scheduler.

I'd write it explicitly (one sentence) here - maybe as the first sentence.


The `PodGroup` lifecycle needs to ensure that a `PodGroup` will not be deleted while any pod that references it is in a non-terminal phase (i.e., not `Succeeded` or `Failed`).

`PodGroup` objects are created with a dedicated finalizer that the controller is responsible for removing only when the deletion-safe condition is met. The mechanism for this is:
Contributor

Could we clarify here whether the controller is the true workload controller or one run by kube-controller-manager?

Member Author

Updated. @wojtek-t, can you please confirm the update is correct?

- If any referencing pod is non-terminal, the controller leaves the finalizer in place and re-enqueues (e.g., on pod updates)
- To find the referencing pods, we can use an index keyed by `workloadRef.podGroupName` (and optionally namespace) so the controller can efficiently list pods that reference a given `PodGroup`
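
A sketch of that reconciliation, assuming a pod index keyed by the pod's PodGroup reference; the finalizer string and helper names here are illustrative, not from the KEP:

```go
package gang

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodGroup is a stand-in embedding standard object metadata.
type PodGroup struct{ metav1.ObjectMeta }

// podGroupFinalizer is an illustrative name, not the KEP's.
const podGroupFinalizer = "scheduling.k8s.io/podgroup-protection"

// syncPodGroup removes the finalizer only once every referencing pod is
// terminal; podsFor would be backed by the index described above.
func syncPodGroup(pg *PodGroup, podsFor func(namespace, podGroupName string) []*v1.Pod) {
	if pg.DeletionTimestamp == nil {
		return // deletion not requested; nothing to protect yet
	}
	for _, p := range podsFor(pg.Namespace, pg.Name) {
		if p.Status.Phase != v1.PodSucceeded && p.Status.Phase != v1.PodFailed {
			// A referencing pod is still non-terminal: keep the finalizer
			// and rely on pod-update events to re-enqueue this PodGroup.
			return
		}
	}
	// All referencing pods are terminal: drop the finalizer in place;
	// the controller would then persist the change with an Update call.
	kept := pg.Finalizers[:0]
	for _, f := range pg.Finalizers {
		if f != podGroupFinalizer {
			kept = append(kept, f)
		}
	}
	pg.Finalizers = kept
}
```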

Deletion protection is not required for alpha (nice-to-have), but it is required for beta graduation.
Contributor

If this section is describing new functionality that will become part of kube-controller-manager only when this KEP graduates to beta, do we need to say that true workload controllers need to handle this themselves in the meantime? Or is this only relevant to #5736 so I should mention this there?

Member Author

IMHO, it's more relevant in #5547

Member

For me it's actually relevant here (it's rather infrastructure of the PodGroup API).

The reason I'm saying it's not alpha is that I want to ensure that we will be able to deliver everything. Alpha should prove the feature and the finalizer stuff is part of making it production-ready.

@nojnhuh - my claim is that given that no one will really use Alpha in production anyway, we can ignore that problem in Alpha and ensure that it's solved by the controller for Beta; I don't think you should be adding any custom logic for that on your side.

Member

@dom4ha dom4ha left a comment

All changes look good; waiting with approval for the update to workloadRef.


###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

The increase in CPU/memory consumption of kube-apiserver and kube-scheduler should be a negligible percentage of current resource usage.

Can you revisit this section especially the use of 'negligible'.

As you mentioned above in the "Informers and Watches" section:

The kube-scheduler will add a new informer to watch PodGroup objects and stop watching Workload objects.

While the design replaces a Workload informer with a PodGroup informer, the effective cardinality changes significantly. Instead of the scheduler's cache scaling with the number of high-level Workloads, it will now scale with the number of individual PodGroups.

What do you think?

Member Author

I agree that this section is a bit optimistic. Updated. PTAL.

@helayoty helayoty requested review from dom4ha, nojnhuh and x0rw February 10, 2026 22:14
@wojtek-t
Member

This looks great both with my scheduling hat as well as with my PRR hat.

/lgtm
/approve PRR

@dom4ha - I think this is ready for your (hopefully last) pass

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2026
Member

@dom4ha dom4ha left a comment

/approve

I noticed two leftovers; the rest is great. Thank you, Heba!


`PodGroup` status mirrors `Pod` status semantics:
- If pods are unschedulable (e.g., timeout, resources, affinity), the scheduler updates the `PodGroupScheduled` condition to `False` and sets the reason fields accordingly.
- If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the last pod in the gang completes binding.
Member

Suggested change:
```diff
-- If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the last pod in the gang completes binding.
+- If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the group got accepted by the Permit phase.
```
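
For illustration, a sketch of how the scheduler could flip that condition using the standard condition helpers; the reason strings below are assumptions, and (per the suggestion above) the `True` transition would happen once the gang passes Permit:

```go
package gang

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markPodGroupScheduled records the PodGroupScheduled condition on the
// PodGroup's status conditions. Reason strings are illustrative.
func markPodGroupScheduled(conds *[]metav1.Condition, admitted bool, msg string) {
	cond := metav1.Condition{
		Type:    "PodGroupScheduled",
		Status:  metav1.ConditionFalse,
		Reason:  "Unschedulable", // e.g. timeout, resources, affinity
		Message: msg,
	}
	if admitted {
		cond.Status = metav1.ConditionTrue
		cond.Reason = "GangAdmitted" // hypothetical reason name
	}
	// SetStatusCondition updates LastTransitionTime only when the
	// status actually changes.
	meta.SetStatusCondition(conds, cond)
}
```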


#### GangScheduling plugin

The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
Member

Suggested change:
```diff
-The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
+The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists. This is in addition to the following changes:
```
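
A minimal sketch of the resulting check in the plugin's `PreEnqueue`, assuming a hypothetical lookup helper over a `PodGroup` lister (the `Workload` is no longer consulted, per the suggestion above):

```go
package gang

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// GangScheduling is a simplified stand-in for the plugin; getPodGroup
// would be a closure over a PodGroup informer/lister.
type GangScheduling struct {
	getPodGroup func(namespace, name string) error
}

// podGroupNameFor is hypothetical: in the KEP the name comes from
// pod.spec.podGroupRef.name, not an annotation.
func podGroupNameFor(pod *v1.Pod) string { return pod.Annotations["example/pod-group"] }

func (pl *GangScheduling) PreEnqueue(ctx context.Context, pod *v1.Pod) *framework.Status {
	name := podGroupNameFor(pod)
	if name == "" {
		return nil // not gang-scheduled; admit as usual
	}
	if err := pl.getPodGroup(pod.Namespace, name); err != nil {
		// Keep the pod out of the active queue until its PodGroup exists.
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, err.Error())
	}
	return nil
}
```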

Contributor

@mm4tt mm4tt left a comment

One minor comment. Other than that
/lgtm

Thanks!


#### GangScheduling plugin

The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
Contributor

Above we're saying that the scheduler will stop watching Workload objects. Here we're saying that it "will check existence of PodGroup along with the Workload". Let's make it consistent - i.e. remove the "along with the Workload object".

@dom4ha
Member

dom4ha commented Feb 11, 2026

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dom4ha, helayoty, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2026
@k8s-ci-robot k8s-ci-robot merged commit ac4caf3 into kubernetes:master Feb 11, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 11, 2026
@github-project-automation github-project-automation bot moved this from In Progress to Done in SIG Scheduling Feb 11, 2026
@github-project-automation github-project-automation bot moved this from In Progress to Done in SIG Apps Feb 11, 2026
// - "Scheduled": All required pods have been successfully scheduled.
// - "Unschedulable": The PodGroup cannot be scheduled due to resource constraints,
// affinity/anti-affinity rules, or insufficient capacity for the gang.
// - "SchedulingGated": One or more pods in the PodGroup have scheduling gates
Member

I think setting the SchedulingGated condition would be difficult since sending any synchronous API calls from the scheduling queue is undesirable. The apiserver applies this condition to a pod when it sees that the pod has scheduling gates. However, I don't think we can do the same for pod groups.

Contributor

Are you saying we should remove this? Or can we just send the API call from another place in the scheduler code (e.g., a separate goroutine)?

Member

I'd rather remove it and reconsider when we have the fully working async API calls feature (#5229)

Member

+1 for removing it. I think that deciding the exact reasons we support is a discussion for the code review.

Contributor

Anyway, if this is non-trivial to implement then I don't think this should be part of alpha. Removing in #5912.

Member

We can remove it. We can send the Unschedulable status for Pods (maybe for the PodGroup as well) when they have been waiting too long in PreEnqueue on the min or desired count, but it's indeed dependent on the Async API calls feature, which introduces such a possibility.
