KEP-5832: Decouple PodGroup API from Workload API #5833
Conversation
> ### Risks and Mitigations
>
> - Increase API calls volume: More objects means more API calls for creation, updates, and watches. The mitigation is split the responsibility.`Workload` is rarely updated (as a policy object) while `PodGroup` handles runtime state. In addition, `PodGroups` allow per-replica sharding of status updates.
NIT: ... The mitigation is to split the responsibility. The Workload object is rarely updated (as a template object), while the PodGroup handles runtime state:
> // PodGroupTemplateReference references the PodGroupTemplate object that
> // defines the template used to create the PodGroup.
> type PodGroupTemplateReference struct {
>     // WorkloadName defines the name of the Workload object (Scheduling Policy) this Pod belongs to.
NIT: I'd remove the "(Scheduling Policy)" or rename to "(Scheduling Policy Template)".
In the decoupled model, the actual policy is inside the PodGroup.
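To make the split concrete, here is a minimal Go sketch of the two references, based only on the fragments quoted above; every field and type name beyond `PodGroupTemplateReference` and `WorkloadName` is a placeholder, not the proposed API:

```go
// Sketch only: shapes are illustrative placeholders for the decoupled model,
// where the Workload carries the template and the PodGroup carries runtime state.

// PodGroupTemplateReference references the PodGroupTemplate object that
// defines the template used to create the PodGroup.
type PodGroupTemplateReference struct {
	// WorkloadName is the name of the Workload object this Pod belongs to.
	WorkloadName string `json:"workloadName"`
}

// PodGroupRef is a hypothetical shape for pod.spec.podGroupRef: the pod points
// at its runtime PodGroup, which holds the actual scheduling policy and status.
type PodGroupRef struct {
	// Name of the PodGroup object in the Pod's namespace.
	Name string `json:"name"`
	// WorkloadName of the owning Workload (template object), kept for lineage checks.
	WorkloadName string `json:"workloadName"`
}
```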
> - Read `pod.spec.podGroupRef.name` to identify the `PodGroup`
> - Lookup the `PodGroup` object to check its existence and to get the scheduling policy
> - Read `pod.spec.podGroupRef.workloadName` to identify the Workload and check its existence
+1. The TL;DR change for the scheduler is: instead of reading the Workload, read the PodGroup. It contains all the information needed by the scheduler.
I'd write it explicitly (one sentence) here - maybe as the first sentence.
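For illustration, a minimal sketch of that lookup in Go; `PodGroup`, `PodGroupLister`, and the `podGroupRefName` accessor are placeholders for the proposed API (only `corev1.Pod` is real), and error handling is elided:

```go
import corev1 "k8s.io/api/core/v1"

// Placeholders for the proposed API; these types do not exist in client-go today.
type PodGroup struct{ /* scheduling policy + runtime status */ }

type PodGroupLister interface {
	Get(namespace, name string) (*PodGroup, error)
}

// podGroupRefName stands in for reading pod.spec.podGroupRef.name, which would
// be a typed field once the API exists.
func podGroupRefName(pod *corev1.Pod) string {
	return pod.Annotations["example.com/pod-group-name"] // illustrative stand-in only
}

// podGroupForPod shows the TL;DR: the scheduler reads the PodGroup directly and
// no longer needs to read the Workload object.
func podGroupForPod(pod *corev1.Pod, lister PodGroupLister) (*PodGroup, error) {
	name := podGroupRefName(pod)
	if name == "" {
		return nil, nil // pod is not part of a gang
	}
	return lister.Get(pod.Namespace, name)
}
```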
> The `PodGroup` lifecycle needs to make sure that `PodGroup` will not be deleted while any pod that references it is in a non-terminal phase (i.e. not `Succeeded` or `Failed`).
>
> `PodGroup` objects are created with a dedicated finalizer that the controller is responsible for removing only when the deletion-safe condition is met. The mechanism for this is:
Could we clarify here whether the controller is the true workload controller or one run by kube-controller-manager?
Updated. @wojtek-t can you please confirm the update is correct?
> - Any referencing pod is non-terminal, the controller leaves the finalizer in place and re-enqueues (i.e., on pod updates)
> - To find the referencing pods, we can use an index keyed by `workloadRef.podGroupName` (and optionally namespace) so the controller can efficiently list pods that reference a given `PodGroup`
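A minimal sketch of that loop in Go, under the assumption that the controller keeps a pod indexer keyed by the referenced PodGroup; the finalizer name, index name, and `removePodGroupFinalizer` helper are hypothetical:

```go
import corev1 "k8s.io/api/core/v1"

const (
	podGroupFinalizer = "scheduling.k8s.io/pod-group-protection" // placeholder name
	podsByPodGroupIdx = "podsByPodGroup"                         // index over workloadRef.podGroupName
)

// syncPodGroup decides whether the protection finalizer can be removed from a
// PodGroup that is being deleted. c.podIndexer is a cache.Indexer over pods.
func (c *controller) syncPodGroup(pg *PodGroup) error {
	if pg.DeletionTimestamp == nil {
		return nil // not being deleted; nothing to do
	}
	// Use the index to list only pods referencing this PodGroup instead of
	// scanning every pod in the namespace.
	objs, err := c.podIndexer.ByIndex(podsByPodGroupIdx, pg.Namespace+"/"+pg.Name)
	if err != nil {
		return err
	}
	for _, obj := range objs {
		pod := obj.(*corev1.Pod)
		if pod.Status.Phase != corev1.PodSucceeded && pod.Status.Phase != corev1.PodFailed {
			// A referencing pod is still non-terminal: keep the finalizer and
			// rely on re-enqueueing on the next pod update.
			return nil
		}
	}
	// All referencing pods are terminal (or gone): safe to release the PodGroup.
	return c.removePodGroupFinalizer(pg, podGroupFinalizer) // hypothetical helper issuing the patch
}
```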
> Deletion protection is not required for alpha (nice-to-have), however it is required for beta graduation.
If this section is describing new functionality that will become part of kube-controller-manager only when this KEP graduates to beta, do we need to say that true workload controllers need to handle this themselves in the meantime? Or is this only relevant to #5736 so I should mention this there?
IMHO, it's more relevant in #5547
For me it's actually relevant here (it's rather infrastructure of the PodGroup API).
The reason I'm saying it's not alpha is that I want to ensure that we will be able to deliver everything. Alpha should prove the feature and the finalizer stuff is part of making it production-ready.
@nojnhuh - my claim is that given that no one will really use Alpha in production anyway, we can ignore that problem in Alpha and ensure that it's solved by the controller for Beta; I don't think you should be adding any custom logic for that on your side.
dom4ha left a comment:
All changes look good; waiting with approval for the update to workloadRef.
> ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
>
> The increase of CPU/MEM consumption of kube-apiserver and kube-scheduler should be negligible percentage of the current resource usage.
Can you revisit this section, especially the use of 'negligible'?
As you mentioned above in the "Informers and Watches" section:

> The kube-scheduler will add a new informer to watch `PodGroup` objects and stop watching `Workload` objects.
While the design replaces a Workload informer with a PodGroup informer, the effective cardinality changes significantly. Instead of the scheduler's cache scaling with the number of high-level Workloads, it will now scale with the number of individual PodGroups.
What do you think?
I agree that this section is a bit optimistic. Updated. PTAL.
This looks great both with my scheduling hat as well as with my PRR hat.
/lgtm
@dom4ha - I think this is ready for your (hopefully last) pass
dom4ha left a comment:
/approve
I noticed two leftovers, the rest is great. Thank you Heba!
> `PodGroup` status mirrors `Pod` status semantics:
> - If pods are unschedulable (i.e., timeout, resources, affinity, etc.), the scheduler updates the `PodGroupScheduled` condition to `False` and sets the reason fields accordingly.
> - If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the last pod in the gang completes binding.
Suggested change:

> - If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the group got accepted by the Permit phase.
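As a small illustration of what the Permit-phase variant could write, using the real `meta.SetStatusCondition` helper from apimachinery; the `PodGroup` type, its `Status.Conditions` field, and the reason string are assumptions from this KEP rather than an existing API:

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markPodGroupScheduled records that the whole gang was accepted by the Permit
// phase. pg.Status.Conditions is assumed to be a []metav1.Condition.
func markPodGroupScheduled(pg *PodGroup) {
	meta.SetStatusCondition(&pg.Status.Conditions, metav1.Condition{
		Type:    "PodGroupScheduled",
		Status:  metav1.ConditionTrue,
		Reason:  "Scheduled", // illustrative reason; exact values are a code-review decision
		Message: "all pods in the gang were accepted by the Permit phase",
	})
	// The status update itself would then go through the PodGroup client,
	// ideally batched or asynchronous so it does not block scheduling.
}
```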
> #### GangScheduling plugin
>
> The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
Suggested change:

> The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists. This is in addition to the following changes:
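A rough sketch of the consistent version (PodGroup only, no Workload lookup) as a PreEnqueue-style check; the `gangScheduling` plugin struct, its `podGroups` lister, and `podGroupRefName` are placeholders, while the `PreEnqueue` signature and status codes come from the in-tree scheduler framework:

```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// PreEnqueue gates gang pods on the existence of their PodGroup. Only the
// PodGroup is consulted; the Workload object is no longer watched by the scheduler.
func (pl *gangScheduling) PreEnqueue(ctx context.Context, pod *v1.Pod) *framework.Status {
	name := podGroupRefName(pod) // hypothetical pod.spec.podGroupRef.name accessor
	if name == "" {
		return nil // not a gang pod; nothing to check
	}
	if _, err := pl.podGroups.Get(pod.Namespace, name); err != nil {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable,
			"referenced PodGroup does not exist yet")
	}
	return nil
}
```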
mm4tt left a comment:
One minor comment. Other than that
/lgtm
Thanks!
> #### GangScheduling plugin
>
> The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
Above we're saying that the scheduler will stop watching Workload objects. Here we're saying that it "will check existence of PodGroup along with the Workload". Let's make it consistent - i.e. remove the "along with the Workload object".
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dom4ha, helayoty, wojtek-t.
> // - "Scheduled": All required pods have been successfully scheduled.
> // - "Unschedulable": The PodGroup cannot be scheduled due to resource constraints,
> //   affinity/anti-affinity rules, or insufficient capacity for the gang.
> // - "SchedulingGated": One or more pods in the PodGroup have scheduling gates
I think setting the SchedulingGated condition would be difficult since sending any synchronous API calls from the scheduling queue is undesirable. The apiserver applies this condition for scheduling gates on pods when it sees that the pod has scheduling gates. However, I don't think we can do the same for pod groups.
Are you saying we should remove this? Or can we just send the API call from other place in scheduler code (e.g. a separate go-routine)?
I'd rather remove it and reconsider when we have the fully working async API calls feature (#5229)
+1 for removing it - I think that deciding exact reasons we support is the discussion for the code review
Anyway, if this is non-trivial to implement then I don't think this should be part of alpha. Removing in #5912.
We can remove it. We can send Unschedulable status for Pods (maybe for the PodGroup as well) when they are waiting too long in PreEnqueue on min or desired count, but it's indeed dependent on the Async API calls feature, which introduces such a possibility.
One-line PR description: Decouple PodGroup API as a runtime object from Workload API
Issue link: WAS: Decouple PodGroup API #5832
Other comments: This PR is part of the workload-aware scheduling workstream to implement gang-scheduling. It introduces a new separate runtime object, `PodGroup`, that helps keep the `Workload` API simple and user-friendly.
/sig scheduling