feat(api): Fix immutability of the TrainJob APIs#3157
feat(api): Fix immutability of the TrainJob APIs#3157andreyvelich wants to merge 5 commits intokubeflow:masterfrom
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR aims to enforce immutability rules for TrainJob spec fields (primarily via CRD-level XValidation) ahead of the API’s beta graduation.
Changes:
- Add/extend CRD
XValidationimmutability rules forTrainJobSpecfields (including conditional mutability forpodTemplateOverrides). - Update integration tests to cover immutability of
initializer,trainer, and conditional updates topodTemplateOverrides. - Adjust OpenAPI/CRD schema details for
PodTemplateOverride.targetJobslist semantics.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/apis/trainer/v1alpha1/trainjob_types.go |
Adds XValidation rules to enforce immutability and conditional mutability for podTemplateOverrides. |
test/integration/webhooks/trainjob_test.go |
Adds webhook-focused immutability tests for initializer, trainer, and podTemplateOverrides. |
test/integration/controller/trainjob_controller_test.go |
Updates controller integration test to validate podTemplateOverrides behavior across suspend/unsuspend. |
manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml |
Propagates the new schema validations into the base CRD manifest. |
charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml |
Propagates the new schema validations into the Helm CRD manifest. |
pkg/apis/trainer/v1alpha1/zz_generated.openapi.go |
Updates OpenAPI schema (notably list semantics for targetJobs) and descriptions. |
api/openapi-spec/swagger.json |
Updates published OpenAPI spec to match schema changes. |
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_train_job_spec.py |
Updates generated Python client model descriptions to match OpenAPI changes. |
pkg/client/applyconfiguration/trainer/v1alpha1/trainjobspec.go |
Updates applyconfiguration comments/metadata to align with schema changes. |
| } | ||
|
|
||
| // TrainJobSpec represents specification of the desired TrainJob. | ||
| // +kubebuilder:validation:XValidation:rule="(has(oldSelf.suspend) && oldSelf.suspend) || self.podTemplateOverrides == oldSelf.podTemplateOverrides", message="podTemplateOverrides are immutable when suspend is false" |
There was a problem hiding this comment.
The TrainJobSpec XValidation for podTemplateOverrides gates mutability on oldSelf.suspend, which prevents updating podTemplateOverrides in the same request that sets suspend=true (oldSelf.suspend is false) and diverges from the webhook/plugin logic that checks the new spec’s suspend value.
| // +kubebuilder:validation:XValidation:rule="(has(oldSelf.suspend) && oldSelf.suspend) || self.podTemplateOverrides == oldSelf.podTemplateOverrides", message="podTemplateOverrides are immutable when suspend is false" | |
| // +kubebuilder:validation:XValidation:rule="(has(self.suspend) && self.suspend) || self.podTemplateOverrides == oldSelf.podTemplateOverrides", message="podTemplateOverrides are immutable when suspend is false" |
| return job | ||
| }, | ||
| testingutil.BeInvalidError()), | ||
| ginkgo.Entry("Should success to update podTemplateOverride when suspend is true", |
There was a problem hiding this comment.
The test case name uses ungrammatical wording (“Should success to update…”), which makes intent harder to read when scanning failures; rename to “Should succeed to update …”.
| ginkgo.Entry("Should success to update podTemplateOverride when suspend is true", | |
| ginkgo.Entry("Should succeed to update podTemplateOverride when suspend is true", |
| g.Expect(k8sClient.Get(ctx, trainJobKey, trainJob)).Should(gomega.Succeed()) | ||
| trainJob.Spec.Trainer.Image = &originImageName | ||
| g.Expect(k8sClient.Update(ctx, trainJob)).Should(gomega.Succeed()) | ||
| }, util.Timeout, util.Interval).Should(gomega.Succeed()) | ||
|
|
||
| ginkgo.By("Checking if JobSet keep having updated Trainer image") | ||
| gomega.Consistently(func(g gomega.Gomega) { | ||
| jobSet := &jobsetv1alpha2.JobSet{} | ||
| g.Expect(k8sClient.Get(ctx, trainJobKey, jobSet)).Should(gomega.Succeed()) | ||
| for _, rJob := range jobSet.Spec.ReplicatedJobs { | ||
| if rJob.Name == constants.Node { | ||
| g.Expect(rJob.Template.Spec.Template.Spec.Containers[0].Image).Should(gomega.Equal(updatedImageName)) | ||
| } | ||
| } | ||
| }, util.ConsistentDuration, util.Interval).Should(gomega.Succeed()) | ||
|
|
||
| ginkgo.By("Trying to re-suspend TrainJob and restore Trainer image") | ||
| gomega.Eventually(func(g gomega.Gomega) { | ||
| g.Expect(k8sClient.Get(ctx, trainJobKey, trainJob)) | ||
| trainJob.Spec.Suspend = ptr.To(true) | ||
| trainJob.Spec.Trainer.Image = &originImageName | ||
| g.Expect(k8sClient.Update(ctx, trainJob)).Should(gomega.Succeed()) | ||
| }, util.Timeout, util.Interval).Should(gomega.Succeed()) | ||
|
|
||
| ginkgo.By("Checking if JobSet image is restored") | ||
| gomega.Eventually(func(g gomega.Gomega) { | ||
| jobSet := &jobsetv1alpha2.JobSet{} | ||
| g.Expect(k8sClient.Get(ctx, trainJobKey, jobSet)).Should(gomega.Succeed()) | ||
| g.Expect(jobSet.Spec.Suspend).ShouldNot(gomega.BeNil()) | ||
| g.Expect(*jobSet.Spec.Suspend).Should(gomega.BeTrue()) | ||
| for _, rJob := range jobSet.Spec.ReplicatedJobs { | ||
| if rJob.Name == constants.Node { | ||
| g.Expect(rJob.Template.Spec.Template.Spec.Containers[0].Image).Should(gomega.Equal(originImageName)) | ||
| } | ||
| } | ||
| trainJob.Spec.PodTemplateOverrides = emptyPodTemplateOverrides | ||
| g.Expect(k8sClient.Update(ctx, trainJob)).ShouldNot(gomega.Succeed()) | ||
| }, util.Timeout, util.Interval).Should(gomega.Succeed()) |
There was a problem hiding this comment.
This assertion only checks that the update fails, so the test could pass due to unrelated transient errors (e.g., conflicts) rather than the intended immutability validation; assert the error type/message (e.g., apierrors.IsInvalid / forbidden with expected field path) to make the test deterministic.
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
32b1dc7 to
6ac623c
Compare
Pull Request Test Coverage Report for Build 21719581477Details
💛 - Coveralls |
| // +listType=map | ||
| // +listMapKey=name |
There was a problem hiding this comment.
Are we specifying names for PodTemplateOverride?
There was a problem hiding this comment.
Don't we want the key to me the manager as introduced after #3020?
There was a problem hiding this comment.
Good point, that will be introduced for the top level struct: #3020 (comment) after @kaisoz changes.
// +listType=map
// +listMapKey=manager
TemplateOverrides []TemplateOverride `json:"templateOverrides,omitempty"`I am curious if we want to keep // +listMapKey=name for PodTemplateOverrideTargetJob or we can remove it?
IIRC, we introduced PodTemplateOverrideTargetJob struct instead of just []string since @tenzen-y wanted to support label selectors in the future. Context: #2296 (comment)
I guess, in that case we should keep this, right?
// +listType=atomic
TargetJobs []PodTemplateOverrideTargetJob `json:"targetJobs,omitempty"`There was a problem hiding this comment.
Right, I think we can keep +listType=atomic for targetJobs.
| // PodTemplateOverride represents a custom PodTemplateSpec override that will be applied to the TrainJob's training runtime. | ||
| type PodTemplateOverride struct { | ||
| // targetJobs is the list of replicated jobs in the training runtime template to apply the overrides. | ||
| // +listType=atomic |
There was a problem hiding this comment.
No strong view. Ideally (long term) we use +listType=map, but for that we need to have a unique identifier. This could be the manager field, but for that we would either have a webhook or MAP.
| // targetJobs is the list of replicated jobs in the training runtime template to apply the overrides. | ||
| // +listType=atomic | ||
| // +listType=map | ||
| // +listMapKey=name |
There was a problem hiding this comment.
Do we also validate if job name is valid? i.e it exists in the runtime jobs
There was a problem hiding this comment.
Yeah, this is good point.
Let me create followup issue for this, since we need to use Validation Webhook for it.
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
| Initializer *Initializer `json:"initializer,omitempty"` | ||
|
|
||
| // trainer defines the configuration of the trainer. | ||
| // +kubebuilder:validation:XValidation:rule="self == oldSelf", message="field is immutable" |
There was a problem hiding this comment.
Would that make sense to be consistent with what we did in #2683 for PodTemplateOverrides that can be mutated when the TrainJob is suspended or you think we it's better to be more conservative?
There was a problem hiding this comment.
Currently, JobSet/Job doesn't allow us to modify those fields even if TrainJob is suspended (image, command, env, numNodes, etc.) Which leads to a controller error if users try to update these fields.
| } | ||
|
|
||
| // TrainJobSpec represents specification of the desired TrainJob. | ||
| // +kubebuilder:validation:XValidation:rule="self.suspend == true || (has(self.podTemplateOverrides) == has(oldSelf.podTemplateOverrides) && (!has(self.podTemplateOverrides) || self.podTemplateOverrides == oldSelf.podTemplateOverrides))", message="podTemplateOverrides are immutable when suspend is false" |
There was a problem hiding this comment.
One of the problem I see with this rule is that we also block updates to PodTemplateMetadata until TrainJob is suspended. Not sure if that is right approach.
trainer/pkg/apis/trainer/v1alpha1/trainjob_types.go
Lines 273 to 276 in e0b3ef4
I would prefer to remove this CEL rule for now, and introduce validation for PodTemplateOverrides/TemplateOverrides in the followup PR after refactoring the API
WDYT @kaisoz @astefanutti @mimowo @tenzen-y @akshaychitneni ?
There was a problem hiding this comment.
Yeah, so Kueue will need to be able to modify the podTemplateOverrides when:
- old.suspend=true -> new.suspend=false - resuming a Job - Kueue will want to inject nodeSelectors
- old.suspend=false -> new.suspend=true -> request to suspend a Job -- Kueue will want to revert the nodeSelectors
Also, any requests which preserve suspend=true -> suspend=true are fine.
I think the rules should only reject requests when old.suspend=false -> new.suspend=false.
There was a problem hiding this comment.
Let me remove this CEL for now, and we can iterate with validation in the next PRs.
c5a2345 to
db2bb3a
Compare
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
db2bb3a to
727d99d
Compare
As discussed during our Trainer calls, we want to properly enforce immutability for the TrainJob API fields. It’s best to address this now, before the API graduates to beta, when we can't introduce breaking changes.
Currently, when user tries to update
.spec.trainer.image, the JobSet controller fails due to immutability of this field.I would prefer we gradually introduce mutability when we get more use-cases from users.
/hold for review
/assign @tenzen-y @astefanutti @akshaychitneni @kaisoz
cc @mimowo @kannon92