Skip to content

feat(api): Fix immutability of the TrainJob APIs#3157

Open
andreyvelich wants to merge 5 commits intokubeflow:masterfrom
andreyvelich:fix-immutable-apis
Open

feat(api): Fix immutability of the TrainJob APIs#3157
andreyvelich wants to merge 5 commits intokubeflow:masterfrom
andreyvelich:fix-immutable-apis

Conversation

@andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Feb 3, 2026

As discussed during our Trainer calls, we want to properly enforce immutability for the TrainJob API fields. It’s best to address this now, before the API graduates to beta, when we can't introduce breaking changes.

Currently, when user tries to update .spec.trainer.image, the JobSet controller fails due to immutability of this field.

I would prefer we gradually introduce mutability when we get more use-cases from users.

/hold for review
/assign @tenzen-y @astefanutti @akshaychitneni @kaisoz
cc @mimowo @kannon92

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Copilot AI review requested due to automatic review settings February 3, 2026 01:53
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to enforce immutability rules for TrainJob spec fields (primarily via CRD-level XValidation) ahead of the API’s beta graduation.

Changes:

  • Add/extend CRD XValidation immutability rules for TrainJobSpec fields (including conditional mutability for podTemplateOverrides).
  • Update integration tests to cover immutability of initializer, trainer, and conditional updates to podTemplateOverrides.
  • Adjust OpenAPI/CRD schema details for PodTemplateOverride.targetJobs list semantics.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/apis/trainer/v1alpha1/trainjob_types.go Adds XValidation rules to enforce immutability and conditional mutability for podTemplateOverrides.
test/integration/webhooks/trainjob_test.go Adds webhook-focused immutability tests for initializer, trainer, and podTemplateOverrides.
test/integration/controller/trainjob_controller_test.go Updates controller integration test to validate podTemplateOverrides behavior across suspend/unsuspend.
manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml Propagates the new schema validations into the base CRD manifest.
charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml Propagates the new schema validations into the Helm CRD manifest.
pkg/apis/trainer/v1alpha1/zz_generated.openapi.go Updates OpenAPI schema (notably list semantics for targetJobs) and descriptions.
api/openapi-spec/swagger.json Updates published OpenAPI spec to match schema changes.
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_train_job_spec.py Updates generated Python client model descriptions to match OpenAPI changes.
pkg/client/applyconfiguration/trainer/v1alpha1/trainjobspec.go Updates applyconfiguration comments/metadata to align with schema changes.

}

// TrainJobSpec represents specification of the desired TrainJob.
// +kubebuilder:validation:XValidation:rule="(has(oldSelf.suspend) && oldSelf.suspend) || self.podTemplateOverrides == oldSelf.podTemplateOverrides", message="podTemplateOverrides are immutable when suspend is false"
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TrainJobSpec XValidation for podTemplateOverrides gates mutability on oldSelf.suspend, which prevents updating podTemplateOverrides in the same request that sets suspend=true (oldSelf.suspend is false) and diverges from the webhook/plugin logic that checks the new spec’s suspend value.

Suggested change
// +kubebuilder:validation:XValidation:rule="(has(oldSelf.suspend) && oldSelf.suspend) || self.podTemplateOverrides == oldSelf.podTemplateOverrides", message="podTemplateOverrides are immutable when suspend is false"
// +kubebuilder:validation:XValidation:rule="(has(self.suspend) && self.suspend) || self.podTemplateOverrides == oldSelf.podTemplateOverrides", message="podTemplateOverrides are immutable when suspend is false"

Copilot uses AI. Check for mistakes.
return job
},
testingutil.BeInvalidError()),
ginkgo.Entry("Should success to update podTemplateOverride when suspend is true",
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case name uses ungrammatical wording (“Should success to update…”), which makes intent harder to read when scanning failures; rename to “Should succeed to update …”.

Suggested change
ginkgo.Entry("Should success to update podTemplateOverride when suspend is true",
ginkgo.Entry("Should succeed to update podTemplateOverride when suspend is true",

Copilot uses AI. Check for mistakes.
Comment on lines 303 to 306
g.Expect(k8sClient.Get(ctx, trainJobKey, trainJob)).Should(gomega.Succeed())
trainJob.Spec.Trainer.Image = &originImageName
g.Expect(k8sClient.Update(ctx, trainJob)).Should(gomega.Succeed())
}, util.Timeout, util.Interval).Should(gomega.Succeed())

ginkgo.By("Checking if JobSet keep having updated Trainer image")
gomega.Consistently(func(g gomega.Gomega) {
jobSet := &jobsetv1alpha2.JobSet{}
g.Expect(k8sClient.Get(ctx, trainJobKey, jobSet)).Should(gomega.Succeed())
for _, rJob := range jobSet.Spec.ReplicatedJobs {
if rJob.Name == constants.Node {
g.Expect(rJob.Template.Spec.Template.Spec.Containers[0].Image).Should(gomega.Equal(updatedImageName))
}
}
}, util.ConsistentDuration, util.Interval).Should(gomega.Succeed())

ginkgo.By("Trying to re-suspend TrainJob and restore Trainer image")
gomega.Eventually(func(g gomega.Gomega) {
g.Expect(k8sClient.Get(ctx, trainJobKey, trainJob))
trainJob.Spec.Suspend = ptr.To(true)
trainJob.Spec.Trainer.Image = &originImageName
g.Expect(k8sClient.Update(ctx, trainJob)).Should(gomega.Succeed())
}, util.Timeout, util.Interval).Should(gomega.Succeed())

ginkgo.By("Checking if JobSet image is restored")
gomega.Eventually(func(g gomega.Gomega) {
jobSet := &jobsetv1alpha2.JobSet{}
g.Expect(k8sClient.Get(ctx, trainJobKey, jobSet)).Should(gomega.Succeed())
g.Expect(jobSet.Spec.Suspend).ShouldNot(gomega.BeNil())
g.Expect(*jobSet.Spec.Suspend).Should(gomega.BeTrue())
for _, rJob := range jobSet.Spec.ReplicatedJobs {
if rJob.Name == constants.Node {
g.Expect(rJob.Template.Spec.Template.Spec.Containers[0].Image).Should(gomega.Equal(originImageName))
}
}
trainJob.Spec.PodTemplateOverrides = emptyPodTemplateOverrides
g.Expect(k8sClient.Update(ctx, trainJob)).ShouldNot(gomega.Succeed())
}, util.Timeout, util.Interval).Should(gomega.Succeed())
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion only checks that the update fails, so the test could pass due to unrelated transient errors (e.g., conflicts) rather than the intended immutability validation; assert the error type/message (e.g., apierrors.IsInvalid / forbidden with expected field path) to make the test deterministic.

Copilot uses AI. Check for mistakes.
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@coveralls
Copy link

coveralls commented Feb 3, 2026

Pull Request Test Coverage Report for Build 21719581477

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 51.217%

Totals Coverage Status
Change from base Build 21715897523: 0.0%
Covered Lines: 1241
Relevant Lines: 2423

💛 - Coveralls

Comment on lines 269 to 270
// +listType=map
// +listMapKey=name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we specifying names for PodTemplateOverride?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want the key to me the manager as introduced after #3020?

Copy link
Member Author

@andreyvelich andreyvelich Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, that will be introduced for the top level struct: #3020 (comment) after @kaisoz changes.

// +listType=map
// +listMapKey=manager
TemplateOverrides []TemplateOverride `json:"templateOverrides,omitempty"`

I am curious if we want to keep // +listMapKey=name for PodTemplateOverrideTargetJob or we can remove it?

IIRC, we introduced PodTemplateOverrideTargetJob struct instead of just []string since @tenzen-y wanted to support label selectors in the future. Context: #2296 (comment)

I guess, in that case we should keep this, right?

// +listType=atomic
TargetJobs []PodTemplateOverrideTargetJob `json:"targetJobs,omitempty"`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think we can keep +listType=atomic for targetJobs.

// PodTemplateOverride represents a custom PodTemplateSpec override that will be applied to the TrainJob's training runtime.
type PodTemplateOverride struct {
// targetJobs is the list of replicated jobs in the training runtime template to apply the overrides.
// +listType=atomic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember we had a discussion on this w.r.t. Kueue manager field.
@tenzen-y @kaisoz @mimowo if you have views on this.

Copy link

@mimowo mimowo Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong view. Ideally (long term) we use +listType=map, but for that we need to have a unique identifier. This could be the manager field, but for that we would either have a webhook or MAP.

// targetJobs is the list of replicated jobs in the training runtime template to apply the overrides.
// +listType=atomic
// +listType=map
// +listMapKey=name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also validate if job name is valid? i.e it exists in the runtime jobs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is good point.
Let me create followup issue for this, since we need to use Validation Webhook for it.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Initializer *Initializer `json:"initializer,omitempty"`

// trainer defines the configuration of the trainer.
// +kubebuilder:validation:XValidation:rule="self == oldSelf", message="field is immutable"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that make sense to be consistent with what we did in #2683 for PodTemplateOverrides that can be mutated when the TrainJob is suspended or you think we it's better to be more conservative?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, JobSet/Job doesn't allow us to modify those fields even if TrainJob is suspended (image, command, env, numNodes, etc.) Which leads to a controller error if users try to update these fields.

}

// TrainJobSpec represents specification of the desired TrainJob.
// +kubebuilder:validation:XValidation:rule="self.suspend == true || (has(self.podTemplateOverrides) == has(oldSelf.podTemplateOverrides) && (!has(self.podTemplateOverrides) || self.podTemplateOverrides == oldSelf.podTemplateOverrides))", message="podTemplateOverrides are immutable when suspend is false"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the problem I see with this rule is that we also block updates to PodTemplateMetadata until TrainJob is suspended. Not sure if that is right approach.

// metadata overrides the Pod template metadata.
// These values will be merged with the TrainingRuntime's Pod template metadata.
// +optional
Metadata *metav1.ObjectMeta `json:"metadata,omitempty"`

I would prefer to remove this CEL rule for now, and introduce validation for PodTemplateOverrides/TemplateOverrides in the followup PR after refactoring the API

WDYT @kaisoz @astefanutti @mimowo @tenzen-y @akshaychitneni ?

Copy link

@mimowo mimowo Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so Kueue will need to be able to modify the podTemplateOverrides when:

  1. old.suspend=true -> new.suspend=false - resuming a Job - Kueue will want to inject nodeSelectors
  2. old.suspend=false -> new.suspend=true -> request to suspend a Job -- Kueue will want to revert the nodeSelectors

Also, any requests which preserve suspend=true -> suspend=true are fine.

I think the rules should only reject requests when old.suspend=false -> new.suspend=false.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me remove this CEL for now, and we can iterate with validation in the next PRs.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants