CLOUDP-286686 Fix issue with race conditions for creating statefulsets #746
Conversation
        return workflow.Failed(err)
    }
    if needToRequeue {
        return workflow.OK().Requeue()
Replaced the workflow.OK().Requeue() call with workflow.Pending().Requeue(). This is done by catching the create.StatefulSetIsRecreating error in mongodb-kubernetes/controllers/operator/mongodbopsmanager_controller.go, lines 825 to 834 in 96a6704:

    mutatedSts, err := r.createBackupDaemonStatefulset(ctx, reconcilerHelper, appDBConnectionString, memberCluster, initOpsManagerImage, opsManagerImage, log)
    if err != nil {
        // Check if it is a k8s error or a custom one
        var statefulSetIsRecreatingError create.StatefulSetIsRecreating
        if errors.As(err, &statefulSetIsRecreatingError) {
            return workflow.Pending("%s", statefulSetIsRecreatingError.Error()).Requeue()
        }
        return workflow.Failed(xerrors.Errorf("error creating Backup Daemon statefulset in member cluster %s: %w", memberCluster.Name, err))
    }
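For context, errors.As can only match create.StatefulSetIsRecreating here because it is a concrete error type that survives wrapping with %w. A minimal sketch of what such a type could look like (the actual definition in the create package is not shown in this diff and may differ):

    package create

    import "fmt"

    // StatefulSetIsRecreating signals that a StatefulSet is currently being
    // deleted and re-created, so reconciliation should be requeued rather than
    // treated as a failure. Hypothetical sketch; the field is illustrative.
    type StatefulSetIsRecreating struct {
        Name string
    }

    func (e StatefulSetIsRecreating) Error() string {
        return fmt.Sprintf("statefulset %s is being recreated, requeuing reconciliation", e.Name)
    }

Because the type implements the error interface, errors.As(err, &statefulSetIsRecreatingError) still matches it after the error has been wrapped with xerrors.Errorf("...: %w", err).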
    // GetFilePathFromAnnotationOrDefault returns a concatenation of a default path and an annotation, or a default value
    // if the annotation is not present.
    func GetFilePathFromAnnotationOrDefault(sts appsv1.StatefulSet, key string, path string, defaultValue string) string {
unused function
    fi

    # Remove stale health check file if exists
    rm -f "${MMS_LOG_DIR}/agent-health-status.json" 2>/dev/null || true
This is one of the important changes in the PR. With spec.persistent = true the ${MMS_LOG_DIR} directory is mounted using a PVC, which means that during pod recreation we are not losing any logs. At the same time the agent-health-status.json file, which is only relevant to the current container instance, is preserved. This is problematic, because our readiness probe uses this file as the source of deployment status, and if it is stale we can mark the container as ready too early, while in fact it is still booting up.
This line makes sure that we delete the ${MMS_LOG_DIR}/agent-health-status.json file whenever we start the agent, so it has a clean state. As an alternative I've considered moving ${MMS_LOG_DIR}/agent-health-status.json to an emptyDir volume in the container spec, but that would be a larger change. If you think that would be a better solution, feel free to add comments and I will create a ticket for improving the current mechanism.
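For illustration, the emptyDir alternative mentioned above could look roughly like this when building the pod spec. This is only a sketch, not the operator's code; the helper name, volume name, and mount path are hypothetical:

    import corev1 "k8s.io/api/core/v1"

    // addEphemeralHealthStatusVolume illustrates the emptyDir alternative: the
    // agent health file lives on a pod-scoped volume that disappears together
    // with the pod, so it can never be stale after recreation.
    // Hypothetical helper; names and mount path are illustrative only.
    func addEphemeralHealthStatusVolume(podSpec *corev1.PodSpec, container *corev1.Container, mountPath string) {
        podSpec.Volumes = append(podSpec.Volumes, corev1.Volume{
            Name:         "agent-health-status",
            VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
        })
        container.VolumeMounts = append(container.VolumeMounts, corev1.VolumeMount{
            Name:      "agent-health-status",
            MountPath: mountPath,
        })
    }

The trade-off mentioned above still applies: this touches the StatefulSet/pod spec rather than the launcher script, so it is a larger change than deleting the file at startup.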
As discussed in the meeting, we have 3 options:
- delete the file at the beginning of the launcher
- change the agent option to use a different path from the get-go
- create a new emptyDir and use that one instead
Generally, we want to be careful with any changes in the launcher as they are difficult to test and verify. We should decide together whether we want this solution for now and remove it later, or have something different from the get-go.
    isReady := s.updated == s.ready &&
        s.ready == s.wanted &&
        s.observedGeneration == s.generation &&
        s.observedGeneration == s.expectedGeneration &&
This is the main logic change for asserting StatefulSet readiness.
nammn left a comment:
Some quick comments, still looking at the rest. But generally LGTM, really important changes, especially the agent-health-status file and the generation fix.
changelog/20260204_fix_extended_unavailability_during_upgrade.md (outdated, resolved)
    if !statefulsetStatus.IsOK() {
        return statefulsetStatus
    }
So this is unnecessary? Won't that break the quicker initial deployment?
It might look like the reconciliation logic has changed, but in fact it didn't; only the code was simplified.
if len(processes) > 0 is equivalent to the if !scalingFirstTime check we have in other places (I will add this variable here, it makes sense to unify it with other controllers). So whenever we have processes (!scalingFirstTime), we want to check the statefulset statuses one by one.
In the opposite scenario, where we are scaling for the first time, we create all sts'es at once and check the overall, merged status of all sts'es:

        // [some code]
        statefulsetStatus := statefulset.GetStatefulSetStatus(ctx, sts.Namespace, sts.Name, expectedGeneration, memberClient)
        workflowStatus = workflowStatus.Merge(statefulsetStatus)
    }
    // wait for all statefulsets to become ready
    if !workflowStatus.IsOK() {
        return workflowStatus

This is the exact same behaviour as previously, but coded in a simpler way.
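To make the two paths easier to compare, here is a compressed, self-contained sketch of the control flow described above. The status type, deployOne callback, and clusters slice are stand-ins for the operator's workflow.Status and per-cluster deployment logic, not the actual code:

    // status is a minimal stand-in for the operator's workflow.Status.
    type status struct {
        ok  bool
        msg string
    }

    func (s status) IsOK() bool { return s.ok }

    // Merge keeps the "worst" of two statuses: a not-OK status wins.
    func (s status) Merge(other status) status {
        if !s.ok {
            return s
        }
        return other
    }

    // deployAll sketches the two reconciliation paths: one-at-a-time for an
    // existing deployment, all-at-once (with a merged status) for the first
    // scaling. deployOne and clusters are hypothetical.
    func deployAll(clusters []string, scalingFirstTime bool, deployOne func(string) status) status {
        if !scalingFirstTime {
            // Existing deployment: stop at the first StatefulSet that is not
            // ready yet, so only one member cluster rolls at a time.
            for _, c := range clusters {
                if st := deployOne(c); !st.IsOK() {
                    return st
                }
            }
            return status{ok: true}
        }
        // First-time deployment: create everything, then wait on the merged
        // status of all StatefulSets before reporting the resource as ready.
        merged := status{ok: true}
        for _, c := range clusters {
            merged = merged.Merge(deployOne(c))
        }
        return merged
    }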
Julien-Ben left a comment:
Great PR 👏
Let's discuss the agent-launcher change before merging, but LGTM.
        log.Infof("Successfully ensured StatefulSet in cluster: %s", item.ClusterName)
    } else {
        // We create all sts in parallel and wait below for all of them to finish
Nice cleanup, I like that we're getting a bit closer to a proper state machine design
    @@ -1349,17 +1348,35 @@ func (r *ShardedClusterReconcileHelper) createKubernetesResources(ctx context.Co
    }

    func (r *ShardedClusterReconcileHelper) createOrUpdateMongos(ctx context.Context, s *mdbv1.MongoDB, opts deploymentOptions, log *zap.SugaredLogger) workflow.Status {
Very good catch!
docker/mongodb-kubernetes-init-database/content/agent-launcher.sh (outdated, resolved)
changelog/20260204_fix_extended_unavailability_during_upgrade.md (outdated, resolved)
nammn left a comment:
LGTM, not blocking on the agent-launcher topic, but we should decide on Slack how we want to handle this short term.
    ---
    kind: fix
    date: 2026-02-04
    ---

    * Fixed `Statefulset` update logic that might result in triggering rolling restart in more than one member cluster at a time.
LGTM!
Summary
This pull request introduces several important fixes and improvements to the operator's StatefulSet management logic, particularly around multi-cluster and sharded deployments. The main focus is on making StatefulSet creation and updates more robust, improving error handling, and ensuring more reliable reconciliation during upgrades and migrations.
Several controller and helper function signatures have been updated to return the mutated StatefulSet objects that are required for the meta.generation and status.observedGeneration comparison.

GetStatefulSetStatus race condition fix:
The GetStatefulSetStatus() method signature now accepts an expectedGeneration param. Previously, when the StatefulSet was updated, we didn't compare the returned meta.generation with status.observedGeneration when calling GetStatefulSetStatus(). This led to a race condition if GetStatefulSetStatus() returned a stale resource with a previous generation.
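To show what the generation guard buys us, here is a minimal, self-contained sketch of a generation-aware status check. It is not the operator's GetStatefulSetStatus implementation; the result type and function name are stand-ins, and only the comparison logic mirrors the fix:

    import (
        "context"
        "fmt"

        appsv1 "k8s.io/api/apps/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // stsStatus is a hypothetical stand-in for the operator's workflow.Status.
    type stsStatus struct {
        OK      bool
        Message string
    }

    // getStatefulSetStatus sketches the race-condition fix: even if the (possibly
    // cached) read returns a StatefulSet whose status predates the latest update,
    // the expectedGeneration comparison prevents reporting it as ready.
    func getStatefulSetStatus(ctx context.Context, c client.Client, namespace, name string, expectedGeneration int64) stsStatus {
        var sts appsv1.StatefulSet
        if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &sts); err != nil {
            return stsStatus{Message: err.Error()}
        }
        if sts.Status.ObservedGeneration != expectedGeneration || sts.Status.ObservedGeneration != sts.Generation {
            return stsStatus{Message: fmt.Sprintf("StatefulSet %s/%s has not observed generation %d yet", namespace, name, expectedGeneration)}
        }
        if sts.Spec.Replicas != nil && (sts.Status.ReadyReplicas != *sts.Spec.Replicas || sts.Status.UpdatedReplicas != *sts.Spec.Replicas) {
            return stsStatus{Message: fmt.Sprintf("StatefulSet %s/%s is not ready yet", namespace, name)}
        }
        return stsStatus{OK: true}
    }

The expectedGeneration value would come from the mutated StatefulSet returned by the create/update call, which is why the helper signatures were changed to return it.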
Stale agent-health-status.json file:
With spec.persistent = true the ${MMS_LOG_DIR} directory is mounted using a PVC. That means during pod recreation we are not losing any logs. But at the same time the agent-health-status.json file residing in the logs directory should not be preserved. This is problematic, because our readiness probe uses this file as the source of deployment status, and if it is stale we can mark the container as ready too early, while in fact it is still booting up. The solution is to delete agent-health-status.json when the agent boots up (in agent-launcher.sh).
Mongos recreated in all clusters at once:
Previously, mongos statefulsets were recreated immediately in all clusters. This was in opposition to the existing logic for shards and config-srv's, where the all-at-once update happened only during first scaling (first-time deployment). This could lead to extended downtime if all mongos were down.
Removed the workflow.OK.Requeue() method that could cause state flickering:
During the refactor I have removed two calls to workflow.OK.Requeue() where we should instead be using workflow.Pending.Requeue(). In the end I have removed workflow.OK.Requeue() completely.
Proof of Work
Added a new unit test, TestGetStatefulSetStatus, that checks the new expectedGeneration field.
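As an illustration of the scenario such a test guards against, here is a self-contained sketch of a generation-aware readiness assertion. The statefulSetReady helper, package name, and test name are hypothetical stand-ins, not the actual TestGetStatefulSetStatus from the PR:

    package readiness

    import (
        "testing"

        appsv1 "k8s.io/api/apps/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // statefulSetReady is a hypothetical stand-in for the generation-aware
    // readiness check added in this PR.
    func statefulSetReady(sts appsv1.StatefulSet, expectedGeneration int64) bool {
        return sts.Status.UpdatedReplicas == sts.Status.ReadyReplicas &&
            sts.Spec.Replicas != nil && sts.Status.ReadyReplicas == *sts.Spec.Replicas &&
            sts.Status.ObservedGeneration == sts.Generation &&
            sts.Status.ObservedGeneration == expectedGeneration
    }

    // TestStatefulSetReadyRejectsStaleGeneration shows the race the
    // expectedGeneration parameter protects against: a cached StatefulSet whose
    // status still reflects the previous generation must not be reported ready.
    func TestStatefulSetReadyRejectsStaleGeneration(t *testing.T) {
        replicas := int32(3)
        sts := appsv1.StatefulSet{
            ObjectMeta: metav1.ObjectMeta{Name: "mongos", Generation: 2},
            Spec:       appsv1.StatefulSetSpec{Replicas: &replicas},
            Status: appsv1.StatefulSetStatus{
                ObservedGeneration: 2, // the controller has only observed generation 2 so far
                ReadyReplicas:      3,
                UpdatedReplicas:    3,
            },
        }

        if statefulSetReady(sts, 3) {
            t.Fatal("stale observedGeneration must not be reported ready when generation 3 is expected")
        }
        if !statefulSetReady(sts, 2) {
            t.Fatal("StatefulSet should be ready once the expected generation has been observed")
        }
    }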
Checklist
- skip-changelog label if not needed