CLOUDP-286686 Fix issue with race conditions for creating statefulsets #746
Conversation
        return workflow.Failed(err)
    }
    if needToRequeue {
        return workflow.OK().Requeue()
Replaced the workflow.OK().Requeue() call with workflow.Pending().Requeue(). This is done by catching the create.StatefulSetIsRecreating error in mongodb-kubernetes/controllers/operator/mongodbopsmanager_controller.go, lines 825 to 834 in 96a6704:

    mutatedSts, err := r.createBackupDaemonStatefulset(ctx, reconcilerHelper, appDBConnectionString, memberCluster, initOpsManagerImage, opsManagerImage, log)
    if err != nil {
        // Check if it is a k8s error or a custom one
        var statefulSetIsRecreatingError create.StatefulSetIsRecreating
        if errors.As(err, &statefulSetIsRecreatingError) {
            return workflow.Pending("%s", statefulSetIsRecreatingError.Error()).Requeue()
        }
        return workflow.Failed(xerrors.Errorf("error creating Backup Daemon statefulset in member cluster %s: %w", memberCluster.Name, err))
    }
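For context, errors.As can only match create.StatefulSetIsRecreating here because it is a concrete error type that survives wrapping with %w. A minimal sketch of what such a type could look like (the actual definition in the create package is not shown in this diff and may differ):

    package create

    import "fmt"

    // StatefulSetIsRecreating signals that a StatefulSet is currently being
    // deleted and re-created, so reconciliation should be requeued rather than
    // treated as a failure. Hypothetical sketch; the field is illustrative.
    type StatefulSetIsRecreating struct {
        Name string
    }

    func (e StatefulSetIsRecreating) Error() string {
        return fmt.Sprintf("statefulset %s is being recreated, requeuing reconciliation", e.Name)
    }

Because the type implements the error interface, errors.As(err, &statefulSetIsRecreatingError) still matches it after the error has been wrapped with xerrors.Errorf("...: %w", err).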
    // GetFilePathFromAnnotationOrDefault returns a concatenation of a default path and an annotation, or a default value
    // if the annotation is not present.
    func GetFilePathFromAnnotationOrDefault(sts appsv1.StatefulSet, key string, path string, defaultValue string) string {
unused function
    fi

    # Remove stale health check file if exists
    rm -f "${MMS_LOG_DIR}/agent-health-status.json" 2>/dev/null || true
This is one of the important changes in the PR. With spec.persistent = true the ${MMS_LOG_DIR} directory is mounted using a PVC, which means that during pod recreation we are not losing any logs. At the same time the agent-health-status.json file, which is only relevant to the current container instance, is preserved. This is problematic, because our readiness probe uses this file as the source of deployment status, and if it is stale we can mark the container as ready too early, while in fact it is still booting up.
This line makes sure that we delete the ${MMS_LOG_DIR}/agent-health-status.json file whenever we start the agent, so it has a clean state. As an alternative I've considered moving ${MMS_LOG_DIR}/agent-health-status.json to an emptyDir volume in the container spec, but that would be a larger change. If you think that would be a better solution, feel free to add comments and I will create a ticket for improving the current mechanism.
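For illustration, the emptyDir alternative mentioned above could look roughly like this when building the pod spec. This is only a sketch, not the operator's code; the helper name, volume name, and mount path are hypothetical:

    import corev1 "k8s.io/api/core/v1"

    // addEphemeralHealthStatusVolume illustrates the emptyDir alternative: the
    // agent health file lives on a pod-scoped volume that disappears together
    // with the pod, so it can never be stale after recreation.
    // Hypothetical helper; names and mount path are illustrative only.
    func addEphemeralHealthStatusVolume(podSpec *corev1.PodSpec, container *corev1.Container, mountPath string) {
        podSpec.Volumes = append(podSpec.Volumes, corev1.Volume{
            Name:         "agent-health-status",
            VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
        })
        container.VolumeMounts = append(container.VolumeMounts, corev1.VolumeMount{
            Name:      "agent-health-status",
            MountPath: mountPath,
        })
    }

The trade-off mentioned above still applies: this touches the StatefulSet/pod spec rather than the launcher script, so it is a larger change than deleting the file at startup.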
As discussed in the meeting, we have 3 options:
- delete the file at the beginning of the launcher
- change the agent option to use a different path from the get-go
- create a new emptyDir and use that one instead
Generally, we want to be careful with any changes in the launcher as they are difficult to test and verify. We should decide together whether we want this solution for now and remove it later, or have something different from the get-go.
    isReady := s.updated == s.ready &&
        s.ready == s.wanted &&
        s.observedGeneration == s.generation &&
        s.observedGeneration == s.expectedGeneration &&
This is the main logic change for asserting StatefulSet readiness.
nammn left a comment:
Some quick comments, still looking at the rest. But generally LGTM, really important changes, especially the agent-health-status file and the generation fix.
changelog/20260204_fix_extended_unavailability_during_upgrade.md (outdated, resolved)
    if !statefulsetStatus.IsOK() {
        return statefulsetStatus
    }
So this is unnecessary? Won't that break the quicker initial deployment?
It might look like the reconciliation logic has changed, but in fact it didn't; only the code was simplified.
if len(processes) > 0 is equivalent to the if !scalingFirstTime check we have in other places (I will add this variable here, it makes sense to unify it with other controllers). So whenever we have processes (!scalingFirstTime), we want to check the statefulset statuses one by one.
In the opposite scenario, where we are scaling for the first time, we create all sts'es at once and check the overall, merged status of all sts'es:

        // [some code]
        statefulsetStatus := statefulset.GetStatefulSetStatus(ctx, sts.Namespace, sts.Name, expectedGeneration, memberClient)
        workflowStatus = workflowStatus.Merge(statefulsetStatus)
    }
    // wait for all statefulsets to become ready
    if !workflowStatus.IsOK() {
        return workflowStatus

This is the exact same behaviour as previously, but coded in a simpler way.
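To make the two paths easier to compare, here is a compressed, self-contained sketch of the control flow described above. The status type, deployOne callback, and clusters slice are stand-ins for the operator's workflow.Status and per-cluster deployment logic, not the actual code:

    // status is a minimal stand-in for the operator's workflow.Status.
    type status struct {
        ok  bool
        msg string
    }

    func (s status) IsOK() bool { return s.ok }

    // Merge keeps the "worst" of two statuses: a not-OK status wins.
    func (s status) Merge(other status) status {
        if !s.ok {
            return s
        }
        return other
    }

    // deployAll sketches the two reconciliation paths: one-at-a-time for an
    // existing deployment, all-at-once (with a merged status) for the first
    // scaling. deployOne and clusters are hypothetical.
    func deployAll(clusters []string, scalingFirstTime bool, deployOne func(string) status) status {
        if !scalingFirstTime {
            // Existing deployment: stop at the first StatefulSet that is not
            // ready yet, so only one member cluster rolls at a time.
            for _, c := range clusters {
                if st := deployOne(c); !st.IsOK() {
                    return st
                }
            }
            return status{ok: true}
        }
        // First-time deployment: create everything, then wait on the merged
        // status of all StatefulSets before reporting the resource as ready.
        merged := status{ok: true}
        for _, c := range clusters {
            merged = merged.Merge(deployOne(c))
        }
        return merged
    }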
Julien-Ben left a comment:
Great PR 👏
Let's discuss the agent-launcher change before merging, but LGTM.
        log.Infof("Successfully ensured StatefulSet in cluster: %s", item.ClusterName)
    } else {
        // We create all sts in parallel and wait below for all of them to finish
Nice cleanup, I like that we're getting a bit closer to a proper state machine design
    @@ -1349,17 +1348,35 @@ func (r *ShardedClusterReconcileHelper) createKubernetesResources(ctx context.Co
    }

    func (r *ShardedClusterReconcileHelper) createOrUpdateMongos(ctx context.Context, s *mdbv1.MongoDB, opts deploymentOptions, log *zap.SugaredLogger) workflow.Status {
Very good catch!
docker/mongodb-kubernetes-init-database/content/agent-launcher.sh (outdated, resolved)
changelog/20260204_fix_extended_unavailability_during_upgrade.md (outdated, resolved)
nammn left a comment:
LGTM, not blocking on the agent-launcher topic, but we should decide on Slack how we want to handle this short term.
    ---
    kind: fix
    date: 2026-02-04
    ---

    * Fixed `Statefulset` update logic that might result in triggering rolling restart in more than one member cluster at a time.
LGTM!
Summary
This pull request introduces several important fixes and improvements to the operator's StatefulSet management logic, particularly around multi-cluster and sharded deployments. The main focus is on making StatefulSet creation and updates more robust, improving error handling, and ensuring more reliable reconciliation during upgrades and migrations.
Several controller and helper function signatures have been updated to return the mutated StatefulSet objects that are required for the meta.generation and status.observedGeneration comparison.

GetStatefulSetStatus race condition fix:
The GetStatefulSetStatus() method signature now accepts an expectedGeneration param. Previously, when the StatefulSet was updated, we didn't compare the returned meta.generation with status.observedGeneration when calling GetStatefulSetStatus(). This led to a race condition if GetStatefulSetStatus() returned a stale resource with a previous generation.
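To show what the generation guard buys us, here is a minimal, self-contained sketch of a generation-aware status check. It is not the operator's GetStatefulSetStatus implementation; the result type and function name are stand-ins, and only the comparison logic mirrors the fix:

    import (
        "context"
        "fmt"

        appsv1 "k8s.io/api/apps/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // stsStatus is a hypothetical stand-in for the operator's workflow.Status.
    type stsStatus struct {
        OK      bool
        Message string
    }

    // getStatefulSetStatus sketches the race-condition fix: even if the (possibly
    // cached) read returns a StatefulSet whose status predates the latest update,
    // the expectedGeneration comparison prevents reporting it as ready.
    func getStatefulSetStatus(ctx context.Context, c client.Client, namespace, name string, expectedGeneration int64) stsStatus {
        var sts appsv1.StatefulSet
        if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &sts); err != nil {
            return stsStatus{Message: err.Error()}
        }
        if sts.Status.ObservedGeneration != expectedGeneration || sts.Status.ObservedGeneration != sts.Generation {
            return stsStatus{Message: fmt.Sprintf("StatefulSet %s/%s has not observed generation %d yet", namespace, name, expectedGeneration)}
        }
        if sts.Spec.Replicas != nil && (sts.Status.ReadyReplicas != *sts.Spec.Replicas || sts.Status.UpdatedReplicas != *sts.Spec.Replicas) {
            return stsStatus{Message: fmt.Sprintf("StatefulSet %s/%s is not ready yet", namespace, name)}
        }
        return stsStatus{OK: true}
    }

The expectedGeneration value would come from the mutated StatefulSet returned by the create/update call, which is why the helper signatures were changed to return it.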
Stale agent-health-status.json file:
With spec.persistent = true the ${MMS_LOG_DIR} directory is mounted using a PVC. That means during pod recreation we are not losing any logs. But at the same time the agent-health-status.json file residing in the logs directory should not be preserved. This is problematic, because our readiness probe uses this file as the source of deployment status, and if it is stale we can mark the container as ready too early, while in fact it is still booting up. The solution is to delete agent-health-status.json when the agent boots up (in agent-launcher.sh).
Mongos recreated in all clusters at once:
Previously, mongos statefulsets were recreated immediately in all clusters. This was in opposition to the existing logic for shards and config-srv's, where the all-at-once update happened only during first scaling (first-time deployment). This could lead to extended downtime if all mongos were down.
Removed the workflow.OK.Requeue() method that could cause state flickering:
During the refactor I have removed two calls to workflow.OK.Requeue() where we should instead be using workflow.Pending.Requeue(). In the end I have removed workflow.OK.Requeue() completely.
Proof of Work
Added a new unit test, TestGetStatefulSetStatus, that checks the new expectedGeneration field.
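As an illustration of the scenario such a test guards against, here is a self-contained sketch of a generation-aware readiness assertion. The statefulSetReady helper, package name, and test name are hypothetical stand-ins, not the actual TestGetStatefulSetStatus from the PR:

    package readiness

    import (
        "testing"

        appsv1 "k8s.io/api/apps/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // statefulSetReady is a hypothetical stand-in for the generation-aware
    // readiness check added in this PR.
    func statefulSetReady(sts appsv1.StatefulSet, expectedGeneration int64) bool {
        return sts.Status.UpdatedReplicas == sts.Status.ReadyReplicas &&
            sts.Spec.Replicas != nil && sts.Status.ReadyReplicas == *sts.Spec.Replicas &&
            sts.Status.ObservedGeneration == sts.Generation &&
            sts.Status.ObservedGeneration == expectedGeneration
    }

    // TestStatefulSetReadyRejectsStaleGeneration shows the race the
    // expectedGeneration parameter protects against: a cached StatefulSet whose
    // status still reflects the previous generation must not be reported ready.
    func TestStatefulSetReadyRejectsStaleGeneration(t *testing.T) {
        replicas := int32(3)
        sts := appsv1.StatefulSet{
            ObjectMeta: metav1.ObjectMeta{Name: "mongos", Generation: 2},
            Spec:       appsv1.StatefulSetSpec{Replicas: &replicas},
            Status: appsv1.StatefulSetStatus{
                ObservedGeneration: 2, // the controller has only observed generation 2 so far
                ReadyReplicas:      3,
                UpdatedReplicas:    3,
            },
        }

        if statefulSetReady(sts, 3) {
            t.Fatal("stale observedGeneration must not be reported ready when generation 3 is expected")
        }
        if !statefulSetReady(sts, 2) {
            t.Fatal("StatefulSet should be ready once the expected generation has been observed")
        }
    }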
Checklist
- skip-changelog label if not needed