
Conversation

Collaborator

@MaciejKaras MaciejKaras commented Feb 3, 2026

Summary

This pull request introduces several important fixes and improvements to the operator's StatefulSet management logic, particularly around multi-cluster and sharded deployments. The main focus is on making StatefulSet creation and updates more robust, improving error handling, and ensuring more reliable reconciliation during upgrades and migrations.

Several controller and helper function signatures have been updated to return the mutated StatefulSet objects, which are required for comparing meta.generation with status.observedGeneration.

GetStatefulSetStatus race condition fix:

The GetStatefulSetStatus() method signature now accepts an expectedGeneration parameter. Previously, when the StatefulSet was updated we did not compare the returned meta.generation with status.observedGeneration when calling GetStatefulSetStatus(). This led to a race condition if GetStatefulSetStatus() returned a stale resource with a previous generation.
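For illustration, the readiness condition now boils down to the following kind of check (a minimal sketch using standard appsv1 fields; isStatefulSetReady is a hypothetical helper, not the operator's actual function, though the comparisons mirror the snippet reviewed further down in this PR):

```go
package example

import appsv1 "k8s.io/api/apps/v1"

// isStatefulSetReady sketches the generation-aware readiness check: besides the
// replica counts, the observedGeneration reported in the status must match both
// the generation of the object we read back and the expectedGeneration taken
// from the mutated StatefulSet returned by the update call. A stale cache read
// from a previous generation therefore no longer counts as ready.
func isStatefulSetReady(sts appsv1.StatefulSet, expectedGeneration int64) bool {
	return sts.Status.UpdatedReplicas == sts.Status.ReadyReplicas &&
		sts.Status.ReadyReplicas == *sts.Spec.Replicas &&
		sts.Status.ObservedGeneration == sts.Generation &&
		sts.Status.ObservedGeneration == expectedGeneration
}
```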

Stale agent-health-status.json file:

With spec.persistent = true the ${MMS_LOG_DIR} directory is mounted using a PVC, so no logs are lost during pod recreation. However, the agent-health-status.json file residing in the logs directory should not be preserved. This is problematic because our readiness probe uses this file as the source of deployment status, and if it is stale we may mark the container as ready while it is in fact still booting up. The solution is to delete agent-health-status.json when the agent boots up (in agent-launcher.sh).

Mongos recreated in all clusters at once:

Previously, mongos StatefulSets were recreated immediately in all clusters. This contradicted the existing logic for shards and config servers, where the all-at-once update happens only during the first scaling (initial deployment). It could lead to extended downtime if all mongos processes were down at the same time.
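A rough sketch of the intended flow (function names and parameters below are illustrative placeholders, not the operator's actual helpers):

```go
package example

import "fmt"

// updateMongosStatefulSets sketches the new behaviour: on the very first
// deployment all member clusters are handled at once, but on subsequent
// updates we stop and requeue after the first cluster whose StatefulSet is
// not ready yet, so mongos pods are never restarted in every cluster at once.
func updateMongosStatefulSets(
	clusters []string,
	scalingFirstTime bool,
	applySts func(cluster string) error, // create/update the StatefulSet in one cluster
	stsReady func(cluster string) bool, // true once that StatefulSet is fully ready
) error {
	for _, cluster := range clusters {
		if err := applySts(cluster); err != nil {
			return err
		}
		if scalingFirstTime {
			// Initial deployment: create everything, readiness of all clusters
			// is checked together afterwards.
			continue
		}
		if !stsReady(cluster) {
			// Rolling update: wait for this cluster before touching the next one.
			return fmt.Errorf("statefulset in cluster %s is not ready yet, requeueing", cluster)
		}
	}
	return nil
}
```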

Removed the workflow.OK().Requeue() method that could cause state flickering

During the refactor I removed two calls to workflow.OK().Requeue() where we should instead be using workflow.Pending().Requeue(). In the end I removed workflow.OK().Requeue() completely.

Proof of Work

Added a new unit test, TestGetStatefulSetStatus, that checks the new expectedGeneration field.
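The shape of the check is roughly as follows (this table-driven sketch is illustrative only and is not the body of TestGetStatefulSetStatus):

```go
package example

import "testing"

// TestGenerationAwareReadiness sketches the property the new test asserts: a
// StatefulSet whose observedGeneration lags behind the expected generation must
// not be reported as ready, even if the replica counts already match.
func TestGenerationAwareReadiness(t *testing.T) {
	cases := []struct {
		name                                                string
		generation, observedGeneration, expectedGeneration int64
		wantReady                                           bool
	}{
		{"up to date", 2, 2, 2, true},
		{"stale status from the previous generation", 2, 1, 2, false},
		{"cache returned an old object entirely", 1, 1, 2, false},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			ready := c.observedGeneration == c.generation && c.observedGeneration == c.expectedGeneration
			if ready != c.wantReady {
				t.Errorf("got ready=%v, want %v", ready, c.wantReady)
			}
		})
	}
}
```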

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?


github-actions bot commented Feb 3, 2026

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.7.0 Release Notes

New Features

  • Allows users to override any Ops Manager emptyDir mount with their own PVCs via the statefulSet.spec.volumeClaimTemplates override.
  • Added support for auto embeddings in MongoDB Community to automatically generate vector embeddings for vector search data. See this document for detailed documentation.
  • MongoDBSearch: Updated the default mongodb/mongodb-search image version to 0.60.1. This is the version MCK uses if .spec.version is not specified.
  • Added support for configurable ValidatingWebhookConfiguration name via operator.webhook.name helm value.

Bug Fixes

  • Fixed an issue where the operator would crash when securityContext.readOnlyRootFilesystem=true was set in the helm chart values. The operator now creates an emptyDir volume for the webhook certificate.
  • Fixed an issue to ensure that hosts are consistently removed from Ops Manager monitoring during AppDB scale-down events.
  • Fixed an issue where monitoring agents would fail after disabling TLS on a MongoDB deployment.
  • Persistent Volume Claim resize fix: Fixed an issue where the Operator ignored namespaces when listing PVCs, causing conflicts with resizing PVCs of the same name. Now, PVCs are filtered by both name and namespace for accurate resizing.
  • Fixed a panic that occurred when the domain names for a horizon were empty. Now, if the domain names are not valid (RFC 1123), validation fails before reconciling.
  • MongoDBMultiCluster, MongoDB: Fixed an issue where the operator skipped host removal when an external domain was used, leaving monitoring hosts in Ops Manager even after workloads were correctly removed from the cluster.
  • Fixed the non-deterministic order of generated topologySpreadConstraints (the classic Go pitfall of iterating over a map).
  • Fixed an issue where the Operator could crash when TLS certificates are configured using the certificatesSecretsPrefix field without additional TLS settings.
  • MongoDBOpsManager, AppDB: Block removing a member cluster while it still has non-zero members. This prevents scaling down without the preserved configuration and avoids unexpected issues.
  • Fixed an issue where redeploying a MongoDB resource after deletion could fail with 409 "version not available" errors due to stale agent credentials in Ops Manager.
  • Fixed Statefulset update logic that could trigger a rolling restart in more than one member cluster at a time.

	return workflow.Failed(err)
}
if needToRequeue {
	return workflow.OK().Requeue()
Collaborator Author

@MaciejKaras MaciejKaras Feb 4, 2026


Replaced the workflow.OK().Requeue() call with workflow.Pending().Requeue(). This is done by catching the create.StatefulSetIsRecreating error in:

mutatedSts, err := r.createBackupDaemonStatefulset(ctx, reconcilerHelper, appDBConnectionString, memberCluster, initOpsManagerImage, opsManagerImage, log)
if err != nil {
	// Check if it is a k8s error or a custom one
	var statefulSetIsRecreatingError create.StatefulSetIsRecreating
	if errors.As(err, &statefulSetIsRecreatingError) {
		return workflow.Pending("%s", statefulSetIsRecreatingError.Error()).Requeue()
	}
	return workflow.Failed(xerrors.Errorf("error creating Backup Daemon statefulset in member cluster %s: %w", memberCluster.Name, err))
}


// GetFilePathFromAnnotationOrDefault returns a concatenation of a default path and an annotation, or a default value
// if the annotation is not present.
func GetFilePathFromAnnotationOrDefault(sts appsv1.StatefulSet, key string, path string, defaultValue string) string {
Collaborator Author

@MaciejKaras MaciejKaras Feb 4, 2026


unused function

fi

# Remove stale health check file if exists
rm -f "${MMS_LOG_DIR}/agent-health-status.json" 2>/dev/null || true
Collaborator Author


This is one of the important changes in the PR. With spec.persistent = true the ${MMS_LOG_DIR} directory is mounted using a PVC, so no logs are lost during pod recreation. At the same time, the agent-health-status.json file, which is only relevant to the current container instance, is also preserved. This is problematic because our readiness probe uses this file as the source of deployment status, and if it is stale we may mark the container as ready while it is in fact still booting up.

This line makes sure that we delete the ${MMS_LOG_DIR}/agent-health-status.json file whenever we start the agent, so it has a clean state. As an alternative I considered moving ${MMS_LOG_DIR}/agent-health-status.json to an emptyDir in the container spec, but this would be a larger change. If you think that would be a better solution, feel free to add comments and I will create a ticket for improving the current mechanism.
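For reference, the emptyDir alternative would look roughly like this in the pod template (the volume name and mount path below are assumptions for illustration, not part of this PR):

```go
package example

import corev1 "k8s.io/api/core/v1"

// healthStatusVolume sketches the alternative mentioned above: keep the agent
// health status file on a pod-lifetime emptyDir instead of the persistent logs
// volume, so it disappears automatically whenever the pod is recreated.
func healthStatusVolume() (corev1.Volume, corev1.VolumeMount) {
	volume := corev1.Volume{
		Name: "agent-health-status", // hypothetical volume name
		VolumeSource: corev1.VolumeSource{
			EmptyDir: &corev1.EmptyDirVolumeSource{},
		},
	}
	mount := corev1.VolumeMount{
		Name: "agent-health-status",
		// assumed path; the agent would also need to be pointed at this location
		MountPath: "/var/lib/automation/agent-health",
	}
	return volume, mount
}
```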

Collaborator


As discussed in the meeting, we have 3 options:

  • delete the file at startup
  • change the agent option to use a different path from the get-go
  • create a new emptyDir and use that one instead

Generally, we want to be careful with any changes in the launcher as they are difficult to test and verify. We should decide together whether we want this solution for now and remove it later, or have something different from the get-go.

@MaciejKaras MaciejKaras marked this pull request as ready for review February 4, 2026 17:32
@MaciejKaras MaciejKaras requested review from a team and vinilage as code owners February 4, 2026 17:32
@MaciejKaras MaciejKaras requested review from igor-karpukhin, lsierant and nammn and removed request for igor-karpukhin February 4, 2026 17:32
isReady := s.updated == s.ready &&
	s.ready == s.wanted &&
	s.observedGeneration == s.generation &&
	s.observedGeneration == s.expectedGeneration &&
Collaborator Author


This is the main logic change for asserting StatefulSet readiness.

Collaborator

@nammn nammn left a comment


some quick comments, still looking at the rest. But generally LGTM, really important changes. Especially the agent-health-status file and the generation fix

if !statefulsetStatus.IsOK() {
	return statefulsetStatus
}

Contributor


So this is unnecessary? Won't that break the quicker initial deployment?

Collaborator Author


It might look like the reconciliation logic has changed, but in fact it didn't; the code was just simplified.

if len(processes) > 0 is equivalent to the if !scalingFirstTime we have in other places (I will add this variable here; it makes sense to unify it with other controllers). So whenever we have processes (!scalingFirstTime) we want to check the StatefulSet statuses one by one.

In the opposite scenario, where we are scaling for the first time, we create all sts'es at once and check the overall, merged status of all of them:

    // [some code]
    statefulsetStatus := statefulset.GetStatefulSetStatus(ctx, sts.Namespace, sts.Name, expectedGeneration, memberClient)
    
    workflowStatus = workflowStatus.Merge(statefulsetStatus)
}

// wait for all statefulsets to become ready
if !workflowStatus.IsOK() {
    return workflowStatus

This is the exact same behaviour as before, just coded in a simpler way.

Collaborator

@Julien-Ben Julien-Ben left a comment


Great PR 👏
Let's discuss the agent-launcher change before merging, but LGTM


	log.Infof("Successfully ensured StatefulSet in cluster: %s", item.ClusterName)
} else {
	// We create all sts in parallel and wait below for all of them to finish
Collaborator


Nice cleanup, I like that we're getting a bit closer to a proper state machine design

@@ -1349,17 +1348,35 @@ func (r *ShardedClusterReconcileHelper) createKubernetesResources(ctx context.Co
}

func (r *ShardedClusterReconcileHelper) createOrUpdateMongos(ctx context.Context, s *mdbv1.MongoDB, opts deploymentOptions, log *zap.SugaredLogger) workflow.Status {
Collaborator


Very good catch!

Collaborator

@nammn nammn left a comment


LGTM, not blocking on the agent launcher topic, but we should decide on Slack how we want to handle this short term.

Comment on lines +1 to +6
---
kind: fix
date: 2026-02-04
---

* Fixed `Statefulset` update logic that could trigger a rolling restart in more than one member cluster at a time.
Collaborator


LGTM!

@MaciejKaras MaciejKaras merged commit 77cfee0 into master Feb 9, 2026
32 of 34 checks passed
@MaciejKaras MaciejKaras deleted the maciejk/CLOUDP-286686 branch February 9, 2026 11:00