
Re-recruit backup workers to avoid recoveries #12669

Open
jzhou77 wants to merge 3 commits into apple:main from jzhou77:fix-nightly

Conversation

@jzhou77 (Contributor) commented Feb 2, 2026

Similar to #12558, this PR changes how backup workers are monitored and recruited in order to avoid recoveries. After the cluster reaches the fully_recovered state, a new actor, monitorAndRecruitBackupWorkers, takes over monitoring and re-recruitment of backup workers for the current generation. Note that before fully_recovered, monitoring is still done in the original TagPartitionedLogSystem::onError_internal() actor. Monitoring backup workers of older generations is possible, but is complicated by the fact that the number of log routers can be modified via a configuration change. Even so, this PR should be strictly better in terms of availability: in most cases backup workers are running while the cluster is in the fully_recovered state, and reducing recoveries improves availability.

(This PR supersedes #12564)

Details of the implementation:

  • During normal operation: backup workers periodically call saveProgress() to persist their progress to the database under the key backupProgressKeyFor(workerID), with a value containing {epoch, version, tag, totalTags}.

  • When a backup worker fails:

    • monitorAndRecruitBackupWorkers() in the ClusterController detects the failure
    • It calls recruitFailedBackupWorkers(), which creates an InitializeBackupRequest with:
      • isReplacement = true
      • startVersion = 0 (the real start version is determined by the backup worker)
      • backupEpoch and routerTag identifying which worker failed
  • When the new backup worker starts (see the sketch after this list):

    • It checks if (req.isReplacement && req.startVersion == 0)
    • It calls getSavedVersion(cx, workerID, backupEpoch, routerTag)
    • getSavedVersion scans all backup progress entries to find the matching tag's savedVersion
    • The worker resumes from savedVersion + 1
  • Refactor log router and backup worker re-recruitment so that they share the same monitorAndRecruitWorkerSet() logic for monitoring and re-recruiting after fully_recovered.

  • This feature is controlled by the knob CC_RERECRUIT_BACKUP_WORKER_ENABLED, which defaults to on for more test coverage.
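
As a concrete illustration of the resume path above, here is a minimal flow-style sketch reconstructed from this description. The SystemData helpers (backupProgressKeys, decodeBackupProgressValue, WorkerBackupStatus) are pre-existing; the body of getSavedVersion and the exact request field names are assumptions, not the PR's literal code.

// Hypothetical reconstruction: scan every worker's saved progress entry and
// return the version saved for this (epoch, tag) identity.
ACTOR Future<Version> getSavedVersion(Database cx, UID workerID, LogEpoch backupEpoch, Tag routerTag) {
	state Transaction tr(cx);
	loop {
		try {
			tr.setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS);
			tr.setOption(FDBTransactionOptions::LOCK_AWARE);
			RangeResult progress = wait(tr.getRange(backupProgressKeys, CLIENT_KNOBS->TOO_MANY));
			for (auto& kv : progress) {
				// Value layout per the description: { epoch, version, tag, totalTags }.
				WorkerBackupStatus status = decodeBackupProgressValue(kv.value);
				if (status.epoch == backupEpoch && status.tag == routerTag) {
					return status.version; // highest version already durable in backup files
				}
			}
			return invalidVersion; // no saved progress; caller falls back to the epoch's begin version
		} catch (Error& e) {
			wait(tr.onError(e));
		}
	}
}

// In the replacement worker's startup path (inside the backup worker actor):
if (req.isReplacement && req.startVersion == 0) {
	Version saved = wait(getSavedVersion(cx, workerID, req.backupEpoch, req.routerTag));
	startVersion = saved + 1; // resume right after the last version known to be saved
}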

20260203-063209-jzhou-e78f80885b62b776 compressed=True data_size=34604030 duration=4280730 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=2:34:23 sanity=False started=100000 stopped=20260203-090632 submitted=20260203-063209 timeout=5400 username=jzhou

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

.detail("Version", savedVersion);
return;
}
ASSERT_WE_THINK(backupEpoch == oldestBackupEpoch);

@jzhou77 (author) commented:

There could be a new recovery, in which case the worker is popping from a previous epoch, so this assertion doesn't hold. The pop() is always safe to do, because the mutations have already been saved.
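
For context, here is a sketch of the pop once the assertion is removed, assuming the ILogSystem::pop(upTo, tag) shape and the AsyncVar-wrapped log system the backup worker uses; the exact call site in the PR may differ.

// A newer recovery can advance oldestBackupEpoch past this worker's backupEpoch
// while the worker is still draining, so backupEpoch == oldestBackupEpoch no
// longer always holds. Popping remains safe regardless: every version up to
// savedVersion has already been persisted to backup files, so nothing that is
// not yet durable can be discarded.
if (self->logSystem.get()) {
	self->logSystem.get()->pop(savedVersion, self->popTag);
}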

@foundationdb-ci

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: fcd0c5e
  • Duration 0:04:16
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: fcd0c5e
  • Duration 0:04:22
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: fcd0c5e
  • Duration 0:04:25
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: fcd0c5e
  • Duration 0:04:23
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: fcd0c5e
  • Duration 0:04:25
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: fcd0c5e
  • Duration 0:07:19
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: fcd0c5e
  • Duration 0:07:32
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 3e59307
  • Duration 0:07:15
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 3e59307
  • Duration 0:07:17
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 3e59307
  • Duration 0:22:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 3e59307
  • Duration 0:42:35
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 3e59307
  • Duration 0:43:46
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 3e59307
  • Duration 0:50:56
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 3e59307
  • Duration 1:58:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: ed7ea10
  • Duration 0:07:45
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: ed7ea10
  • Duration 0:09:09
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: ed7ea10
  • Duration 0:36:34
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: ed7ea10
  • Duration 0:44:00
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: ed7ea10
  • Duration 1:16:06
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: ed7ea10
  • Duration 1:21:44
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: ed7ea10
  • Duration 1:59:56
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)


void updateBackupWorkers(const std::vector<int>& tagIds, const std::vector<InitializeBackupReply>& replies) final;

bool removeBackupWorker(const BackupWorkerDoneRequest& req) final;

A reviewer (Contributor) commented:

When we remove a backup worker (either because a backup completes or because an older epoch has been drained), do you also remove the backup progress key associated with that worker?

If I follow correctly, right now the keys live forever unless takeover() runs (which only happens when we re-recruit), so finished backups leave stale entries behind.

If so, it would be good to remove them in removeBackupWorker; that would declutter the range read in takeover().
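
A minimal sketch of the suggested cleanup, using the existing backupProgressKeyFor() helper; the name clearBackupProgress is hypothetical, and wiring it into removeBackupWorker is the reviewer's suggestion sketched out, not code in the PR.

// Clear a finished worker's progress entry so that takeover()'s range read
// over backupProgressKeys does not accumulate stale keys.
ACTOR Future<Void> clearBackupProgress(Database cx, UID workerID) {
	state Transaction tr(cx);
	loop {
		try {
			tr.setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS);
			tr.clear(backupProgressKeyFor(workerID));
			wait(tr.commit());
			return Void();
		} catch (Error& e) {
			wait(tr.onError(e));
		}
	}
}

Calling this from removeBackupWorker() once the worker is confirmed done would keep the progress range proportional to the set of live workers.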
