Re-recruit backup workers to avoid recoveries #12669
Open
jzhou77 wants to merge 3 commits into apple:main from
Conversation
Similar to apple#12558, this PR changes how backup workers are monitored and recruited in order to avoid recoveries. After the fully_recovered state, a new actor, monitorAndRecruitBackupWorkers, takes over monitoring and re-recruitment of the current generation's backup workers. Note that before the fully_recovered state, monitoring is still done in the original TagPartitionedLogSystem::onError_internal() actor. Monitoring backup workers of older generations is possible, but is complicated by the fact that the number of log routers can be changed via a configuration change. Even so, this PR should be strictly better in terms of availability, since in most cases backup workers are running in the fully_recovered state, and reducing recoveries improves availability.

(This PR supersedes #12564)

Details of the implementation:

- During normal operation: backup workers periodically call saveProgress() to save their progress to the database under key backupProgressKeyFor(workerID), with a value containing {epoch, version, tag, totalTags}.
- When a backup worker fails:
  - monitorAndRecruitBackupWorkers() in the ClusterController detects the failure.
  - It calls recruitFailedBackupWorkers(), which creates an InitializeBackupRequest with:
    - isReplacement = true
    - startVersion = 0 (to be determined by the backup worker)
    - backupEpoch and routerTag identifying which worker failed
- When the new backup worker starts (see the sketch below), it:
  - Checks if (req.isReplacement && req.startVersion == 0).
  - Calls getSavedVersion(cx, workerID, backupEpoch, routerTag).
  - Scans all backup progress entries to find the matching tag's savedVersion.
  - Resumes from savedVersion + 1.
- Refactors log router and backup worker re-recruitment so that they share the same monitorAndRecruitWorkerSet() logic for monitoring and re-recruiting after fully_recovered.
- This feature is controlled by the knob CC_RERECRUIT_BACKUP_WORKER_ENABLED; the default is on for more test coverage.

20260202-214556-jzhou-3737a794ed411f46
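To make the resume path concrete, here is a minimal, self-contained sketch of the "scan progress entries, resume from savedVersion + 1" step described above. ProgressEntry and findResumeVersion are hypothetical names, not the PR's actual code; the real logic lives in getSavedVersion() and reads the backupProgressKeyFor(...) entries from the database inside an actor.

```cpp
// Illustrative sketch only: models how a replacement worker could pick its
// start version from saved progress. Not the FDB actor code from this PR.
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

struct ProgressEntry {
	int64_t epoch;        // backup epoch the saving worker belonged to
	uint16_t tagId;       // router tag the saving worker was assigned
	int64_t savedVersion; // highest version whose mutations were saved
};

// Returns the version a replacement worker for {backupEpoch, routerTag} should
// resume from, or nothing if no progress was ever saved for that tag.
std::optional<int64_t> findResumeVersion(const std::vector<ProgressEntry>& entries,
                                         int64_t backupEpoch,
                                         uint16_t routerTag) {
	std::optional<int64_t> saved;
	for (const auto& e : entries) {
		if (e.epoch == backupEpoch && e.tagId == routerTag) {
			saved = std::max(saved.value_or(e.savedVersion), e.savedVersion);
		}
	}
	if (saved.has_value()) {
		return *saved + 1;   // everything up to savedVersion is already durable
	}
	return std::nullopt;     // no progress: caller falls back to the epoch's begin version
}
```

Per the description, the worker only takes this path when req.isReplacement is true and req.startVersion == 0, i.e. when the ClusterController leaves the start version for the replacement worker to determine.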
jzhou77 commented on Feb 2, 2026:
.detail("Version", savedVersion);
return;
}
ASSERT_WE_THINK(backupEpoch == oldestBackupEpoch);
There could be a new recovery, in which case the worker is popping from a previous epoch. As a result, this assertion doesn't hold. The pop() is always safe to do because the mutations have already been saved.
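As a sketch of why the removed assertion isn't needed: the worker makes its progress durable before popping, so popping a previous epoch's tag can never discard unsaved mutations. BackupStore and LogSystemHandle below are assumed stand-ins, not FDB's real interfaces.

```cpp
// Simplified illustration of the save-then-pop ordering (hypothetical types).
#include <cstdint>

struct BackupStore {
	// Persists {epoch, savedVersion} for this worker's tag and returns only after
	// the write is durable (modeled after saveProgress() from the PR description).
	void saveProgress(int64_t epoch, int64_t savedVersion) { /* durable write */ }
};

struct LogSystemHandle {
	// Discards log data for this worker's tag up to `version`.
	void pop(int64_t version) { /* discard up to version */ }
};

void checkpointAndPop(BackupStore& store,
                      LogSystemHandle& logSystem,
                      int64_t epoch,
                      int64_t savedVersion) {
	store.saveProgress(epoch, savedVersion); // 1) progress is durable first
	logSystem.pop(savedVersion);             // 2) pop afterwards: even if a new
	                                         //    recovery started and `epoch` is
	                                         //    now an older epoch, the popped
	                                         //    mutations were already saved
}
```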
CI results:
- foundationdb-pr-clang-arm on Linux CentOS 7
- foundationdb-pr-clang-ide on Linux RHEL 9
- foundationdb-pr on Linux RHEL 9
- foundationdb-pr-clang on Linux RHEL 9
- foundationdb-pr-cluster-tests on Linux RHEL 9
- foundationdb-pr-macos on macOS Ventura 13.x
- foundationdb-pr-macos-m1 on macOS Ventura 13.x
CI results:
- foundationdb-pr-macos-m1 on macOS Ventura 13.x
- foundationdb-pr-macos on macOS Ventura 13.x
- foundationdb-pr-clang-ide on Linux RHEL 9
- foundationdb-pr on Linux RHEL 9
- foundationdb-pr-clang-arm on Linux CentOS 7
- foundationdb-pr-clang on Linux RHEL 9
- foundationdb-pr-cluster-tests on Linux RHEL 9
20260203-063209-jzhou-e78f80885b62b776 compressed=True data_size=34604030 duration=4280730 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=2:34:23 sanity=False started=100000 stopped=20260203-090632 submitted=20260203-063209 timeout=5400 username=jzhou
CI results:
- foundationdb-pr-macos-m1 on macOS Ventura 13.x
- foundationdb-pr-macos on macOS Ventura 13.x
- foundationdb-pr-clang-ide on Linux RHEL 9
- foundationdb-pr-clang-arm on Linux CentOS 7
- foundationdb-pr on Linux RHEL 9
- foundationdb-pr-clang on Linux RHEL 9
- foundationdb-pr-cluster-tests on Linux RHEL 9
dlambrig reviewed on Feb 6, 2026:
void updateBackupWorkers(const std::vector<int>& tagIds, const std::vector<InitializeBackupReply>& replies) final;
bool removeBackupWorker(const BackupWorkerDoneRequest& req) final;
When we remove a backup worker (either because a backup completes or because an older epoch has been drained), do you also remove the backup progress key associated with that worker?
If I follow correctly, right now the keys live forever unless takeover() runs (which only happens when we re-recruit), so finished backups leave stale entries behind.
If so, it would be good to remove them in removeBackupWorker; that would declutter the read range in takeover().
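A minimal sketch of what this suggestion could look like, assuming simplified stand-ins for the fdbclient types (Key, Transaction) and an illustrative backupProgressKeyFor() layout; this is not code from the PR.

```cpp
// Sketch of clearing a finished worker's progress entry, as suggested above.
// Key, UID, Transaction, and the key layout are simplified stand-ins.
#include <map>
#include <string>

using Key = std::string;
using UID = std::string;

// Illustrative key layout only; the real helper comes from fdbclient.
Key backupProgressKeyFor(const UID& workerID) {
	return "backupProgress/" + workerID;
}

struct Transaction {
	std::map<Key, std::string>& kv; // stands in for the database
	void clear(const Key& k) { kv.erase(k); }
};

// Called when a backup worker is removed (backup finished or old epoch drained),
// so stale progress entries don't accumulate in the range takeover() scans.
void clearBackupProgress(Transaction& tr, const UID& workerID) {
	tr.clear(backupProgressKeyFor(workerID));
}
```

In the real code this would presumably run inside the usual transaction retry loop, alongside the other cleanup done by removeBackupWorker().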