Backport: Add ERS support for ignoring minority lagging tablets (PR #18707) #797
Closed
sbaker617 wants to merge 2 commits into slack-22.0
Backports upstream PR vitessio#18707 to enable EmergencyReparentShard operations to ignore a minority of severely lagging tablets that would otherwise cause the entire reparent to time out.
This adds a WaitForRelayLogsMode enum with three modes:
- ALL: Wait for all tablets (current behavior, default)
- MAJORITY: Wait only for a majority of the most-advanced tablets
- COUNT: Wait for the exact number of tablets specified by the operator
The feature is fully backward compatible: without configuration changes, the system behaves exactly as before (ALL mode).
Changes:
- Added WaitForRelayLogsMode enum to protobuf definitions
- Implemented filtering logic in reparentutil package
- Integrated with VTOrc configuration and recovery logic
- Extended gRPC API to support new parameters
Upstream PR: vitessio#18707
Related Issue: vitessio#18529
Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Stephen Baker <s.baker@slack-corp.com>
Codecov Report
@@ Coverage Diff @@
## slack-22.0 #797 +/- ##
==============================================
- Coverage 69.77% 69.75% -0.03%
==============================================
Files 1605 1605
Lines 213999 214103 +104
==============================================
+ Hits 149324 149348 +24
- Misses 64675 64755 +80
MAJORITY mode is what most operators would want: it prevents a minority of lagging tablets from blocking ERS operations while still ensuring data consistency.
Changes:
- VTOrc config default: ALL -> MAJORITY
- reduceValidCandidates: DEFAULT now falls through to MAJORITY
- parseWaitForRelayLogsMode: default case returns MAJORITY
Operators can still explicitly set ALL mode if they want to wait for all tablets.
Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Stephen Baker <s.baker@slack-corp.com>
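To illustrate the default change in this commit, here is a minimal sketch of a parser whose fallback case resolves to MAJORITY. The type, constants, and the parseMode function are hypothetical stand-ins written for this example, not the identifiers or code from the PR, and the real parseWaitForRelayLogsMode may map values differently.

```go
package main

import (
	"fmt"
	"strings"
)

// WaitForRelayLogsMode is a hypothetical stand-in for the protobuf enum
// described in this PR; the real generated type is not reproduced here.
type WaitForRelayLogsMode int

const (
	ModeDefault WaitForRelayLogsMode = iota
	ModeAll
	ModeMajority
	ModeCount
)

// parseMode maps a flag value to a mode. Anything other than ALL or COUNT
// (including DEFAULT, MAJORITY, empty, or unrecognized input) resolves to
// ModeMajority, which is an assumption modelled on the commit message above.
func parseMode(s string) WaitForRelayLogsMode {
	switch strings.ToUpper(strings.TrimSpace(s)) {
	case "ALL":
		return ModeAll
	case "COUNT":
		return ModeCount
	default:
		return ModeMajority
	}
}

func main() {
	for _, v := range []string{"ALL", "count", "DEFAULT", ""} {
		fmt.Printf("%q -> %d\n", v, parseMode(v))
	}
}
```

Under this sketch, an operator has to pass ALL explicitly to get the pre-change behavior; every other unrecognized value falls back to MAJORITY.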
Summary
Backports upstream PR vitessio#18707 to enable EmergencyReparentShard (ERS) operations to ignore a minority of severely lagging tablets that would otherwise cause the entire reparent to time out.
Problem
Severely lagging replicas can block ERS operations by timing out during relay log application. This creates operational issues where a single slow replica prevents emergency recovery.
Solution
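This change introduces a WaitForRelayLogsMode enum with three modes:
- ALL: Wait for all tablets (the previous behavior)
- MAJORITY: Wait only for a majority of the most-advanced tablets
- COUNT: Wait for the exact number of tablets specified by the operator
Default behavior: MAJORITY mode. This is what most operators would want, as it prevents a minority of lagging tablets from blocking ERS while still ensuring data consistency.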
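The three modes can be pictured as a filter over the candidate tablets. The sketch below is an illustration under stated assumptions, not the reparentutil implementation: Candidate, Mode, and reduceCandidates are hypothetical names, and a plain integer Position stands in for the real replication position comparison.

```go
package main

import (
	"fmt"
	"sort"
)

// Mode is a hypothetical stand-in for the WaitForRelayLogsMode enum.
type Mode int

const (
	ModeAll Mode = iota
	ModeMajority
	ModeCount
)

// Candidate is a simplified tablet. Position is an assumed integer proxy
// for how far the tablet has advanced; real positions are not plain integers.
type Candidate struct {
	Alias    string
	Position int64
}

// reduceCandidates returns the most-advanced tablets to wait for:
// all of them, a majority (n/2 + 1), or an operator-specified count.
func reduceCandidates(cands []Candidate, mode Mode, count int) ([]Candidate, error) {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]Candidate(nil), cands...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Position > sorted[j].Position
	})

	keep := len(sorted)
	switch mode {
	case ModeMajority:
		keep = len(sorted)/2 + 1
	case ModeCount:
		if count < 1 || count > len(sorted) {
			return nil, fmt.Errorf("count %d out of range for %d candidates", count, len(sorted))
		}
		keep = count
	}
	if keep > len(sorted) {
		keep = len(sorted) // covers the empty-input case in majority mode
	}
	return sorted[:keep], nil
}

func main() {
	cands := []Candidate{
		{Alias: "a", Position: 100},
		{Alias: "b", Position: 98},
		{Alias: "c", Position: 97},
		{Alias: "d", Position: 40}, // severely lagging
		{Alias: "e", Position: 99},
	}
	kept, err := reduceCandidates(cands, ModeMajority, 0)
	if err != nil {
		panic(err)
	}
	for _, c := range kept {
		fmt.Println(c.Alias, c.Position)
	}
}
```

With five candidates, majority mode in this sketch keeps the three most-advanced tablets, so the lagging tablet "d" no longer gates the wait.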
Configuration
VTOrc operators can configure this via:
- --wait-for-relaylogs-mode (DEFAULT, ALL, MAJORITY, or COUNT); defaults to MAJORITY
- --wait-for-relaylogs-tablet-count (when using COUNT mode)
The gRPC API also supports these parameters in EmergencyReparentShardRequest.
Default Behavior Change
Operators who want the old behavior (wait for all tablets) can explicitly set --wait-for-relaylogs-mode=ALL.
Changes
- Added WaitForRelayLogsMode enum to protobuf definitions
Testing
References
- EmergencyReparentShard: support ignoring minority/count of lagging tablets vitessio/vitess#18707
- EmergencyReparentShard to fail vitessio/vitess#18529
- EmergencyReparentShard: include SQL thread position in most-advanced candidate selection vitessio/vitess#18531 (already merged into v22)
Implementation by Claude Code with direction from @s.baker