Backport: Add ERS support for ignoring minority lagging tablets (PR #18707) #797

Closed
sbaker617 wants to merge 2 commits into slack-22.0 from backport-18531-18707

@sbaker617 sbaker617 commented Feb 12, 2026

Summary

Backports upstream PR vitessio#18707 so that EmergencyReparentShard (ERS) operations can ignore a minority of severely lagging tablets that would otherwise cause the entire reparent to time out.

Problem

Severely lagging replicas can block ERS operations by timing out during relay log application. This creates operational issues where a single slow replica prevents emergency recovery.

Solution

This change introduces a WaitForRelayLogsMode enum with three modes:

  • MAJORITY (default): Wait only for a majority of the most-advanced tablets
  • ALL: Wait for all tablets
  • COUNT: Wait for the exact number of tablets specified by the operator

Default behavior: MAJORITY mode. This is what most operators would want, as it prevents a minority of lagging tablets from blocking ERS while still ensuring data consistency.
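To illustrate the three modes, here is a minimal Go sketch of how the set of tablets to wait for could be chosen. It is not the code from this PR: the `WaitMode` constants, the `candidate` type, and `tabletsToWaitFor` are hypothetical names, and replication positions are simplified to a single integer (real GTID positions are not a single number). The majority is assumed to be `n/2 + 1` of the candidates.

```go
package main

import (
	"fmt"
	"sort"
)

// WaitMode mirrors the three modes described above. These Go names are
// illustrative only; they are not the generated protobuf identifiers.
type WaitMode int

const (
	ModeMajority WaitMode = iota
	ModeAll
	ModeCount
)

// candidate is a hypothetical stand-in for a tablet and how far it has
// replicated. Position is simplified to one integer for this sketch.
type candidate struct {
	Alias    string
	Position int64
}

// tabletsToWaitFor sorts candidates from most to least advanced and returns
// the leading subset selected by the mode. count is only used in ModeCount.
func tabletsToWaitFor(cands []candidate, mode WaitMode, count int) ([]candidate, error) {
	if len(cands) == 0 {
		return nil, nil
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := make([]candidate, len(cands))
	copy(sorted, cands)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Position > sorted[j].Position
	})

	var n int
	switch mode {
	case ModeAll:
		n = len(sorted)
	case ModeCount:
		if count < 1 || count > len(sorted) {
			return nil, fmt.Errorf("count %d out of range [1, %d]", count, len(sorted))
		}
		n = count
	default: // ModeMajority (assumed to be n/2 + 1)
		n = len(sorted)/2 + 1
	}
	return sorted[:n], nil
}

func main() {
	cands := []candidate{
		{"zone1-101", 120},
		{"zone1-102", 118},
		{"zone1-103", 7}, // severely lagging replica
	}
	keep, err := tabletsToWaitFor(cands, ModeMajority, 0)
	if err != nil {
		panic(err)
	}
	for _, c := range keep {
		fmt.Println(c.Alias)
	}
}
```

With three candidates the majority is two, so the lagging `zone1-103` is left out of the wait set in this example, while `ModeAll` would still include it.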

Configuration

VTOrc operators can configure this via:

  • --wait-for-relaylogs-mode (DEFAULT, ALL, MAJORITY, or COUNT): defaults to MAJORITY; DEFAULT is treated as MAJORITY
  • --wait-for-relaylogs-tablet-count (when using COUNT mode)

The gRPC API also supports these parameters in EmergencyReparentShardRequest.
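As a rough sketch of how the flag value could map onto a mode, the Go snippet below follows the rule stated in this PR that the default case returns MAJORITY. The names `parseMode` and `validateCount` are hypothetical, and two behaviors are assumptions of this sketch rather than something the PR states: case-insensitive matching, and rejecting a non-positive tablet count in COUNT mode.

```go
package main

import (
	"fmt"
	"strings"
)

// WaitMode is an illustrative string type, not the protobuf enum from the PR.
type WaitMode string

const (
	ModeMajority WaitMode = "MAJORITY"
	ModeAll      WaitMode = "ALL"
	ModeCount    WaitMode = "COUNT"
)

// parseMode maps a flag value to a mode. Anything other than ALL or COUNT,
// including DEFAULT and the empty string, resolves to MAJORITY.
// Trimming and upper-casing the input is an assumption of this sketch.
func parseMode(s string) WaitMode {
	switch strings.ToUpper(strings.TrimSpace(s)) {
	case "ALL":
		return ModeAll
	case "COUNT":
		return ModeCount
	default:
		return ModeMajority
	}
}

// validateCount is a hypothetical check that COUNT mode comes with a usable
// tablet count; the other modes ignore the count entirely.
func validateCount(mode WaitMode, count int) error {
	if mode == ModeCount && count < 1 {
		return fmt.Errorf("COUNT mode needs a positive tablet count, got %d", count)
	}
	return nil
}

func main() {
	for _, v := range []string{"", "DEFAULT", "ALL", "COUNT"} {
		fmt.Printf("%q resolves to %s\n", v, parseMode(v))
	}
	if err := validateCount(parseMode("COUNT"), 0); err != nil {
		fmt.Println("error:", err)
	}
}
```

Falling back to MAJORITY for unrecognized input keeps the sketch in line with the new default, though a stricter parser could reject unknown values instead.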

Default Behavior Change

⚠️ This changes default ERS behavior: Previously, ERS would wait for ALL tablets. Now it defaults to MAJORITY mode, which means severely lagging tablets will be ignored if they're in the minority.

Operators who want the old behavior (wait for all tablets) can explicitly set --wait-for-relaylogs-mode=ALL.

Changes

  • Added WaitForRelayLogsMode enum to protobuf definitions
  • Implemented filtering logic in reparentutil package (sorting, majority calculation)
  • Integrated with VTOrc configuration and recovery logic
  • Extended gRPC API to support new parameters
  • Default mode set to MAJORITY

Testing

  • All existing tests pass
  • Builds successfully
  • Backward compatibility: operators can explicitly use ALL mode

References

  • Upstream PR: vitessio#18707
  • Related issue: vitessio#18529

Implementation by Claude Code with direction from @s.baker

Backports upstream PR vitessio#18707 to enable EmergencyReparentShard operations
to ignore a minority of severely lagging tablets that would otherwise
cause the entire reparent to time out.

This adds a WaitForRelayLogsMode enum with three modes:
- ALL: Wait for all tablets (current behavior, default)
- MAJORITY: Wait only for majority of most-advanced tablets
- COUNT: Wait for exact number specified by operator

The feature is fully backward compatible - without configuration changes,
the system behaves exactly as before (ALL mode).

Changes:
- Added WaitForRelayLogsMode enum to protobuf definitions
- Implemented filtering logic in reparentutil package
- Integrated with VTOrc configuration and recovery logic
- Extended gRPC API to support new parameters

Upstream PR: vitessio#18707
Related Issue: vitessio#18529

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Stephen Baker <s.baker@slack-corp.com>
@github-actions github-actions bot added this to the v22.0.3 milestone Feb 12, 2026
@sbaker617 sbaker617 requested a review from tanjinx February 12, 2026 22:34
@codecov-commenter
Codecov Report

❌ Patch coverage is 19.46903% with 91 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.75%. Comparing base (946a513) to head (1ccdbd8).

Files with missing lines                   Patch %   Lines
go/vt/vtctl/reparentutil/replication.go    8.69%     63 Missing ⚠️
go/vt/vtorc/config/config.go               0.00%     14 Missing ⚠️
go/vt/vtorc/logic/topology_recovery.go     0.00%     12 Missing ⚠️
go/cmd/vtorc/cli/cli.go                    0.00%     2 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##           slack-22.0     #797      +/-   ##
==============================================
- Coverage       69.77%   69.75%   -0.03%     
==============================================
  Files            1605     1605              
  Lines          213999   214103     +104     
==============================================
+ Hits           149324   149348      +24     
- Misses          64675    64755      +80     

MAJORITY mode is what most operators would want - it prevents minority
lagging tablets from blocking ERS operations while still ensuring data
consistency.

Changes:
- VTOrc config default: ALL -> MAJORITY
- reduceValidCandidates: DEFAULT now falls through to MAJORITY
- parseWaitForRelayLogsMode: default case returns MAJORITY

Operators can still explicitly set ALL mode if they want to wait for
all tablets.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Stephen Baker <s.baker@slack-corp.com>
@sbaker617 sbaker617 closed this Feb 13, 2026
@sbaker617 sbaker617 deleted the backport-18531-18707 branch February 13, 2026 00:06