[`slack-22.0`] Improve PlannedReparentShard timeout error messages by tanjinx · Pull Request #786 · slackhq/vitess

tanjinx · 2026-01-30T19:35:33Z

Add detailed context to timeout errors in PlannedReparentShard to help operators diagnose failures. Changes include:

Include tablet type in replica reparent failures

This helps quickly identify whether timeouts are due to unreachable tablets, slow replication, or semi-sync configuration issues.

before:

Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0130 10:41:07.622577 1661946 legacy_shim.go:97] remote error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0130 10:41:07.627387 1661946 main.go:56] remote error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

after:

E0130 13:37:55.386095 1799345 main.go:56] rpc error: code = Unknown desc = failed to verify tablet us_east_1e-0171543813 is reachable: rpc error: code = Unknown desc = TabletManager.GetGlobalStatusVars on us_east_1e-0171543813: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: resource temporarily unavailable (errno 2002) (sqlstate HY000)

Description

Related Issue(s)

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

AI Disclosure

…arentShard Add detailed context to errors in verifyAllTabletsReachable to help operators identify which specific tablet is unreachable during PRS. Changes: - Show which tablet alias is unreachable - Include timeout value in error message - Distinguish between timeout and other reachability failures This helps quickly identify the problematic tablet instead of just seeing "context deadline exceeded". Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

The error was being logged three times: 1. fmt.Printf in legacy_shim.go (user-friendly output) 2. log.Error in legacy_shim.go (duplicate) 3. log.Error in main.go (catch-all for all commands) Removed the duplicate log.Error call in legacy_shim.go and the unused log import. Now errors are only logged once with timestamp plus the user-friendly Printf output. Before: Error: rpc error... E0130 12:55:57.535295 1766589 legacy_shim.go:97] remote error: rpc error... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... After: Error: rpc error... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

Previously errors were shown twice: 1. fmt.Printf in legacy_shim.go (user-friendly output) 2. log.Error in main.go (timestamped log entry) Now only the user-friendly error message is shown. The command still exits with code 1 on errors by calling exit.Return(1), but returns nil to prevent main.go from logging the error again. Before: Error: rpc error: code = Unknown desc = timed out... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... After: Error: rpc error: code = Unknown desc = timed out... Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

Restore the original error logging behavior in legacy vtctl commands. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

Update TestPlannedReparentShard/tablet_unreachable to expect the new improved error message that includes the tablet alias. Before: "global status vars failed" After: "failed to verify tablet zone1-0000000200 is reachable: global status vars failed" This matches the improvement made to verifyAllTabletsReachable in planned_reparenter.go which adds context about which tablet failed. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

Changed verbose operational logs in EmergencyReparentShard from erp.logger.Infof() to log.Infof() so they stay in vtctld server logs instead of streaming back to the CLI client. This dramatically reduces CLI output verbosity by keeping operational details on the server, while still showing important user-facing messages like errors and warnings. Changed logs: - "will initiate emergency reparent shard..." - "Getting a new durability policy..." - "getting replication position from..." - "started finding the intermediate source" - "finding intermediate source - sorted replica..." - "intermediate source selected..." - "intermediate source is ideal candidate..." - "setting up %v as new primary for uninitialized cluster" - "starting promotion for the new primary..." - "populating reparent journal on new primary..." - "setting new primary on replica..." - "found better candidate..." - "Removing %s from list of valid candidates..." (various reasons) - "Setting %s in list of valid candidates taking a backup" - "replica %v thinks it's primary but we failed to demote it..." Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

codecov-commenter · 2026-01-30T23:25:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.76%. Comparing base (73edc90) to head (6ec3c93).

Additional details and impacted files

@@             Coverage Diff             @@
##           slack-22.0     #786   +/-   ##
===========================================
  Coverage       69.76%   69.76%           
===========================================
  Files            1605     1605           
  Lines          213608   213612    +4     
===========================================
+ Hits           149018   149021    +3     
- Misses          64590    64591    +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

) * Improve error messages when tablets are unreachable during PlannedReparentShard Add detailed context to errors in verifyAllTabletsReachable to help operators identify which specific tablet is unreachable during PRS. Changes: - Show which tablet alias is unreachable - Include timeout value in error message - Distinguish between timeout and other reachability failures This helps quickly identify the problematic tablet instead of just seeing "context deadline exceeded". Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Remove duplicate error logging in legacy vtctl commands The error was being logged three times: 1. fmt.Printf in legacy_shim.go (user-friendly output) 2. log.Error in legacy_shim.go (duplicate) 3. log.Error in main.go (catch-all for all commands) Removed the duplicate log.Error call in legacy_shim.go and the unused log import. Now errors are only logged once with timestamp plus the user-friendly Printf output. Before: Error: rpc error... E0130 12:55:57.535295 1766589 legacy_shim.go:97] remote error: rpc error... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... After: Error: rpc error... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Show error only once in legacy vtctl commands Previously errors were shown twice: 1. fmt.Printf in legacy_shim.go (user-friendly output) 2. log.Error in main.go (timestamped log entry) Now only the user-friendly error message is shown. The command still exits with code 1 on errors by calling exit.Return(1), but returns nil to prevent main.go from logging the error again. Before: Error: rpc error: code = Unknown desc = timed out... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... After: Error: rpc error: code = Unknown desc = timed out... Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Revert changes to legacy_shim.go Restore the original error logging behavior in legacy vtctl commands. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Update test expectation for improved error message Update TestPlannedReparentShard/tablet_unreachable to expect the new improved error message that includes the tablet alias. Before: "global status vars failed" After: "failed to verify tablet zone1-0000000200 is reachable: global status vars failed" This matches the improvement made to verifyAllTabletsReachable in planned_reparenter.go which adds context about which tablet failed. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Move verbose ERS operational logs from client to server logs Changed verbose operational logs in EmergencyReparentShard from erp.logger.Infof() to log.Infof() so they stay in vtctld server logs instead of streaming back to the CLI client. This dramatically reduces CLI output verbosity by keeping operational details on the server, while still showing important user-facing messages like errors and warnings. Changed logs: - "will initiate emergency reparent shard..." - "Getting a new durability policy..." - "getting replication position from..." - "started finding the intermediate source" - "finding intermediate source - sorted replica..." - "intermediate source selected..." - "intermediate source is ideal candidate..." - "setting up %v as new primary for uninitialized cluster" - "starting promotion for the new primary..." - "populating reparent journal on new primary..." - "setting new primary on replica..." - "found better candidate..." - "Removing %s from list of valid candidates..." (various reasons) - "Setting %s in list of valid candidates taking a backup" - "replica %v thinks it's primary but we failed to demote it..." Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> --------- Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> Co-authored-by: Claude <svc-devxp-claude@slack-corp.com>

github-actions bot added this to the v22.0.2 milestone Jan 30, 2026

tanjinx force-pushed the tanjin-prs-error-details branch from 48336d0 to 4ada9da Compare January 30, 2026 19:44

tanjinx and others added 3 commits January 30, 2026 12:58

Revert changes to legacy_shim.go

c1dbed0

Restore the original error logging behavior in legacy vtctl commands. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

tanjinx closed this Jan 30, 2026

tanjinx reopened this Jan 30, 2026

tanjinx marked this pull request as ready for review January 30, 2026 21:43

tanjinx requested a review from a team as a code owner January 30, 2026 21:43

tanjinx added v22 upstream-me labels Jan 30, 2026

tanjinx changed the title ~~Improve PlannedReparentShard timeout error messages~~ [slack-22.0] Improve PlannedReparentShard timeout error messages Jan 30, 2026

yushuqin approved these changes Jan 31, 2026

View reviewed changes

tanjinx merged commit 79471af into slack-22.0 Feb 2, 2026
86 of 93 checks passed

tanjinx deleted the tanjin-prs-error-details branch February 2, 2026 17:05

sbaker617 mentioned this pull request Feb 5, 2026

Rebase yolo-rebase-22.03 onto v22.0.3 #790

Open

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

[`slack-22.0`] Improve PlannedReparentShard timeout error messages#786

[`slack-22.0`] Improve PlannedReparentShard timeout error messages#786
tanjinx merged 6 commits intoslack-22.0from
tanjin-prs-error-details

tanjinx commented Jan 30, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jan 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

Conversation

tanjinx commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue(s)

Checklist

Deployment Notes

AI Disclosure

Uh oh!

codecov-commenter commented Jan 30, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tanjinx commented Jan 30, 2026 •

edited

Loading