[slack-22.0] Improve PlannedReparentShard timeout error messages#786
Merged
tanjinx merged 6 commits intoslack-22.0from Feb 2, 2026
Merged
[slack-22.0] Improve PlannedReparentShard timeout error messages#786tanjinx merged 6 commits intoslack-22.0from
slack-22.0] Improve PlannedReparentShard timeout error messages#786tanjinx merged 6 commits intoslack-22.0from
Conversation
…arentShard Add detailed context to errors in verifyAllTabletsReachable to help operators identify which specific tablet is unreachable during PRS. Changes: - Show which tablet alias is unreachable - Include timeout value in error message - Distinguish between timeout and other reachability failures This helps quickly identify the problematic tablet instead of just seeing "context deadline exceeded". Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
48336d0 to
4ada9da
Compare
The error was being logged three times: 1. fmt.Printf in legacy_shim.go (user-friendly output) 2. log.Error in legacy_shim.go (duplicate) 3. log.Error in main.go (catch-all for all commands) Removed the duplicate log.Error call in legacy_shim.go and the unused log import. Now errors are only logged once with timestamp plus the user-friendly Printf output. Before: Error: rpc error... E0130 12:55:57.535295 1766589 legacy_shim.go:97] remote error: rpc error... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... After: Error: rpc error... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Previously errors were shown twice: 1. fmt.Printf in legacy_shim.go (user-friendly output) 2. log.Error in main.go (timestamped log entry) Now only the user-friendly error message is shown. The command still exits with code 1 on errors by calling exit.Return(1), but returns nil to prevent main.go from logging the error again. Before: Error: rpc error: code = Unknown desc = timed out... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... After: Error: rpc error: code = Unknown desc = timed out... Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Restore the original error logging behavior in legacy vtctl commands. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Update TestPlannedReparentShard/tablet_unreachable to expect the new improved error message that includes the tablet alias. Before: "global status vars failed" After: "failed to verify tablet zone1-0000000200 is reachable: global status vars failed" This matches the improvement made to verifyAllTabletsReachable in planned_reparenter.go which adds context about which tablet failed. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
slack-22.0] Improve PlannedReparentShard timeout error messages
Changed verbose operational logs in EmergencyReparentShard from erp.logger.Infof() to log.Infof() so they stay in vtctld server logs instead of streaming back to the CLI client. This dramatically reduces CLI output verbosity by keeping operational details on the server, while still showing important user-facing messages like errors and warnings. Changed logs: - "will initiate emergency reparent shard..." - "Getting a new durability policy..." - "getting replication position from..." - "started finding the intermediate source" - "finding intermediate source - sorted replica..." - "intermediate source selected..." - "intermediate source is ideal candidate..." - "setting up %v as new primary for uninitialized cluster" - "starting promotion for the new primary..." - "populating reparent journal on new primary..." - "setting new primary on replica..." - "found better candidate..." - "Removing %s from list of valid candidates..." (various reasons) - "Setting %s in list of valid candidates taking a backup" - "replica %v thinks it's primary but we failed to demote it..." Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## slack-22.0 #786 +/- ##
===========================================
Coverage 69.76% 69.76%
===========================================
Files 1605 1605
Lines 213608 213612 +4
===========================================
+ Hits 149018 149021 +3
- Misses 64590 64591 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
yushuqin
approved these changes
Jan 31, 2026
sbaker617
pushed a commit
that referenced
this pull request
Feb 5, 2026
) * Improve error messages when tablets are unreachable during PlannedReparentShard Add detailed context to errors in verifyAllTabletsReachable to help operators identify which specific tablet is unreachable during PRS. Changes: - Show which tablet alias is unreachable - Include timeout value in error message - Distinguish between timeout and other reachability failures This helps quickly identify the problematic tablet instead of just seeing "context deadline exceeded". Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Remove duplicate error logging in legacy vtctl commands The error was being logged three times: 1. fmt.Printf in legacy_shim.go (user-friendly output) 2. log.Error in legacy_shim.go (duplicate) 3. log.Error in main.go (catch-all for all commands) Removed the duplicate log.Error call in legacy_shim.go and the unused log import. Now errors are only logged once with timestamp plus the user-friendly Printf output. Before: Error: rpc error... E0130 12:55:57.535295 1766589 legacy_shim.go:97] remote error: rpc error... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... After: Error: rpc error... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Show error only once in legacy vtctl commands Previously errors were shown twice: 1. fmt.Printf in legacy_shim.go (user-friendly output) 2. log.Error in main.go (timestamped log entry) Now only the user-friendly error message is shown. The command still exits with code 1 on errors by calling exit.Return(1), but returns nil to prevent main.go from logging the error again. Before: Error: rpc error: code = Unknown desc = timed out... E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error... After: Error: rpc error: code = Unknown desc = timed out... Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Revert changes to legacy_shim.go Restore the original error logging behavior in legacy vtctl commands. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Update test expectation for improved error message Update TestPlannedReparentShard/tablet_unreachable to expect the new improved error message that includes the tablet alias. Before: "global status vars failed" After: "failed to verify tablet zone1-0000000200 is reachable: global status vars failed" This matches the improvement made to verifyAllTabletsReachable in planned_reparenter.go which adds context about which tablet failed. Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> * Move verbose ERS operational logs from client to server logs Changed verbose operational logs in EmergencyReparentShard from erp.logger.Infof() to log.Infof() so they stay in vtctld server logs instead of streaming back to the CLI client. This dramatically reduces CLI output verbosity by keeping operational details on the server, while still showing important user-facing messages like errors and warnings. Changed logs: - "will initiate emergency reparent shard..." - "Getting a new durability policy..." - "getting replication position from..." - "started finding the intermediate source" - "finding intermediate source - sorted replica..." - "intermediate source selected..." - "intermediate source is ideal candidate..." - "setting up %v as new primary for uninitialized cluster" - "starting promotion for the new primary..." - "populating reparent journal on new primary..." - "setting new primary on replica..." - "found better candidate..." - "Removing %s from list of valid candidates..." (various reasons) - "Setting %s in list of valid candidates taking a backup" - "replica %v thinks it's primary but we failed to demote it..." Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com> Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> --------- Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com> Co-authored-by: Claude <svc-devxp-claude@slack-corp.com>
10 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add detailed context to timeout errors in PlannedReparentShard to help operators diagnose failures. Changes include:
This helps quickly identify whether timeouts are due to unreachable tablets, slow replication, or semi-sync configuration issues.
before:
after:
Description
Related Issue(s)
Checklist
Deployment Notes
AI Disclosure