Skip to content

Comments

[slack-22.0] Improve PlannedReparentShard timeout error messages#786

Merged
tanjinx merged 6 commits intoslack-22.0from
tanjin-prs-error-details
Feb 2, 2026
Merged

[slack-22.0] Improve PlannedReparentShard timeout error messages#786
tanjinx merged 6 commits intoslack-22.0from
tanjin-prs-error-details

Conversation

@tanjinx
Copy link

@tanjinx tanjinx commented Jan 30, 2026

Add detailed context to timeout errors in PlannedReparentShard to help operators diagnose failures. Changes include:

  • Include tablet type in replica reparent failures

This helps quickly identify whether timeouts are due to unreachable tablets, slow replication, or semi-sync configuration issues.

before:

Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0130 10:41:07.622577 1661946 legacy_shim.go:97] remote error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0130 10:41:07.627387 1661946 main.go:56] remote error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

after:

E0130 13:37:55.386095 1799345 main.go:56] rpc error: code = Unknown desc = failed to verify tablet us_east_1e-0171543813 is reachable: rpc error: code = Unknown desc = TabletManager.GetGlobalStatusVars on us_east_1e-0171543813: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: resource temporarily unavailable (errno 2002) (sqlstate HY000)

Description

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

@github-actions github-actions bot added this to the v22.0.2 milestone Jan 30, 2026
…arentShard

Add detailed context to errors in verifyAllTabletsReachable to help
operators identify which specific tablet is unreachable during PRS.

Changes:
- Show which tablet alias is unreachable
- Include timeout value in error message
- Distinguish between timeout and other reachability failures

This helps quickly identify the problematic tablet instead of just
seeing "context deadline exceeded".

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
@tanjinx tanjinx force-pushed the tanjin-prs-error-details branch from 48336d0 to 4ada9da Compare January 30, 2026 19:44
tanjinx and others added 3 commits January 30, 2026 12:58
The error was being logged three times:
1. fmt.Printf in legacy_shim.go (user-friendly output)
2. log.Error in legacy_shim.go (duplicate)
3. log.Error in main.go (catch-all for all commands)

Removed the duplicate log.Error call in legacy_shim.go and the
unused log import. Now errors are only logged once with timestamp
plus the user-friendly Printf output.

Before:
  Error: rpc error...
  E0130 12:55:57.535295 1766589 legacy_shim.go:97] remote error: rpc error...
  E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error...

After:
  Error: rpc error...
  E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error...

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Previously errors were shown twice:
1. fmt.Printf in legacy_shim.go (user-friendly output)
2. log.Error in main.go (timestamped log entry)

Now only the user-friendly error message is shown. The command still
exits with code 1 on errors by calling exit.Return(1), but returns nil
to prevent main.go from logging the error again.

Before:
  Error: rpc error: code = Unknown desc = timed out...
  E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error...

After:
  Error: rpc error: code = Unknown desc = timed out...

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Restore the original error logging behavior in legacy vtctl commands.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
@tanjinx tanjinx closed this Jan 30, 2026
@tanjinx tanjinx reopened this Jan 30, 2026
@tanjinx tanjinx marked this pull request as ready for review January 30, 2026 21:43
@tanjinx tanjinx requested a review from a team as a code owner January 30, 2026 21:43
Update TestPlannedReparentShard/tablet_unreachable to expect the new
improved error message that includes the tablet alias.

Before: "global status vars failed"
After: "failed to verify tablet zone1-0000000200 is reachable: global status vars failed"

This matches the improvement made to verifyAllTabletsReachable in
planned_reparenter.go which adds context about which tablet failed.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
@tanjinx tanjinx changed the title Improve PlannedReparentShard timeout error messages [slack-22.0] Improve PlannedReparentShard timeout error messages Jan 30, 2026
Changed verbose operational logs in EmergencyReparentShard from
erp.logger.Infof() to log.Infof() so they stay in vtctld server logs
instead of streaming back to the CLI client.

This dramatically reduces CLI output verbosity by keeping operational
details on the server, while still showing important user-facing
messages like errors and warnings.

Changed logs:
- "will initiate emergency reparent shard..."
- "Getting a new durability policy..."
- "getting replication position from..."
- "started finding the intermediate source"
- "finding intermediate source - sorted replica..."
- "intermediate source selected..."
- "intermediate source is ideal candidate..."
- "setting up %v as new primary for uninitialized cluster"
- "starting promotion for the new primary..."
- "populating reparent journal on new primary..."
- "setting new primary on replica..."
- "found better candidate..."
- "Removing %s from list of valid candidates..." (various reasons)
- "Setting %s in list of valid candidates taking a backup"
- "replica %v thinks it's primary but we failed to demote it..."

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
@codecov-commenter
Copy link

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.76%. Comparing base (73edc90) to head (6ec3c93).

Additional details and impacted files
@@             Coverage Diff             @@
##           slack-22.0     #786   +/-   ##
===========================================
  Coverage       69.76%   69.76%           
===========================================
  Files            1605     1605           
  Lines          213608   213612    +4     
===========================================
+ Hits           149018   149021    +3     
- Misses          64590    64591    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@tanjinx tanjinx merged commit 79471af into slack-22.0 Feb 2, 2026
86 of 93 checks passed
@tanjinx tanjinx deleted the tanjin-prs-error-details branch February 2, 2026 17:05
sbaker617 pushed a commit that referenced this pull request Feb 5, 2026
)

* Improve error messages when tablets are unreachable during PlannedReparentShard

Add detailed context to errors in verifyAllTabletsReachable to help
operators identify which specific tablet is unreachable during PRS.

Changes:
- Show which tablet alias is unreachable
- Include timeout value in error message
- Distinguish between timeout and other reachability failures

This helps quickly identify the problematic tablet instead of just
seeing "context deadline exceeded".

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Remove duplicate error logging in legacy vtctl commands

The error was being logged three times:
1. fmt.Printf in legacy_shim.go (user-friendly output)
2. log.Error in legacy_shim.go (duplicate)
3. log.Error in main.go (catch-all for all commands)

Removed the duplicate log.Error call in legacy_shim.go and the
unused log import. Now errors are only logged once with timestamp
plus the user-friendly Printf output.

Before:
  Error: rpc error...
  E0130 12:55:57.535295 1766589 legacy_shim.go:97] remote error: rpc error...
  E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error...

After:
  Error: rpc error...
  E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error...

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Show error only once in legacy vtctl commands

Previously errors were shown twice:
1. fmt.Printf in legacy_shim.go (user-friendly output)
2. log.Error in main.go (timestamped log entry)

Now only the user-friendly error message is shown. The command still
exits with code 1 on errors by calling exit.Return(1), but returns nil
to prevent main.go from logging the error again.

Before:
  Error: rpc error: code = Unknown desc = timed out...
  E0130 12:55:57.539447 1766589 main.go:56] remote error: rpc error...

After:
  Error: rpc error: code = Unknown desc = timed out...

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Revert changes to legacy_shim.go

Restore the original error logging behavior in legacy vtctl commands.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Update test expectation for improved error message

Update TestPlannedReparentShard/tablet_unreachable to expect the new
improved error message that includes the tablet alias.

Before: "global status vars failed"
After: "failed to verify tablet zone1-0000000200 is reachable: global status vars failed"

This matches the improvement made to verifyAllTabletsReachable in
planned_reparenter.go which adds context about which tablet failed.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

* Move verbose ERS operational logs from client to server logs

Changed verbose operational logs in EmergencyReparentShard from
erp.logger.Infof() to log.Infof() so they stay in vtctld server logs
instead of streaming back to the CLI client.

This dramatically reduces CLI output verbosity by keeping operational
details on the server, while still showing important user-facing
messages like errors and warnings.

Changed logs:
- "will initiate emergency reparent shard..."
- "Getting a new durability policy..."
- "getting replication position from..."
- "started finding the intermediate source"
- "finding intermediate source - sorted replica..."
- "intermediate source selected..."
- "intermediate source is ideal candidate..."
- "setting up %v as new primary for uninitialized cluster"
- "starting promotion for the new primary..."
- "populating reparent journal on new primary..."
- "setting new primary on replica..."
- "found better candidate..."
- "Removing %s from list of valid candidates..." (various reasons)
- "Setting %s in list of valid candidates taking a backup"
- "replica %v thinks it's primary but we failed to demote it..."

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>

---------

Signed-off-by: Tanjin Xu <tanjin.xu@slack-corp.com>
Co-authored-by: Claude <svc-devxp-claude@slack-corp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants