feat: add retry logic to MCP client connection establishment and tool retrieval #1568
Conversation
📝 Walkthrough

Adds exponential backoff retry logic for MCP client connection, transport start, initialization, and tool retrieval; increases connection establishment timeout. Removes explicit logger parameters from provider ListModels flows in favor of internal logger access. Adds NoOpLogger, updates model-catalog pricing reload/sync, and related docs and changelogs.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CM as ClientManager
    participant RW as RetryWrapper (ExecuteWithRetry)
    participant EC as ExternalClient
    participant TR as Transport
    participant INIT as Initialize()
    CM->>RW: connectToMCPClient(ctx)
    RW->>EC: create fresh ExternalClient (per attempt)
    RW->>TR: create Transport(using EC)
    RW->>TR: Start(ctx with per-attempt timeout)
    alt Start succeeds
        TR-->>RW: started
        RW->>INIT: Initialize(ctx with per-attempt timeout)
        alt Initialize succeeds
            INIT-->>RW: initialized
            RW-->>CM: return connected EC
        else Initialize transient error
            INIT-->>RW: transient error
            RW->>EC: Close() & cleanup
            RW->>RW: wait exponential backoff then retry
        else Initialize permanent error
            INIT-->>RW: permanent error
            RW->>EC: Close() & cleanup
            RW-->>CM: return error
        end
    else Start transient error
        TR-->>RW: transient error
        RW->>EC: Close() & cleanup
        RW->>RW: wait exponential backoff then retry
    else Start permanent error
        TR-->>RW: permanent error
        RW->>EC: Close() & cleanup
        RW-->>CM: return error
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@core/logger.go`:
- Around line 124-138: NoOpLogger.Fatal currently is a no-op which violates the
schemas.Logger contract that Fatal must terminate the program; update
NoOpLogger.Fatal to honor that contract (e.g., log the message then terminate
with os.Exit(1) or panic) so callers relying on termination behave correctly;
modify the implementation in the NoOpLogger type (the Fatal method) and apply
the same change consistently to other no-op logger implementations referenced
(e.g., those in core/mcp/init.go and core/providers/utils/utils.go) so all
implementations of schemas.Logger.Fatal terminate the process as documented.
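For illustration, here is a minimal sketch of a no-op logger whose Fatal still terminates the process. The interface shape below is an assumption simplified from context, not the repository's actual schemas.Logger definition.

```go
package logging

import (
	"fmt"
	"os"
)

// Logger is a simplified stand-in for the schemas.Logger interface,
// assumed from context; the real interface likely has more methods.
type Logger interface {
	Debug(format string, args ...any)
	Warn(format string, args ...any)
	Error(format string, args ...any)
	Fatal(format string, args ...any)
}

// NoOpLogger discards Debug/Warn/Error output, but Fatal still
// terminates the process so the Logger contract is honored.
type NoOpLogger struct{}

func NewNoOpLogger() *NoOpLogger { return &NoOpLogger{} }

func (l *NoOpLogger) Debug(format string, args ...any) {}
func (l *NoOpLogger) Warn(format string, args ...any)  {}
func (l *NoOpLogger) Error(format string, args ...any) {}

// Fatal writes the message to stderr and exits, even though all other
// methods are no-ops, because callers rely on Fatal terminating.
func (l *NoOpLogger) Fatal(format string, args ...any) {
	fmt.Fprintf(os.Stderr, "FATAL: "+format+"\n", args...)
	os.Exit(1)
}
```

Panicking instead of calling os.Exit(1) would also satisfy the termination requirement while remaining recoverable in tests.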
In `@core/mcp/clientmanager.go`:
- Around line 552-568: When Start of the created externalClient (call to
externalClient.Start inside ExecuteWithRetry) fails or later initialization
fails, the code currently only cancels the context and returns, leaking
resources; update both error paths to close the created client: call
externalClient.Close(ctx) (or externalClient.Close() if that signature is used)
and log any Close error before returning, mirroring the cleanup pattern used
later (see the cleanup at the end of the function). Ensure this is applied for
the transport-start failure block (after ExecuteWithRetry around
externalClient.Start) and for the initialization failure path (around the
initialization step referenced near line 609), and keep the conditional cancel()
for STDIO/SSE connection types.
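As a rough sketch of the cleanup pattern this comment describes, assuming placeholder interfaces rather than the actual mcp-go client and logger types:

```go
package mcpretry

import (
	"context"
	"fmt"
)

// mcpClient and logger are hypothetical interfaces standing in for the
// real externalClient and m.logger; the actual signatures may differ.
type mcpClient interface {
	Start(ctx context.Context) error
	Initialize(ctx context.Context) error
	Close(ctx context.Context) error
}

type logger interface {
	Warn(format string, args ...any)
}

// connectOnce shows the cleanup pattern: close the client on every
// error path so a failed Start or Initialize never leaks resources.
func connectOnce(ctx context.Context, c mcpClient, log logger) error {
	if err := c.Start(ctx); err != nil {
		if closeErr := c.Close(ctx); closeErr != nil {
			log.Warn("failed to close client after start error: %v", closeErr)
		}
		return fmt.Errorf("transport start failed: %w", err)
	}
	if err := c.Initialize(ctx); err != nil {
		if closeErr := c.Close(ctx); closeErr != nil {
			log.Warn("failed to close client after init error: %v", closeErr)
		}
		return fmt.Errorf("initialization failed: %w", err)
	}
	return nil
}
```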
In `@core/mcp/utils.go`:
- Around line 237-283: The retry loop doubles the backoff before sleeping so
InitialBackoff is never used; change the logic in the loop that uses
backoff/nextBackoff (variables in this block) to sleep on the current backoff,
then after the sleep update backoff = min(backoff*2, config.MaxBackoff).
Specifically, remove computing nextBackoff := backoff * 2 before the time.After,
use time.After(backoff) for the wait (and log backoff), then set backoff =
nextBackoff (capped by config.MaxBackoff) for the next iteration; keep checks
for ctx.Done(), isTransientError, and attempt == config.MaxRetries as-is.
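A minimal, self-contained version of the corrected ordering (sleep on the current backoff, then double with a cap) could look like the following; the RetryConfig field names mirror this PR, but the exact shape is an assumption:

```go
package mcpretry

import (
	"context"
	"time"
)

// RetryConfig mirrors the fields referenced in the review; the exact
// struct in core/mcp may differ.
type RetryConfig struct {
	MaxRetries     int
	InitialBackoff time.Duration
	MaxBackoff     time.Duration
}

// executeWithRetry sleeps on the *current* backoff (starting at
// InitialBackoff) and only doubles it afterwards, capped at MaxBackoff.
func executeWithRetry(ctx context.Context, op func() error, cfg RetryConfig, isTransient func(error) bool) error {
	backoff := cfg.InitialBackoff
	var err error
	for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if !isTransient(err) || attempt == cfg.MaxRetries {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff): // wait the current backoff first...
		}
		backoff *= 2 // ...then double for the next iteration
		if backoff > cfg.MaxBackoff {
			backoff = cfg.MaxBackoff
		}
	}
	return err
}
```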
In `@docs/mcp/connecting-to-servers.mdx`:
- Around line 643-650: Docs list HTTP 504 as a transient error but the code's
transient list in isTransientError (in utils.go) only contains
"503","502","429","500"; update the code to match docs by adding "504" to the
explicit transient HTTP status set used by isTransientError, or alternatively
remove "504" from the docs—preferably add "504" to the
transientStatuses/explicit transient list inside isTransientError so 504 Gateway
Timeout is treated as transient alongside 500/502/503/429.
- Around line 630-637: The backoff table is inconsistent with the
implementation: remove "Attempt 7" because DefaultRetryConfig.MaxRetries = 5
yields only 6 attempts total (1 initial + 5 retries), and update the first wait
entry to match the actual behavior in ExecuteWithRetry (if the backoff bug is
unfixed set Attempt 2 wait to 2s; if you've patched ExecuteWithRetry to use
exponential base-2 backoff set Attempt 2 wait to 1s). Ensure the doc references
DefaultRetryConfig.MaxRetries and ExecuteWithRetry when you adjust the table so
future changes remain consistent.
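Relating to the 504 comment above, a sketch of an explicit transient-status set that includes 504 Gateway Timeout. The set-based form and integer codes are illustrative assumptions, since the actual isTransientError works on error strings:

```go
package mcpretry

// transientHTTPStatuses is a sketch of the explicit transient set the
// docs describe; 504 is included alongside 500/502/503/429.
var transientHTTPStatuses = map[int]bool{
	429: true, // Too Many Requests
	500: true, // Internal Server Error
	502: true, // Bad Gateway
	503: true, // Service Unavailable
	504: true, // Gateway Timeout
}

// isTransientStatus reports whether an HTTP status code should be retried.
func isTransientStatus(code int) bool {
	return transientHTTPStatuses[code]
}
```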
🧹 Nitpick comments (2)
core/mcp/utils.go (1)
211-213: Defaulting to transient (retryable) for unrecognized errors is aggressive. Any error not explicitly classified — including configuration/programming bugs (e.g. `"unknown connection type"` from the transport switch's `default` branch) — will be retried 5 times with long backoffs. A safer default is `return false`, relying on the explicit transient list to opt errors in to retries.

Proposed change:

```diff
-	// Default: treat as transient to be safe (connection-related errors)
-	// This ensures we retry unknown errors that are likely transient
-	return true
+	// Default: treat as permanent to avoid retrying non-transient errors
+	// (e.g., programming bugs, unknown config errors)
+	return false
```

core/mcp/clientmanager.go (1)

554-562: Shared timeout context limits effective retry count. For HTTP connections, `ctx` at line 558 has a single 60s timeout created at line 548. Each retry calls `externalClient.Start(ctx)` with this same context, so the total time across all retry attempts is capped at 60s — not per attempt. With backoff waits summing to 60s alone, only ~2 retries are realistically possible before `context.DeadlineExceeded` fires (which `isTransientError` classifies as permanent, stopping retries). Similarly, `initCtx` at line 603 for SSE/STDIO has a single 60s timeout. If the intent is to allow each attempt its own time budget, create a per-attempt timeout context inside the retry closure.

Also applies to: 598-608
Force-pushed 4a5372e → e8f4299
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@core/mcp/clientmanager.go`:
- Around line 604-626: The initCtx deadline is created once and reused across
ExecuteWithRetry attempts so slow first attempts can exhaust the deadline and
prevent further retries; move creation of initCtx (and its cancel) inside the
retry closure passed to ExecuteWithRetry so each attempt gets a fresh
per-attempt timeout (e.g., recreate initCtx with the 60s deadline inside the
anonymous func used by ExecuteWithRetry), ensure you call cancel() after each
attempt to avoid leaks, and keep existing cleanup logic (externalClient.Close()
and logging) unchanged; this preserves expected retry behavior and works with
isTransientError/DefaultRetryConfig semantics.
- Around line 59-72: ReconnectClient currently wraps connectToMCPClient in an
outer ExecuteWithRetry using DefaultRetryConfig, causing nested/compounding
retries since connectToMCPClient already performs its own ExecuteWithRetry
calls; remove the outer ExecuteWithRetry and call m.connectToMCPClient(config)
directly from ReconnectClient (mirroring AddClient), then handle and return any
error (e.g., fmt.Errorf with id and err) so retries are only the internal
per-step retries managed by connectToMCPClient.
- Around line 552-574: The current retry wraps externalClient.Start() (and
similarly externalClient.Initialize()) with ExecuteWithRetry which repeatedly
calls the one-time lifecycle methods on the same *client instance, causing
resource leaks; change the logic so each retry attempt uses a fresh client: on
Start/Initialize error close the failing externalClient (if non-nil), recreate a
new client instance (call the existing NewClient / client construction code),
and then call Start()/Initialize() once on that new instance inside the retry
loop (or alternatively move the retry loop above NewClient so callers create a
new client per attempt); ensure you still cancel the long-lived context
(cancel()) for SSE/STDIO failures and preserve the existing cleanup logging
(m.logger.Warn) when closing clients.
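The three comments above (per-attempt timeout, no nested retries in ReconnectClient, fresh client per attempt with cleanup) combine into one pattern. The sketch below reuses the placeholder mcpClient, logger, RetryConfig, and executeWithRetry definitions from the earlier sketches and is an assumption about shape, not the repository's code; a caller such as ReconnectClient would invoke it once, without wrapping it in another ExecuteWithRetry.

```go
package mcpretry

import (
	"context"
	"time"
)

// connectWithRetry builds a fresh client for each attempt, gives each
// attempt its own timeout derived from the long-lived context, and
// closes the client whenever an attempt fails. newClient and the 60s
// budget are placeholders for the real constructor and constant.
func connectWithRetry(longLivedCtx context.Context, newClient func() (mcpClient, error), cfg RetryConfig, isTransient func(error) bool, log logger) (mcpClient, error) {
	var connected mcpClient
	err := executeWithRetry(longLivedCtx, func() error {
		c, err := newClient() // fresh client per attempt
		if err != nil {
			return err
		}
		attemptCtx, cancel := context.WithTimeout(longLivedCtx, 60*time.Second)
		defer cancel()
		if err := c.Start(attemptCtx); err != nil {
			if closeErr := c.Close(attemptCtx); closeErr != nil {
				log.Warn("close after failed start: %v", closeErr)
			}
			return err
		}
		connected = c
		return nil
	}, cfg, isTransient)
	if err != nil {
		return nil, err
	}
	return connected, nil
}
```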
🧹 Nitpick comments (1)
core/mcp/utils.go (1)
141-214: Substring-based error classification is fragile — bare HTTP status codes can false-match.

The permanent/transient lists use short substrings like `"500"`, `"400"`, `"503"`, `"429"`, `"not found"`. These can inadvertently match port numbers, request IDs, or other numeric fragments in error messages (e.g., `"connection to port 5003 failed"` matches `"500"`, and `"max 4004 connections"` matches `"400"`).

Additionally, the `default: return true` on line 213 means every error not matched by either list is treated as transient. This causes clearly permanent errors (e.g., `"unknown connection type"` returned from `connectToMCPClient`) to be retried needlessly.

Consider using word-boundary-aware matching or checking for structured error types/codes before falling back to string matching. If string matching is kept, padding status codes (e.g., `"HTTP 500"`, `"status 400"`) would reduce false positives.
Force-pushed e8f4299 → 0b694e7
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@core/logger.go`:
- Around line 124-138: There are three duplicated private noop logger
implementations that should be removed and replaced with the public NoOpLogger;
update the other modules to use NewNoOpLogger() instead of their private
types/constructors, delete the duplicate noopLogger type definitions and methods
(the private noop implementations), and ensure any places constructing or
referencing those private noop loggers now call NewNoOpLogger() and use the
schemas.Logger interface; keep the NoOpLogger methods/signature as-is so callers
don’t need changes beyond replacing constructor calls.
In `@core/mcp/clientmanager.go`:
- Around line 512-547: The retry loop currently calls externalClient.Start(ctx)
using the long-lived context (longLivedCtx) with no per-attempt deadline; update the
ExecuteWithRetry closure so each attempt derives a per-attempt timeout context
from longLivedCtx (e.g., via context.WithTimeout), use that perAttemptCtx when
calling externalClient.Start, ensure you call the cancel() after Start returns
and still close the previous externalClient on retry failures; keep the rest of
the retry logic (transportRetryConfig,
m.createHTTPConnection/m.createSTDIOConnection/m.createSSEConnection/m.createInProcessConnection
and externalClient.Close) the same so each attempt uses a fresh client and times
out reliably.
In `@docs/mcp/connecting-to-servers.mdx`:
- Around line 629-635: The documentation's "Backoff Progression" no longer
matches the corrected behavior in ExecuteWithRetry with DefaultRetryConfig;
update the list/table to show that ExecuteWithRetry sleeps the current backoff
starting from InitialBackoff = 1s before doubling, producing waits: Attempt 1
(no wait), Attempt 2: 1s, Attempt 3: 2s, Attempt 4: 4s, Attempt 5: 8s, Attempt
6: 16s. Replace the old 2s→4s→8s→16s→30s sequence with this corrected
progression and mention InitialBackoff and DefaultRetryConfig as the source of
the values.
In `@plugins/maxim/go.mod`:
- Around line 11-14: Remove the unused direct dependency
github.com/bytedance/sonic v1.14.2 from the require block in
plugins/maxim/go.mod (leave github.com/google/uuid v1.6.0 intact), verify there
are no imports of "github.com/bytedance/sonic" in the code (e.g., search the
plugins/maxim package and files like main.go), then run go mod tidy to let Go
resolve transitive deps and update go.sum accordingly.
🧹 Nitpick comments (2)
core/mcp/utils.go (2)
141-214: Default-to-transient (line 213) is a bold choice — verify it's intentional.

`isTransientError` returns `true` for any error that doesn't match the explicit permanent or transient lists. This means any unknown/unexpected error will be retried up to 5 times before failing. While the comment says this is intentional for "connection-related errors", it also means that, e.g., a server returning an unexpected error string will be retried. If the intent is safety-biased (retry when unsure), this is fine but worth calling out: a miscategorized permanent error will cause unnecessary retries and delay the failure response by up to ~31 seconds (1+2+4+8+16s).

Also, consider whether the string-matching approach is fragile against errors from different locales, wrapped errors, or upstream library changes. Using typed errors (e.g., `errors.As` for specific HTTP status codes) where possible would be more robust — though this may depend on what `mcp-go` exposes.

228-280: Backoff is correct after the fix — minor nit on duration multiplication.

The backoff logic is now correct: sleep on current backoff, then double. Line 273 uses `time.Duration(float64(backoff) * 2)` — this works, but `backoff * 2` is idiomatic Go since `time.Duration` supports integer multiplication and avoids the float64 round-trip.

Suggested simplification:

```diff
-	backoff = time.Duration(float64(backoff) * 2)
+	backoff *= 2
```
Force-pushed c09e4e1 → a1b748d
Force-pushed b287a86 → ce2d462
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@transports/bifrost-http/server/server.go`:
- Around line 671-685: The code causes duplicate model entries on repeated
ForceReloadPricing because ModelCatalog.AddModelDataToPool appends without
deduping and the "models added to catalog" log fires even when listing fails;
modify the reload path (after s.Config.ModelCatalog.ForceReloadPricing and
s.Client.ListAllModels) to only call AddModelDataToPool when ListAllModels
succeeds and either clear or replace the existing pool before adding (e.g., call
a clear/ReplaceModelPool method on ModelCatalog or implement dedupe in
AddModelDataToPool), and move the logger.Info("models added to catalog") inside
the successful branch so it only logs when models were actually added; reference
symbols: ForceReloadPricing, ListAllModels, AddModelDataToPool,
populateModelPoolFromPricingData.
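A sketch of the reload flow this comment suggests, with a hypothetical ReplaceModelPool method and simplified stand-in interfaces for ModelCatalog and the client (none of these signatures are confirmed against the repository):

```go
package server

import "fmt"

// modelCatalog and modelClient are simplified stand-ins for the real
// ModelCatalog and client types; ReplaceModelPool is the hypothetical
// dedupe-friendly method suggested in the comment.
type modelCatalog interface {
	ForceReloadPricing() error
	ReplaceModelPool(models []string)
}

type modelClient interface {
	ListAllModels() ([]string, error)
}

type infoLogger interface {
	Info(format string, args ...any)
}

// reloadModels only touches the pool (and only logs success) when
// listing succeeds, and replaces the pool instead of appending to it.
func reloadModels(catalog modelCatalog, client modelClient, log infoLogger) error {
	if err := catalog.ForceReloadPricing(); err != nil {
		return fmt.Errorf("reload pricing: %w", err)
	}
	models, err := client.ListAllModels()
	if err != nil {
		return fmt.Errorf("list models: %w", err)
	}
	catalog.ReplaceModelPool(models) // replace, don't append, to avoid duplicates
	log.Info("models added to catalog: %d", len(models))
	return nil
}
```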
🧹 Nitpick comments (1)
core/mcp/clientmanager.go (1)
587-624: Consider per-attempt timeout for Initialize retry.

Unlike the transport `Start()` retry (which correctly creates `perAttemptCtx` inside the closure at lines 544-549), the `Initialize()` retry shares a single `initCtx` across all attempts (created at line 594). If an early attempt consumes most of the 60s timeout, subsequent attempts will have insufficient time and fail with `context.DeadlineExceeded`, which `isTransientError` treats as permanent (non-retryable). This is a minor concern since initialization is typically fast, but for consistency with the transport retry pattern, consider moving `initCtx` creation inside the retry closure.

♻️ Suggested fix for consistency:

```diff
-	// For STDIO/SSE: Use a timeout context for initialization to prevent indefinite hangs
-	// The subprocess will continue running with the long-lived context
-	var initCtx context.Context
-	var initCancel context.CancelFunc
-
-	if config.ConnectionType == schemas.MCPConnectionTypeSSE || config.ConnectionType == schemas.MCPConnectionTypeSTDIO {
-		// Create timeout context for initialization phase only
-		initCtx, initCancel = context.WithTimeout(longLivedCtx, MCPClientConnectionEstablishTimeout)
-		defer initCancel()
-		m.logger.Debug("%s [%s] Initializing client with %v timeout...", MCPLogPrefix, config.Name, MCPClientConnectionEstablishTimeout)
-	} else {
-		// HTTP already has timeout
-		initCtx = ctx
-	}
-
 	// Initialize client with retry logic
 	initRetryConfig := DefaultRetryConfig
 	err = ExecuteWithRetry(
 		m.ctx,
 		func() error {
+			// Create per-attempt timeout context for initialization
+			var initCtx context.Context
+			var initCancel context.CancelFunc
+			if config.ConnectionType == schemas.MCPConnectionTypeSSE || config.ConnectionType == schemas.MCPConnectionTypeSTDIO {
+				initCtx, initCancel = context.WithTimeout(longLivedCtx, MCPClientConnectionEstablishTimeout)
+			} else {
+				initCtx, initCancel = context.WithTimeout(ctx, MCPClientConnectionEstablishTimeout)
+			}
+			defer initCancel()
 			_, initErr := externalClient.Initialize(initCtx, extInitRequest)
 			return initErr
 		},
 		initRetryConfig,
 		m.logger,
 	)
```
Force-pushed ce2d462 → 6d75121
Actionable comments posted: 1
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
core/mcp/utils.go (1)
284-343: ⚠️ Potential issue | 🟠 Major: Avoid reverse-mapping tool names to original MCP names.

This function still builds a sanitized→original mapping and retains original names for execution; the MCP tool system expects sanitized names end-to-end, so reverse mapping risks inconsistencies with `CallToolParams.Name` and `ExtraFields.ToolName`. Please drop the mapping and use sanitized names throughout. Based on learnings: "In MCP tool system under core/mcp/, tool names are sanitized by replacing '-' with '_' during discovery and this sanitized form is used throughout the system, including CallToolParams.Name and ExtraFields.ToolName. Do not reverse-sanitize or maintain mappings back to original names; rely on sanitized names in all MCP server calls and UI representations."
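For reference, the sanitization convention described in that learning amounts to a one-way transformation like the following; the helper name is illustrative, not the repository's actual function:

```go
package mcpretry

import "strings"

// sanitizeToolName applies the discovery-time convention described
// above: '-' becomes '_' and the sanitized form is used everywhere
// afterwards (no reverse mapping is kept).
func sanitizeToolName(name string) string {
	return strings.ReplaceAll(name, "-", "_")
}
```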
🧹 Nitpick comments (1)
core/mcp/clientmanager.go (1)
589-624: Minor: `initCtx` timeout is shared across all retry attempts.

The `initCtx` created at line 594 has a single 60s deadline shared across all `ExecuteWithRetry` attempts. If early attempts consume significant time, later attempts will have reduced timeout windows. This is likely acceptable since `Initialize()` is typically fast (protocol negotiation), but it differs from the per-attempt timeout pattern used for `Start()`. Consider creating `initCtx` inside the retry closure if consistent per-attempt behavior is desired.
Force-pushed b37d9c2 → d21585b
Merge activity
Force-pushed d21585b → 871d512
Force-pushed 871d512 → fda220f

Add retry logic to MCP client connections and tool retrieval
Adds robust exponential backoff retry logic to MCP client connections and tool retrieval operations. This improves resilience against transient network failures and temporary service unavailability.
Changes
Type of change
Affected areas
How to test
Test MCP client connections with intermittent network failures:
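The concrete test steps were not included above; as a hedged illustration, a test along these lines could simulate intermittent failures against the retry wrapper sketched earlier in this thread (all names are placeholders, not the repository's actual helpers):

```go
package mcpretry

import (
	"context"
	"errors"
	"testing"
	"time"
)

// TestRetrySucceedsAfterTransientFailures simulates a connection that
// fails twice with a transient error and then succeeds, and asserts
// that the retry wrapper eventually reports success.
func TestRetrySucceedsAfterTransientFailures(t *testing.T) {
	transient := errors.New("503 service unavailable")
	attempts := 0
	op := func() error {
		attempts++
		if attempts < 3 {
			return transient // first two attempts fail transiently
		}
		return nil // third attempt "connects"
	}

	err := executeWithRetry(
		context.Background(),
		op,
		RetryConfig{MaxRetries: 5, InitialBackoff: 10 * time.Millisecond, MaxBackoff: 100 * time.Millisecond},
		func(err error) bool { return true }, // treat everything as transient for the test
	)
	if err != nil {
		t.Fatalf("expected success after retries, got %v", err)
	}
	if attempts != 3 {
		t.Fatalf("expected 3 attempts, got %d", attempts)
	}
}
```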
Breaking changes
Security considerations
No additional security implications. The retry logic only applies to already authenticated connections and doesn't bypass any security checks.