feat(retry): complete bounded retry policy for LLM calls#866
feat(retry): complete bounded retry policy for LLM calls#866pikaxinge wants to merge 4 commits intosipeed:mainfrom
Conversation
|
Addressed the full review checklist from #594 in this PR (deadline-aware retries, 429 + Retry-After, jitter, explicit toolloop notice callback, and added coverage). @nikolasdehor could you take a look when you have time? I’d value your feedback on whether this resolves the concerns you raised on #594. |
There was a problem hiding this comment.
Pull request overview
This PR introduces a shared, bounded retry engine for LLM calls and integrates it into both the agent loop and the tool loop, using the central provider error classifier for retry decisions (including 429 + Retry-After handling).
Changes:
- Added
utils.DoWithRetry+RetryPolicywith bounded elapsed time, per-attempt timeouts, cancellable backoff sleeps, and jitter support. - Switched retry classification to
providers.ClassifyError, expanded transient timeout pattern matching, and added tests. - Integrated the retry engine into
pkg/agentandpkg/toolswith explicit retry-notice hooks.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/utils/llm_retry.go | New shared retry engine (policy, decision classification, backoff/jitter, Retry-After parsing, user-facing notice formatting). |
| pkg/utils/llm_retry_test.go | Unit tests for retry classification, bounded attempt timeout, cancellation during backoff, jittered backoff, and 429 Retry-After behavior. |
| pkg/tools/toolloop.go | Wraps toolloop provider Chat calls with bounded transient retries and exposes retry notices via config callback. |
| pkg/tools/toolloop_test.go | Tests toolloop retry vs non-retry behavior and notice emission. |
| pkg/providers/openai_compat/provider.go | Includes Retry-After response header in non-200 errors to support retry delay honoring upstream. |
| pkg/providers/error_classifier.go | Expands timeout/transient classification patterns (e.g., connection reset, TLS handshake timeout, EOF). |
| pkg/providers/error_classifier_test.go | Extends timeout-pattern tests for newly added timeout patterns. |
| pkg/agent/loop.go | Integrates bounded transient retries into the agent LLM iteration path; refactors context-window error detection into helper. |
| pkg/agent/loop_test.go | Adds tests verifying transient retry behavior and ensuring it does not trigger context compression; adds non-retryable path test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/utils/llm_retry.go
Outdated
| Jitter RetryJitterFunc | ||
| } | ||
|
|
||
| var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\s\r\n]+)`) |
There was a problem hiding this comment.
retryAfterPattern only captures a single non-whitespace token after Retry-After, which breaks parsing for standard HTTP-date values (e.g. "Wed, 21 Oct 2015 07:28:00 GMT") because it would capture just "Wed,". This prevents honoring Retry-After when servers return a date. Consider capturing the full remainder of the line (up to \r/\n) and add a test case for an RFC1123 Retry-After date.
| var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\s\r\n]+)`) | |
| var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\r\n]+)`) |
There was a problem hiding this comment.
Fixed in 07e3f16: retryAfterPattern now captures the full line ([^\r\n]+) and I added TestLLMRetry_ExtractRetryAfter_HTTPDate to cover RFC1123 date parsing.
| func defaultJitter(max time.Duration) time.Duration { | ||
| if max <= 0 { | ||
| return 0 | ||
| } | ||
| //nolint:gosec // Used only for retry backoff jitter. | ||
| n := rand.Int63n(int64(max) + 1) |
There was a problem hiding this comment.
defaultJitter uses the global math/rand generator without any seeding in the repo, so jitter will be deterministic across process starts (and across replicas), reducing the effectiveness of jitter for avoiding thundering-herd retries. Consider using a properly-seeded source (or math/rand/v2 if available) scoped to this package.
| func defaultJitter(max time.Duration) time.Duration { | |
| if max <= 0 { | |
| return 0 | |
| } | |
| //nolint:gosec // Used only for retry backoff jitter. | |
| n := rand.Int63n(int64(max) + 1) | |
| var jitterRand = rand.New(rand.NewSource(time.Now().UnixNano())) | |
| func defaultJitter(max time.Duration) time.Duration { | |
| if max <= 0 { | |
| return 0 | |
| } | |
| //nolint:gosec // Used only for retry backoff jitter. | |
| n := jitterRand.Int63n(int64(max) + 1) |
| timeoutPatterns = []errorPattern{ | ||
| substr("timeout"), | ||
| substr("timed out"), | ||
| substr("deadline exceeded"), | ||
| substr("context deadline exceeded"), | ||
| substr("connection reset"), | ||
| substr("connection reset by peer"), | ||
| substr("tls handshake timeout"), | ||
| substr("eof"), | ||
| } |
There was a problem hiding this comment.
timeoutPatterns now includes "eof", but TestClassifyError_TimeoutPatterns doesn’t cover it (and currently also doesn’t cover the new "connection reset" variant). Adding test cases for these new patterns would prevent regressions in transient/timeout classification.
pkg/utils/llm_retry.go
Outdated
| timer := time.NewTimer(d) | ||
| defer timer.Stop() | ||
|
|
||
| select { | ||
| case <-ctx.Done(): | ||
| return ctx.Err() | ||
| case <-timer.C: | ||
| return nil | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
llm_retry.go introduces sleepWithContext, but pkg/utils/http_retry.go already has a package-private sleepWithCtx with the same behavior/signature. Since both are in package utils, consider reusing the existing helper (or consolidating to one name) to avoid duplicated cancellation-sleep logic.
| timer := time.NewTimer(d) | |
| defer timer.Stop() | |
| select { | |
| case <-ctx.Done(): | |
| return ctx.Err() | |
| case <-timer.C: | |
| return nil | |
| } | |
| } | |
| return sleepWithCtx(ctx, d) | |
| } |
nikolasdehor
left a comment
There was a problem hiding this comment.
This is a well-structured retry implementation that addresses the reviewer concerns from #594. Key observations:
Architecture: The retry engine in pkg/utils/llm_retry.go is generic (uses Go generics), testable (injectable sleep/jitter/notify functions), and properly separates concerns:
- Context-window errors trigger compression (agent loop concern)
- Transient errors trigger retry with backoff (shared infrastructure concern)
Correctness of retry classification: Using providers.ClassifyError instead of ad-hoc string matching is the right approach. The new timeout patterns (connection reset, tls handshake timeout, eof) are appropriate additions.
Parent deadline vs per-attempt timeout: boundedAttemptTimeout caps the attempt timeout to the remaining parent deadline, preventing the double-timeout/burned-attempt problem. This is correct.
429/Retry-After handling: The extractRetryAfter function parses both numeric seconds and HTTP-date formats from the error message. The regex approach works because the error message format is controlled by the provider code (which now includes Retry-After in the error string via the openai_compat change).
Jitter: Configurable with sensible 500ms default. Clamped to [0, MaxJitter] range.
Test coverage: Tests cover 429 with Retry-After, parent deadline not burning attempts, cancellation during backoff, jittered backoff bounds, and both toolloop and agent loop integration. Good coverage.
Minor observations (non-blocking):
-
isContextWindowErroruses broad string matching (e.g.,"token"alone would match "invalid token" or "authentication token"). The narrower patterns like"token limit","too many tokens"are better. Consider removing theinvalidparameterfallback that matches standalone"token"-- it could false-positive on auth errors. -
The
Retry-Afterregex inllm_retry.goparses from the error message string, which couples it to the error format inopenai_compat/provider.go. If another provider formats the error differently, Retry-After parsing would silently fail. This is acceptable for now but worth a comment. -
sleepWithCtxis referenced but not defined in the diff -- assuming it is a pre-existing utility function. -
#857 appears to be an earlier iteration of this same work. Consider closing #857 if this one is the canonical version.
LGTM. This supersedes #857 and addresses the #594 feedback comprehensively.
…te-retry # Conflicts: # pkg/agent/loop_test.go
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/tools/toolloop.go
Outdated
| if existingNotify != nil { | ||
| existingNotify(notice) | ||
| } | ||
| if config.RetryNotice != nil && channel != "" && chatID != "" { |
There was a problem hiding this comment.
RetryNotice is only invoked when channel and chatID are non-empty. Since the callback is already optional and can decide what to do, this conditional can unexpectedly suppress notices for callers that intentionally pass empty identifiers (e.g., CLI/internal runs). Consider calling config.RetryNotice(...) whenever it is non-nil, and let the implementation decide whether to ignore based on channel/chatID.
| if config.RetryNotice != nil && channel != "" && chatID != "" { | |
| if config.RetryNotice != nil { |
There was a problem hiding this comment.
Fixed in f8720e1: RetryNotice callback is now invoked whenever config.RetryNotice is non-nil, without requiring non-empty channel/chatID.
pkg/utils/llm_retry.go
Outdated
| if attemptTimeout := policy.AttemptTimeouts[attempt]; attemptTimeout > 0 { | ||
| timeout, ok := boundedAttemptTimeout(runCtx, attemptTimeout) | ||
| if !ok { | ||
| return zero, runCtx.Err() |
There was a problem hiding this comment.
In the !ok path, DoWithRetry returns runCtx.Err(). When boundedAttemptTimeout returns ok=false because time.Until(deadline) <= 0, it's possible for runCtx.Err() to still be nil briefly, which would cause DoWithRetry to return a nil error with the zero value result. Consider returning context.DeadlineExceeded (or ctx.Err() with a fallback) when ok is false to avoid a nil error return.
| return zero, runCtx.Err() | |
| if err := runCtx.Err(); err != nil { | |
| return zero, err | |
| } | |
| return zero, context.DeadlineExceeded |
There was a problem hiding this comment.
Fixed in f8720e1: when boundedAttemptTimeout returns ok=false and runCtx.Err() is nil, DoWithRetry now returns context.DeadlineExceeded instead of nil.
| func defaultJitter(max time.Duration) time.Duration { | ||
| if max <= 0 { | ||
| return 0 | ||
| } | ||
| //nolint:gosec // Used only for retry backoff jitter. | ||
| n := rand.Int63n(int64(max) + 1) | ||
| return time.Duration(n) |
There was a problem hiding this comment.
defaultJitter uses the global math/rand source without seeding, so jitter will be deterministic across process starts. For retry backoff jitter this can lead to synchronized retries after restarts. Consider using a locally-seeded RNG (e.g., rand.New(rand.NewSource(time.Now().UnixNano()))) or switching to crypto/rand for the default jitter implementation.
There was a problem hiding this comment.
Kept current implementation intentionally: this repo targets Go 1.25.7, and in Go >=1.20 the default math/rand generator is auto-seeded at startup. We are using the package-level concurrent-safe API (rand.Int63n) and not calling Seed.
Summary
pkg/utils/llm_retry.go) for LLM calls with bounded elapsed time, per-attempt timeouts, retry classification, and cancellable waitsproviders.ClassifyErrorfor retry decisions (timeout/rate_limit), including 429 handling and Retry-After parsing supportpkg/agent/loop.goandpkg/tools/toolloop.goRetryNoticecallback inToolLoopConfigconnection reset,tls handshake timeout) and add testsAddresses #594 reviewer concerns
providers.ClassifyErrorinstead of local ad-hoc status parsing.RetryNotice) instead of calling an implicitmessagetool.Test Plan
go test ./pkg/utils -run 'TestLLMRetry'go test ./pkg/tools -run 'TestRunToolLoop_'go test ./pkg/agent -run 'TestAgentLoop_TransientLLMErrorRetry|TestAgentLoop_NonRetryableLLMError_NoRetry|TestAgentLoop_ContextExhaustionRetry'go test ./pkg/providers -run 'TestClassifyError_TimeoutPatterns'go test ./pkg/...Fixes #629