Skip to content

feat(retry): complete bounded retry policy for LLM calls#866

Open
pikaxinge wants to merge 4 commits intosipeed:mainfrom
pikaxinge:feat/issue-629-complete-retry
Open

feat(retry): complete bounded retry policy for LLM calls#866
pikaxinge wants to merge 4 commits intosipeed:mainfrom
pikaxinge:feat/issue-629-complete-retry

Conversation

@pikaxinge
Copy link

@pikaxinge pikaxinge commented Feb 27, 2026

Summary

  • add a shared retry engine (pkg/utils/llm_retry.go) for LLM calls with bounded elapsed time, per-attempt timeouts, retry classification, and cancellable waits
  • use providers.ClassifyError for retry decisions (timeout/rate_limit), including 429 handling and Retry-After parsing support
  • add jittered backoff and parent-deadline-aware timeout capping to avoid double-timeout/burn-attempt behavior
  • integrate retry engine into both pkg/agent/loop.go and pkg/tools/toolloop.go
  • preserve context-window compression retry path in agent loop as a separate concern
  • replace implicit toolloop notice path with explicit RetryNotice callback in ToolLoopConfig
  • extend transient timeout classification (connection reset, tls handshake timeout) and add tests

Addresses #594 reviewer concerns

  1. Parent deadline vs per-attempt timeout: attempt timeout is capped by remaining parent deadline.
  2. Fragile retry classification: retry decisions use providers.ClassifyError instead of local ad-hoc status parsing.
  3. 429 handling: treated as retryable via classifier; Retry-After is parsed and honored when present.
  4. Toolloop notice path: now explicit callback (RetryNotice) instead of calling an implicit message tool.
  5. Backoff jitter: configurable jitter added to retry delays.
  6. Missing tests: added targeted tests for retry engine, toolloop retries, cancellation during backoff, and agent-loop transient/non-retryable behavior.

Test Plan

  • go test ./pkg/utils -run 'TestLLMRetry'
  • go test ./pkg/tools -run 'TestRunToolLoop_'
  • go test ./pkg/agent -run 'TestAgentLoop_TransientLLMErrorRetry|TestAgentLoop_NonRetryableLLMError_NoRetry|TestAgentLoop_ContextExhaustionRetry'
  • go test ./pkg/providers -run 'TestClassifyError_TimeoutPatterns'
  • go test ./pkg/...

Fixes #629

Copilot AI review requested due to automatic review settings February 27, 2026 10:05
@pikaxinge
Copy link
Author

Addressed the full review checklist from #594 in this PR (deadline-aware retries, 429 + Retry-After, jitter, explicit toolloop notice callback, and added coverage).

@nikolasdehor could you take a look when you have time? I’d value your feedback on whether this resolves the concerns you raised on #594.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a shared, bounded retry engine for LLM calls and integrates it into both the agent loop and the tool loop, using the central provider error classifier for retry decisions (including 429 + Retry-After handling).

Changes:

  • Added utils.DoWithRetry + RetryPolicy with bounded elapsed time, per-attempt timeouts, cancellable backoff sleeps, and jitter support.
  • Switched retry classification to providers.ClassifyError, expanded transient timeout pattern matching, and added tests.
  • Integrated the retry engine into pkg/agent and pkg/tools with explicit retry-notice hooks.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/utils/llm_retry.go New shared retry engine (policy, decision classification, backoff/jitter, Retry-After parsing, user-facing notice formatting).
pkg/utils/llm_retry_test.go Unit tests for retry classification, bounded attempt timeout, cancellation during backoff, jittered backoff, and 429 Retry-After behavior.
pkg/tools/toolloop.go Wraps toolloop provider Chat calls with bounded transient retries and exposes retry notices via config callback.
pkg/tools/toolloop_test.go Tests toolloop retry vs non-retry behavior and notice emission.
pkg/providers/openai_compat/provider.go Includes Retry-After response header in non-200 errors to support retry delay honoring upstream.
pkg/providers/error_classifier.go Expands timeout/transient classification patterns (e.g., connection reset, TLS handshake timeout, EOF).
pkg/providers/error_classifier_test.go Extends timeout-pattern tests for newly added timeout patterns.
pkg/agent/loop.go Integrates bounded transient retries into the agent LLM iteration path; refactors context-window error detection into helper.
pkg/agent/loop_test.go Adds tests verifying transient retry behavior and ensuring it does not trigger context compression; adds non-retryable path test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Jitter RetryJitterFunc
}

var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\s\r\n]+)`)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryAfterPattern only captures a single non-whitespace token after Retry-After, which breaks parsing for standard HTTP-date values (e.g. "Wed, 21 Oct 2015 07:28:00 GMT") because it would capture just "Wed,". This prevents honoring Retry-After when servers return a date. Consider capturing the full remainder of the line (up to \r/\n) and add a test case for an RFC1123 Retry-After date.

Suggested change
var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\s\r\n]+)`)
var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\r\n]+)`)

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 07e3f16: retryAfterPattern now captures the full line ([^\r\n]+) and I added TestLLMRetry_ExtractRetryAfter_HTTPDate to cover RFC1123 date parsing.

Comment on lines +247 to +252
func defaultJitter(max time.Duration) time.Duration {
if max <= 0 {
return 0
}
//nolint:gosec // Used only for retry backoff jitter.
n := rand.Int63n(int64(max) + 1)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultJitter uses the global math/rand generator without any seeding in the repo, so jitter will be deterministic across process starts (and across replicas), reducing the effectiveness of jitter for avoiding thundering-herd retries. Consider using a properly-seeded source (or math/rand/v2 if available) scoped to this package.

Suggested change
func defaultJitter(max time.Duration) time.Duration {
if max <= 0 {
return 0
}
//nolint:gosec // Used only for retry backoff jitter.
n := rand.Int63n(int64(max) + 1)
var jitterRand = rand.New(rand.NewSource(time.Now().UnixNano()))
func defaultJitter(max time.Duration) time.Duration {
if max <= 0 {
return 0
}
//nolint:gosec // Used only for retry backoff jitter.
n := jitterRand.Int63n(int64(max) + 1)

Copilot uses AI. Check for mistakes.
Comment on lines 39 to 48
timeoutPatterns = []errorPattern{
substr("timeout"),
substr("timed out"),
substr("deadline exceeded"),
substr("context deadline exceeded"),
substr("connection reset"),
substr("connection reset by peer"),
substr("tls handshake timeout"),
substr("eof"),
}
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeoutPatterns now includes "eof", but TestClassifyError_TimeoutPatterns doesn’t cover it (and currently also doesn’t cover the new "connection reset" variant). Adding test cases for these new patterns would prevent regressions in transient/timeout classification.

Copilot uses AI. Check for mistakes.
Comment on lines 236 to 246
timer := time.NewTimer(d)
defer timer.Stop()

select {
case <-ctx.Done():
return ctx.Err()
case <-timer.C:
return nil
}
}

Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

llm_retry.go introduces sleepWithContext, but pkg/utils/http_retry.go already has a package-private sleepWithCtx with the same behavior/signature. Since both are in package utils, consider reusing the existing helper (or consolidating to one name) to avoid duplicated cancellation-sleep logic.

Suggested change
timer := time.NewTimer(d)
defer timer.Stop()
select {
case <-ctx.Done():
return ctx.Err()
case <-timer.C:
return nil
}
}
return sleepWithCtx(ctx, d)
}

Copilot uses AI. Check for mistakes.
Copy link

@nikolasdehor nikolasdehor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a well-structured retry implementation that addresses the reviewer concerns from #594. Key observations:

Architecture: The retry engine in pkg/utils/llm_retry.go is generic (uses Go generics), testable (injectable sleep/jitter/notify functions), and properly separates concerns:

  • Context-window errors trigger compression (agent loop concern)
  • Transient errors trigger retry with backoff (shared infrastructure concern)

Correctness of retry classification: Using providers.ClassifyError instead of ad-hoc string matching is the right approach. The new timeout patterns (connection reset, tls handshake timeout, eof) are appropriate additions.

Parent deadline vs per-attempt timeout: boundedAttemptTimeout caps the attempt timeout to the remaining parent deadline, preventing the double-timeout/burned-attempt problem. This is correct.

429/Retry-After handling: The extractRetryAfter function parses both numeric seconds and HTTP-date formats from the error message. The regex approach works because the error message format is controlled by the provider code (which now includes Retry-After in the error string via the openai_compat change).

Jitter: Configurable with sensible 500ms default. Clamped to [0, MaxJitter] range.

Test coverage: Tests cover 429 with Retry-After, parent deadline not burning attempts, cancellation during backoff, jittered backoff bounds, and both toolloop and agent loop integration. Good coverage.

Minor observations (non-blocking):

  1. isContextWindowError uses broad string matching (e.g., "token" alone would match "invalid token" or "authentication token"). The narrower patterns like "token limit", "too many tokens" are better. Consider removing the invalidparameter fallback that matches standalone "token" -- it could false-positive on auth errors.

  2. The Retry-After regex in llm_retry.go parses from the error message string, which couples it to the error format in openai_compat/provider.go. If another provider formats the error differently, Retry-After parsing would silently fail. This is acceptable for now but worth a comment.

  3. sleepWithCtx is referenced but not defined in the diff -- assuming it is a pre-existing utility function.

  4. #857 appears to be an earlier iteration of this same work. Consider closing #857 if this one is the canonical version.

LGTM. This supersedes #857 and addresses the #594 feedback comprehensively.

…te-retry

# Conflicts:
#	pkg/agent/loop_test.go
Copilot AI review requested due to automatic review settings February 28, 2026 04:42
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if existingNotify != nil {
existingNotify(notice)
}
if config.RetryNotice != nil && channel != "" && chatID != "" {
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RetryNotice is only invoked when channel and chatID are non-empty. Since the callback is already optional and can decide what to do, this conditional can unexpectedly suppress notices for callers that intentionally pass empty identifiers (e.g., CLI/internal runs). Consider calling config.RetryNotice(...) whenever it is non-nil, and let the implementation decide whether to ignore based on channel/chatID.

Suggested change
if config.RetryNotice != nil && channel != "" && chatID != "" {
if config.RetryNotice != nil {

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f8720e1: RetryNotice callback is now invoked whenever config.RetryNotice is non-nil, without requiring non-empty channel/chatID.

if attemptTimeout := policy.AttemptTimeouts[attempt]; attemptTimeout > 0 {
timeout, ok := boundedAttemptTimeout(runCtx, attemptTimeout)
if !ok {
return zero, runCtx.Err()
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the !ok path, DoWithRetry returns runCtx.Err(). When boundedAttemptTimeout returns ok=false because time.Until(deadline) <= 0, it's possible for runCtx.Err() to still be nil briefly, which would cause DoWithRetry to return a nil error with the zero value result. Consider returning context.DeadlineExceeded (or ctx.Err() with a fallback) when ok is false to avoid a nil error return.

Suggested change
return zero, runCtx.Err()
if err := runCtx.Err(); err != nil {
return zero, err
}
return zero, context.DeadlineExceeded

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f8720e1: when boundedAttemptTimeout returns ok=false and runCtx.Err() is nil, DoWithRetry now returns context.DeadlineExceeded instead of nil.

Comment on lines +235 to +241
func defaultJitter(max time.Duration) time.Duration {
if max <= 0 {
return 0
}
//nolint:gosec // Used only for retry backoff jitter.
n := rand.Int63n(int64(max) + 1)
return time.Duration(n)
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultJitter uses the global math/rand source without seeding, so jitter will be deterministic across process starts. For retry backoff jitter this can lead to synchronized retries after restarts. Consider using a locally-seeded RNG (e.g., rand.New(rand.NewSource(time.Now().UnixNano()))) or switching to crypto/rand for the default jitter implementation.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept current implementation intentionally: this repo targets Go 1.25.7, and in Go >=1.20 the default math/rand generator is auto-seeded at startup. We are using the package-level concurrent-safe API (rand.Int63n) and not calling Seed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Didn't retry if meet LLM call failed

3 participants