feat(retry): complete bounded retry policy for LLM calls by pikaxinge · Pull Request #866 · sipeed/picoclaw

pikaxinge · 2026-02-27T10:05:41Z

Summary

add a shared retry engine (pkg/utils/llm_retry.go) for LLM calls with bounded elapsed time, per-attempt timeouts, retry classification, and cancellable waits
use providers.ClassifyError for retry decisions (timeout/rate_limit), including 429 handling and Retry-After parsing support
add jittered backoff and parent-deadline-aware timeout capping to avoid double-timeout/burn-attempt behavior
integrate retry engine into both pkg/agent/loop.go and pkg/tools/toolloop.go
preserve context-window compression retry path in agent loop as a separate concern
replace implicit toolloop notice path with explicit RetryNotice callback in ToolLoopConfig
extend transient timeout classification (connection reset, tls handshake timeout) and add tests

Addresses #594 reviewer concerns

Parent deadline vs per-attempt timeout: attempt timeout is capped by remaining parent deadline.
Fragile retry classification: retry decisions use providers.ClassifyError instead of local ad-hoc status parsing.
429 handling: treated as retryable via classifier; Retry-After is parsed and honored when present.
Toolloop notice path: now explicit callback (RetryNotice) instead of calling an implicit message tool.
Backoff jitter: configurable jitter added to retry delays.
Missing tests: added targeted tests for retry engine, toolloop retries, cancellation during backoff, and agent-loop transient/non-retryable behavior.

Test Plan

go test ./pkg/utils -run 'TestLLMRetry'
go test ./pkg/tools -run 'TestRunToolLoop_'
go test ./pkg/agent -run 'TestAgentLoop_TransientLLMErrorRetry|TestAgentLoop_NonRetryableLLMError_NoRetry|TestAgentLoop_ContextExhaustionRetry'
go test ./pkg/providers -run 'TestClassifyError_TimeoutPatterns'
go test ./pkg/...

Fixes #629

pikaxinge · 2026-02-27T10:06:29Z

Addressed the full review checklist from #594 in this PR (deadline-aware retries, 429 + Retry-After, jitter, explicit toolloop notice callback, and added coverage).

@nikolasdehor could you take a look when you have time? I’d value your feedback on whether this resolves the concerns you raised on #594.

Copilot

Pull request overview

This PR introduces a shared, bounded retry engine for LLM calls and integrates it into both the agent loop and the tool loop, using the central provider error classifier for retry decisions (including 429 + Retry-After handling).

Changes:

Added utils.DoWithRetry + RetryPolicy with bounded elapsed time, per-attempt timeouts, cancellable backoff sleeps, and jitter support.
Switched retry classification to providers.ClassifyError, expanded transient timeout pattern matching, and added tests.
Integrated the retry engine into pkg/agent and pkg/tools with explicit retry-notice hooks.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
pkg/utils/llm_retry.go	New shared retry engine (policy, decision classification, backoff/jitter, Retry-After parsing, user-facing notice formatting).
pkg/utils/llm_retry_test.go	Unit tests for retry classification, bounded attempt timeout, cancellation during backoff, jittered backoff, and 429 Retry-After behavior.
pkg/tools/toolloop.go	Wraps toolloop provider `Chat` calls with bounded transient retries and exposes retry notices via config callback.
pkg/tools/toolloop_test.go	Tests toolloop retry vs non-retry behavior and notice emission.
pkg/providers/openai_compat/provider.go	Includes `Retry-After` response header in non-200 errors to support retry delay honoring upstream.
pkg/providers/error_classifier.go	Expands timeout/transient classification patterns (e.g., connection reset, TLS handshake timeout, EOF).
pkg/providers/error_classifier_test.go	Extends timeout-pattern tests for newly added timeout patterns.
pkg/agent/loop.go	Integrates bounded transient retries into the agent LLM iteration path; refactors context-window error detection into helper.
pkg/agent/loop_test.go	Adds tests verifying transient retry behavior and ensuring it does not trigger context compression; adds non-retryable path test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-27T10:11:06Z

pkg/utils/llm_retry.go

+	Jitter          RetryJitterFunc
+}
+
+var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\s\r\n]+)`)


retryAfterPattern only captures a single non-whitespace token after Retry-After, which breaks parsing for standard HTTP-date values (e.g. "Wed, 21 Oct 2015 07:28:00 GMT") because it would capture just "Wed,". This prevents honoring Retry-After when servers return a date. Consider capturing the full remainder of the line (up to \r/\n) and add a test case for an RFC1123 Retry-After date.

Suggested change

var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\s\r\n]+)`)

var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\r\n]+)`)

Fixed in 07e3f16: retryAfterPattern now captures the full line ([^\r\n]+) and I added TestLLMRetry_ExtractRetryAfter_HTTPDate to cover RFC1123 date parsing.

Copilot · 2026-02-27T10:11:07Z

pkg/utils/llm_retry.go

+func defaultJitter(max time.Duration) time.Duration {
+	if max <= 0 {
+		return 0
+	}
+	//nolint:gosec // Used only for retry backoff jitter.
+	n := rand.Int63n(int64(max) + 1)


defaultJitter uses the global math/rand generator without any seeding in the repo, so jitter will be deterministic across process starts (and across replicas), reducing the effectiveness of jitter for avoiding thundering-herd retries. Consider using a properly-seeded source (or math/rand/v2 if available) scoped to this package.

Suggested change

func defaultJitter(max time.Duration) time.Duration {

if max <= 0 {

return 0

}

//nolint:gosec // Used only for retry backoff jitter.

n := rand.Int63n(int64(max) + 1)

var jitterRand = rand.New(rand.NewSource(time.Now().UnixNano()))

func defaultJitter(max time.Duration) time.Duration {

if max <= 0 {

return 0

}

//nolint:gosec // Used only for retry backoff jitter.

n := jitterRand.Int63n(int64(max) + 1)

Copilot · 2026-02-27T10:11:07Z

pkg/providers/error_classifier.go

 	timeoutPatterns = []errorPattern{
 		substr("timeout"),
 		substr("timed out"),
 		substr("deadline exceeded"),
 		substr("context deadline exceeded"),
+		substr("connection reset"),
+		substr("connection reset by peer"),
+		substr("tls handshake timeout"),
+		substr("eof"),
 	}


timeoutPatterns now includes "eof", but TestClassifyError_TimeoutPatterns doesn’t cover it (and currently also doesn’t cover the new "connection reset" variant). Adding test cases for these new patterns would prevent regressions in transient/timeout classification.

Copilot · 2026-02-27T10:11:07Z

pkg/utils/llm_retry.go

+	timer := time.NewTimer(d)
+	defer timer.Stop()
+
+	select {
+	case <-ctx.Done():
+		return ctx.Err()
+	case <-timer.C:
+		return nil
+	}
+}
+


llm_retry.go introduces sleepWithContext, but pkg/utils/http_retry.go already has a package-private sleepWithCtx with the same behavior/signature. Since both are in package utils, consider reusing the existing helper (or consolidating to one name) to avoid duplicated cancellation-sleep logic.

Suggested change

timer := time.NewTimer(d)

defer timer.Stop()

select {

case <-ctx.Done():

return ctx.Err()

case <-timer.C:

return nil

}

}

return sleepWithCtx(ctx, d)

}

nikolasdehor

This is a well-structured retry implementation that addresses the reviewer concerns from #594. Key observations:

Architecture: The retry engine in pkg/utils/llm_retry.go is generic (uses Go generics), testable (injectable sleep/jitter/notify functions), and properly separates concerns:

Context-window errors trigger compression (agent loop concern)
Transient errors trigger retry with backoff (shared infrastructure concern)

Correctness of retry classification: Using providers.ClassifyError instead of ad-hoc string matching is the right approach. The new timeout patterns (connection reset, tls handshake timeout, eof) are appropriate additions.

Parent deadline vs per-attempt timeout: boundedAttemptTimeout caps the attempt timeout to the remaining parent deadline, preventing the double-timeout/burned-attempt problem. This is correct.

429/Retry-After handling: The extractRetryAfter function parses both numeric seconds and HTTP-date formats from the error message. The regex approach works because the error message format is controlled by the provider code (which now includes Retry-After in the error string via the openai_compat change).

Jitter: Configurable with sensible 500ms default. Clamped to [0, MaxJitter] range.

Test coverage: Tests cover 429 with Retry-After, parent deadline not burning attempts, cancellation during backoff, jittered backoff bounds, and both toolloop and agent loop integration. Good coverage.

Minor observations (non-blocking):

isContextWindowError uses broad string matching (e.g., "token" alone would match "invalid token" or "authentication token"). The narrower patterns like "token limit", "too many tokens" are better. Consider removing the invalidparameter fallback that matches standalone "token" -- it could false-positive on auth errors.
The Retry-After regex in llm_retry.go parses from the error message string, which couples it to the error format in openai_compat/provider.go. If another provider formats the error differently, Retry-After parsing would silently fail. This is acceptable for now but worth a comment.
sleepWithCtx is referenced but not defined in the diff -- assuming it is a pre-existing utility function.
#857 appears to be an earlier iteration of this same work. Consider closing #857 if this one is the canonical version.

LGTM. This supersedes #857 and addresses the #594 feedback comprehensively.

…te-retry # Conflicts: # pkg/agent/loop_test.go

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-28T04:46:13Z

pkg/tools/toolloop.go

+			if existingNotify != nil {
+				existingNotify(notice)
+			}
+			if config.RetryNotice != nil && channel != "" && chatID != "" {


RetryNotice is only invoked when channel and chatID are non-empty. Since the callback is already optional and can decide what to do, this conditional can unexpectedly suppress notices for callers that intentionally pass empty identifiers (e.g., CLI/internal runs). Consider calling config.RetryNotice(...) whenever it is non-nil, and let the implementation decide whether to ignore based on channel/chatID.

Suggested change

if config.RetryNotice != nil && channel != "" && chatID != "" {

if config.RetryNotice != nil {

Fixed in f8720e1: RetryNotice callback is now invoked whenever config.RetryNotice is non-nil, without requiring non-empty channel/chatID.

Copilot · 2026-02-28T04:46:14Z

pkg/utils/llm_retry.go

+		if attemptTimeout := policy.AttemptTimeouts[attempt]; attemptTimeout > 0 {
+			timeout, ok := boundedAttemptTimeout(runCtx, attemptTimeout)
+			if !ok {
+				return zero, runCtx.Err()


In the !ok path, DoWithRetry returns runCtx.Err(). When boundedAttemptTimeout returns ok=false because time.Until(deadline) <= 0, it's possible for runCtx.Err() to still be nil briefly, which would cause DoWithRetry to return a nil error with the zero value result. Consider returning context.DeadlineExceeded (or ctx.Err() with a fallback) when ok is false to avoid a nil error return.

Suggested change

return zero, runCtx.Err()

if err := runCtx.Err(); err != nil {

return zero, err

}

return zero, context.DeadlineExceeded

Fixed in f8720e1: when boundedAttemptTimeout returns ok=false and runCtx.Err() is nil, DoWithRetry now returns context.DeadlineExceeded instead of nil.

Copilot · 2026-02-28T04:46:14Z

pkg/utils/llm_retry.go

+func defaultJitter(max time.Duration) time.Duration {
+	if max <= 0 {
+		return 0
+	}
+	//nolint:gosec // Used only for retry backoff jitter.
+	n := rand.Int63n(int64(max) + 1)
+	return time.Duration(n)


defaultJitter uses the global math/rand source without seeding, so jitter will be deterministic across process starts. For retry backoff jitter this can lead to synchronized retries after restarts. Consider using a locally-seeded RNG (e.g., rand.New(rand.NewSource(time.Now().UnixNano()))) or switching to crypto/rand for the default jitter implementation.

Kept current implementation intentionally: this repo targets Go 1.25.7, and in Go >=1.20 the default math/rand generator is auto-seeded at startup. We are using the package-level concurrent-safe API (rand.Int63n) and not calling Seed.

feat(retry): add complete bounded LLM retry handling (sipeed#629)

e171d77

Copilot AI review requested due to automatic review settings February 27, 2026 10:05

Copilot started reviewing on behalf of pikaxinge February 27, 2026 10:06 View session

Copilot AI reviewed Feb 27, 2026

View reviewed changes

fix(retry): handle Retry-After date and align review nits

07e3f16

nikolasdehor mentioned this pull request Feb 28, 2026

agent: retry transient LLM failures in loop #857

Closed

nikolasdehor approved these changes Feb 28, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into feat/issue-629-comple…

cb1d727

…te-retry # Conflicts: # pkg/agent/loop_test.go

Copilot AI review requested due to automatic review settings February 28, 2026 04:42

Copilot started reviewing on behalf of pikaxinge February 28, 2026 04:42 View session

Copilot AI reviewed Feb 28, 2026

View reviewed changes

fix(retry): address review edge cases in notices and deadline handling

f8720e1

	var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\s\r\n]+)`)
	var retryAfterPattern = regexp.MustCompile(`(?i)retry[- ]after[:=]?\s*([^\r\n]+)`)

	if config.RetryNotice != nil && channel != "" && chatID != "" {
	if config.RetryNotice != nil {

-				return zero, runCtx.Err()
+				if err := runCtx.Err(); err != nil {
+					return zero, err
+				}
+				return zero, context.DeadlineExceeded

Conversation

pikaxinge commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Addresses #594 reviewer concerns

Test Plan

Uh oh!

pikaxinge commented Feb 27, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

pikaxinge Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

nikolasdehor left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

pikaxinge Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

pikaxinge Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

pikaxinge Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pikaxinge commented Feb 27, 2026 •

edited

Loading