
Daily Perf Improver - Add jitter to retry backoff for improved API reliability #295

Open
github-actions[bot] wants to merge 7 commits into main from perf/add-jitter-to-retry-backoff-336fbb90ca0980f2

Conversation

@github-actions

Goal and Rationale

Performance target: Improve retry reliability under API failures by preventing the thundering herd problem

Why it matters: When multiple clients experience simultaneous failures (e.g., API outage), synchronized retries create load spikes that can overwhelm a recovering server. This is a critical concern for rate-limited APIs like Control D.

Maintainer priority: Directly addresses feedback from discussion #219: "exponential backoff with jitter" and "API rate limits are non-negotiable"

Approach

Implemented ±50% jitter on retry delays using a standard jittered exponential backoff formula:

wait_time = (base_delay * 2^attempt) * (0.5 + random())

This spreads retries across a time window while maintaining exponential backoff behavior.

Example: A 4-second base delay becomes a 2-6 second range, preventing simultaneous retries from 100 clients at exactly t=4s.
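
A minimal Python sketch of that calculation (illustrative only; the function and variable names are assumptions, not the exact code in main.py):

import random

def jittered_backoff(base_delay, attempt):
    """Exponential backoff with +/-50% jitter, matching the formula above."""
    base_wait = base_delay * (2 ** attempt)   # exponential growth: 1s, 2s, 4s, ...
    jitter_factor = 0.5 + random.random()     # uniform in [0.5, 1.5)
    return base_wait * jitter_factor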

Implementation

Changes made:

  1. Added random module import
  2. Modified _retry_request() to apply a jitter factor in [0.5, 1.5] to the exponential backoff (see the sketch after this list)
  3. Updated log format to show actual jittered delay with 2 decimal precision
  4. Created comprehensive test suite (7 test cases)
  5. Added API retry strategy guide for future development
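
A sketch of how the modified retry helper could look with these changes applied. This is a hedged reconstruction, not the shipped code: only the signature used by the tests (request_func, max_retries, delay) and the jitter math are taken from this PR; the logging call, exception handling, and exact control flow are assumptions.

import logging
import random
import time

logger = logging.getLogger(__name__)

def _retry_request(request_func, max_retries=5, delay=1):
    """Call request_func() until it succeeds or max_retries is reached.

    Retries use exponential backoff with +/-50% jitter; 4xx responses
    (other than 429) fail fast.
    """
    response = None
    for attempt in range(max_retries):
        response = request_func()

        if response.status_code < 400:
            return response

        # Fail fast on client errors, except 429 (rate limiting), which is retryable.
        if 400 <= response.status_code < 500 and response.status_code != 429:
            response.raise_for_status()

        if attempt < max_retries - 1:
            # Jitter: multiply by random factor in [0.5, 1.5] to spread retries
            base_wait = delay * (2 ** attempt)
            jitter_factor = 0.5 + random.random()
            wait_time = base_wait * jitter_factor
            logger.warning("Request failed (%s), retrying in %.2fs", response.status_code, wait_time)
            time.sleep(wait_time)

    # All attempts exhausted: surface the last failure.
    response.raise_for_status()
    return response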

Code quality:

  • Minimal changes: 11 lines modified in core retry logic
  • Preserves all existing behavior: 4xx fail-fast, max retries, exponential growth
  • Backward compatible: no API changes

Impact Measurement

Synthetic Benchmark Results

Run python3 benchmark_retry_jitter.py to see the demonstration:

Without jitter (old):

  • All 100 clients retry at t=1s → server receives 100 simultaneous requests
  • All 100 clients retry at t=2s → server receives 100 simultaneous requests
  • Predictable load spikes during recovery

With jitter (new):

  • 100 clients spread across t=0.5s to t=1.5s (1-second window)
  • Reduced peak concurrent load on server
  • Retries distributed over time, improving recovery success rate
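
A minimal simulation of the spread described above (an illustrative stand-in, not the actual benchmark_retry_jitter.py shipped in this PR):

import random
from collections import Counter

def peak_concurrent_retries(num_clients=100, base_delay=1.0, jitter=True):
    """Bucket each client's first retry time into 0.1s bins and return the largest bin."""
    retry_times = []
    for _ in range(num_clients):
        factor = (0.5 + random.random()) if jitter else 1.0
        retry_times.append(base_delay * factor)
    bins = Counter(round(t, 1) for t in retry_times)
    return max(bins.values())

print("peak without jitter:", peak_concurrent_retries(jitter=False))  # all 100 clients in one bin
print("peak with jitter:   ", peak_concurrent_retries(jitter=True))   # typically 10-20 per bin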

Performance Overhead

  • Zero overhead on successful requests (jitter only applies to retry path)
  • Microseconds per retry (random.random() is ~1µs)
  • Negligible compared to network I/O (typical retry delay: 1-16 seconds)
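
A quick way to sanity-check the overhead claim with the standard library (illustrative; not part of the changed files):

import timeit

# One million jitter computations; each takes a small fraction of a microsecond,
# which is negligible next to retry delays measured in seconds.
per_call = timeit.timeit("0.5 + random.random()", setup="import random", number=1_000_000) / 1_000_000
print(f"{per_call * 1e9:.0f} ns per jitter computation")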

Reliability Impact

Before:

  • Thundering herd during API outages
  • All clients retry simultaneously → cascading failures
  • Higher 429 rate limit errors during recovery

After:

  • Distributed retry timing
  • Reduced server load spikes
  • Better API recovery outcomes

Trade-offs

Complexity: Minimal - added one random multiplication
Maintainability: Improved - added comprehensive documentation and tests
Determinism: Retries are now randomized, but within predictable bounds

Validation

Test Coverage

Added tests/test_retry_jitter.py with 7 test cases:

  1. ✅ Jitter adds randomness (verify delays differ across runs)
  2. ✅ Jitter stays within bounds [0.5x, 1.5x base delay]
  3. ✅ Exponential backoff still increases despite jitter
  4. ✅ 4xx errors still fail fast (no retries)
  5. ✅ 429 rate limits retry with jitter
  6. ✅ Successful retry after transient failures
  7. ✅ Max retries limit respected
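
As an illustration of the style used for case 2, a bounds check might look roughly like this (a sketch assuming the helper is importable as main._retry_request and that main uses time.sleep for delays; the shipped tests may differ):

from unittest.mock import MagicMock, patch

import main

def test_jitter_stays_within_bounds():
    """Every sleep should fall within [0.5, 1.5] x the exponential base delay."""
    request_func = MagicMock(return_value=MagicMock(status_code=500))

    with patch("main.time.sleep") as mock_sleep:
        try:
            main._retry_request(request_func, max_retries=4, delay=1)
        except Exception:
            pass  # all attempts fail; only the recorded sleep values matter here

    assert mock_sleep.call_count == 3
    for attempt, call in enumerate(mock_sleep.call_args_list):
        base_wait = 1 * (2 ** attempt)
        assert 0.5 * base_wait <= call.args[0] <= 1.5 * base_wait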

Testing approach:

# Run jitter-specific tests
pytest tests/test_retry_jitter.py -v

# Run full test suite to ensure no regressions
pytest tests/ -n auto -v

Reproducibility

Quick validation:

# Demonstrate jitter behavior visually
python3 benchmark_retry_jitter.py

# Expected output shows:
# - WITHOUT JITTER: deterministic delays (1s, 2s, 4s, 8s)
# - WITH JITTER: randomized delays within bounds (e.g., 1.26s, 2.67s, 2.52s, 4.58s)

Integration test:

# The existing sync workflow will automatically use jittered retries
# No configuration changes required
python3 main.py --dry-run

Future Work

Based on API retry strategy guide (.github/copilot/instructions/api-retry-strategy.md):

  1. Rate limit header parsing: Read Retry-After from 429 responses
  2. Circuit breaker: Stop retrying after consecutive failures
  3. Per-endpoint strategies: Different backoff for read vs. write operations
  4. Max backoff cap: Prevent indefinite delays on later retries
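
For example, items 1 and 4 could build directly on the jittered delay (a rough sketch, not part of this PR; it assumes integer-second Retry-After values):

import random

MAX_BACKOFF = 60.0  # item 4: never wait longer than this, even on late attempts

def next_wait(response, base_delay, attempt):
    """Prefer the server's Retry-After hint on 429s, else use capped jittered backoff."""
    if response is not None and response.status_code == 429:
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():                        # item 1: honour the server's hint
            return min(float(retry_after), MAX_BACKOFF)
    wait = base_delay * (2 ** attempt) * (0.5 + random.random())
    return min(wait, MAX_BACKOFF)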

Files Changed

  • main.py: Added jitter to retry logic (11 lines changed)
  • tests/test_retry_jitter.py: 7 comprehensive test cases (new file)
  • .github/copilot/instructions/api-retry-strategy.md: Performance guide (new file)
  • benchmark_retry_jitter.py: Interactive demonstration tool (new file)

Addresses: Performance target from discussion #219
Risk level: Low (only affects error path, extensively tested)
Performance gain: Prevents thundering herd, improves API reliability under load

AI generated by Daily Perf Improver

Implements randomized retry delays (±50% jitter) to prevent thundering
herd when multiple failed requests retry simultaneously.

**Performance Impact:**
- Prevents API server load spikes during retry storms
- Distributes retry timing across 2-6s range instead of synchronized 4s
- Reduces likelihood of cascading failures during API outages
- Zero overhead on successful requests (only affects retry path)

**Implementation:**
- Added random module import
- Modified _retry_request() to multiply base backoff by random factor [0.5, 1.5]
- Updated log format to show actual jittered delay with 2 decimal places
- Maintains existing behavior: 4xx fail-fast, exponential growth, max retries

**Testing:**
- Added 7 comprehensive test cases covering jitter bounds, exponential growth,
  error handling, and successful retries
- Validates jitter stays within [0.5x, 1.5x] range
- Confirms 4xx errors still fail fast without retries
- Verifies 429 rate limits retry with jittered backoff

Addresses maintainer feedback from discussion #219 requesting
"exponential backoff with jitter" for improved retry reliability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@trunk-io

trunk-io bot commented Feb 17, 2026

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

@abhimehro abhimehro marked this pull request as ready for review February 17, 2026 05:24
Copilot AI review requested due to automatic review settings February 17, 2026 05:24
@abhimehro abhimehro self-assigned this Feb 17, 2026

Copilot AI left a comment

Pull request overview

Adds exponential backoff jitter to the HTTP retry path to reduce synchronized retry spikes (“thundering herd”) during upstream outages/rate-limiting, plus supporting tests and developer guidance.

Changes:

  • Apply ±50% jitter factor to _retry_request() exponential backoff delays and log the jittered delay with 2-decimal precision.
  • Add a new jitter-focused unit test suite for retry behavior.
  • Add a retry-strategy guide and a benchmark/demo script.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File descriptions:

  • main.py: Adds jittered exponential backoff to _retry_request() and updates retry log formatting.
  • tests/test_retry_jitter.py: New tests intended to validate jitter bounds, retry behavior, and max retry enforcement.
  • .github/copilot/instructions/api-retry-strategy.md: New internal guide documenting the retry strategy and recommended jitter approach.
  • benchmark_retry_jitter.py: New demo script illustrating the difference between deterministic backoff and jittered backoff.

Comment on lines +53 to +56
# Due to jitter, wait times should differ between runs
# (with high probability - could theoretically be equal but extremely unlikely)
assert wait_times_run1 != wait_times_run2, \
"Jitter should produce different wait times across runs"

Copilot AI Feb 17, 2026

test_jitter_adds_randomness_to_retry_delays is probabilistic and can be flaky (there’s a non-zero chance both retry sequences produce identical sleep values). To make this deterministic, patch random.random() with two different known sequences (or assert that time.sleep was called with values derived from patched jitter factors) instead of relying on natural RNG variance.
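
A sketch of the deterministic variant being suggested (hypothetical; assumes main imports the random and time modules and that the jitter is computed via random.random()):

from unittest.mock import MagicMock, patch

import main

def test_jitter_factor_is_applied():
    """Drive jitter with a known sequence and assert the exact computed delays."""
    request_func = MagicMock(return_value=MagicMock(status_code=500))

    with patch("main.random.random", side_effect=[0.5, 0.25]), \
         patch("main.time.sleep") as mock_sleep:
        try:
            main._retry_request(request_func, max_retries=3, delay=1)
        except Exception:
            pass

    # attempt 0: 1 * 2**0 * (0.5 + 0.5) = 1.0; attempt 1: 1 * 2**1 * (0.5 + 0.25) = 1.5
    waits = [call.args[0] for call in mock_sleep.call_args_list]
    assert waits == [1.0, 1.5]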

@github-actions
Author

👋 Development Partner is reviewing this PR. Will provide feedback shortly.

Check notice — Code scanning / Bandit

Note: Standard pseudo-random generators are not suitable for security/cryptographic purposes. (Flagged on each of the jitter calculations below.)

delays = []
for attempt in range(max_retries - 1):
    base_wait = base_delay * (2 ** attempt)
    jitter_factor = 0.5 + random.random()  # [0.5, 1.5]

# Simulate retry distribution
retry_times = []
for _ in range(num_clients):
    first_retry = (base_delay * (0.5 + random.random()))

# Jitter: multiply by random factor in range [0.5, 1.5] to spread retries
# This prevents multiple failed requests from retrying simultaneously
base_wait = delay * (2**attempt)
jitter_factor = 0.5 + random.random()  # Random value between 0.5 and 1.5
Check notice — Code scanning / Bandit

Note (test): Use of assert detected. The enclosed code will be removed when compiling to optimised byte code. (Flagged on each assert statement in the new test suite; excerpts below.)

wait_times_run2 = [call.args[0] for call in mock_sleep.call_args_list]

# Both runs should have same number of retries (2 retries for 3 max_retries)
assert len(wait_times_run1) == 2
assert len(wait_times_run2) == 2

response = main._retry_request(request_func, max_retries=5, delay=1)

# Should have made 3 requests total (2 failures + 1 success)
assert request_func.call_count == 3

# Should have slept twice (after first two failures)
assert mock_sleep.call_count == 2

# Should return the successful response
assert response.status_code == 200

main._retry_request(request_func, max_retries=4, delay=1)

# Should attempt exactly max_retries times
assert request_func.call_count == 4

# Should sleep max_retries-1 times (no sleep after final failure)
assert mock_sleep.call_count == 3