Daily Perf Improver - Add jitter to retry backoff for improved API reliability #295
github-actions[bot] wants to merge 7 commits into main
Conversation
Implements randomized retry delays (±50% jitter) to prevent a thundering herd when multiple failed requests retry simultaneously.

**Performance Impact:**
- Prevents API server load spikes during retry storms
- Distributes retry timing across a 2-6s range instead of a synchronized 4s
- Reduces the likelihood of cascading failures during API outages
- Zero overhead on successful requests (only affects the retry path)

**Implementation:**
- Added `random` module import
- Modified `_retry_request()` to multiply the base backoff by a random factor in [0.5, 1.5)
- Updated the log format to show the actual jittered delay with 2 decimal places
- Maintains existing behavior: 4xx fail-fast, exponential growth, max retries

**Testing:**
- Added 7 test cases covering jitter bounds, exponential growth, error handling, and successful retries
- Validates jitter stays within the [0.5x, 1.5x) range
- Confirms 4xx errors still fail fast without retries
- Verifies 429 rate limits retry with jittered backoff

Addresses maintainer feedback from discussion #219 requesting "exponential backoff with jitter" for improved retry reliability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds exponential backoff jitter to the HTTP retry path to reduce synchronized retry spikes (“thundering herd”) during upstream outages/rate-limiting, plus supporting tests and developer guidance.
Changes:
- Apply a ±50% jitter factor to `_retry_request()` exponential backoff delays and log the jittered delay with 2-decimal precision.
- Add a new jitter-focused unit test suite for retry behavior.
- Add a retry-strategy guide and a benchmark/demo script.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `main.py` | Adds jittered exponential backoff to `_retry_request()` and updates retry log formatting. |
| `tests/test_retry_jitter.py` | New tests intended to validate jitter bounds, retry behavior, and max retry enforcement. |
| `.github/copilot/instructions/api-retry-strategy.md` | New internal guide documenting the retry strategy and recommended jitter approach. |
| `benchmark_retry_jitter.py` | New demo script illustrating the difference between deterministic backoff and jittered backoff. |
```python
# Due to jitter, wait times should differ between runs
# (with high probability - could theoretically be equal but extremely unlikely)
assert wait_times_run1 != wait_times_run2, \
    "Jitter should produce different wait times across runs"
```
test_jitter_adds_randomness_to_retry_delays is probabilistic and can be flaky (there’s a non-zero chance both retry sequences produce identical sleep values). To make this deterministic, patch random.random() with two different known sequences (or assert that time.sleep was called with values derived from patched jitter factors) instead of relying on natural RNG variance.
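A minimal sketch of that deterministic approach, assuming the module layout from this PR (`main._retry_request` with `time.sleep` patched, and a 500 response treated as retryable):

```python
from unittest.mock import MagicMock, patch

import main  # module under test in this PR


@patch("main.time.sleep")
def test_jitter_is_deterministic_when_rng_is_patched(mock_sleep):
    # Two 500 failures then a success: exactly two jittered sleeps occur.
    fail, ok = MagicMock(status_code=500), MagicMock(status_code=200)
    request_func = MagicMock(side_effect=[fail, fail, ok])

    # Fix the RNG so the jitter factors are exactly 0.75 and 1.25.
    with patch("main.random.random", side_effect=[0.25, 0.75]):
        main._retry_request(request_func, max_retries=3, delay=1)

    waits = [call.args[0] for call in mock_sleep.call_args_list]
    assert waits == [1 * (2 ** 0) * 0.75, 1 * (2 ** 1) * 1.25]
```

With the RNG pinned, the expected sleep values are exact, so the assertion can never flake.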
```python
delays = []
for attempt in range(max_retries - 1):
    base_wait = base_delay * (2 ** attempt)
    jitter_factor = 0.5 + random.random()  # [0.5, 1.5)
```
**Code scanning / Bandit** (note): Standard pseudo-random generators are not suitable for security/cryptographic purposes.
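Since the jitter here is retry timing, not cryptography, this finding can be acknowledged rather than fixed; Bandit supports inline suppression (B311 is its ID for this check):

```python
jitter_factor = 0.5 + random.random()  # nosec B311 - retry jitter, not cryptography
```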
```python
# Simulate retry distribution
retry_times = []
for _ in range(num_clients):
    first_retry = base_delay * (0.5 + random.random())
```
**Code scanning / Bandit** (note): Standard pseudo-random generators are not suitable for security/cryptographic purposes.
```python
# Jitter: multiply by random factor in range [0.5, 1.5) to spread retries
# This prevents multiple failed requests from retrying simultaneously
base_wait = delay * (2 ** attempt)
jitter_factor = 0.5 + random.random()  # Random value in [0.5, 1.5)
```
**Code scanning / Bandit** (note): Standard pseudo-random generators are not suitable for security/cryptographic purposes.
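For context, a minimal sketch of how the jittered loop could look inside `_retry_request()`, reconstructed from this PR's description (the 4xx/429 handling and the log call are assumptions, not the actual `main.py`):

```python
import random
import time


def _retry_request(request_func, max_retries=3, delay=1):
    """Sketch: exponential backoff with +/-50% jitter, 4xx fail-fast, 429 retryable."""
    for attempt in range(max_retries):
        response = request_func()
        if response.status_code < 400:
            return response
        # Fail fast on client errors, except 429 rate limits, which are retried.
        if 400 <= response.status_code < 500 and response.status_code != 429:
            return response
        if attempt == max_retries - 1:
            break  # no sleep after the final failure
        # Jitter: multiply by a random factor in [0.5, 1.5) to spread retries.
        base_wait = delay * (2 ** attempt)
        wait = base_wait * (0.5 + random.random())
        print(f"Request failed ({response.status_code}); retrying in {wait:.2f}s")
        time.sleep(wait)
    return response
```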
```python
wait_times_run2 = [call.args[0] for call in mock_sleep.call_args_list]

# Both runs should have same number of retries (2 retries for 3 max_retries)
assert len(wait_times_run1) == 2
assert len(wait_times_run2) == 2
```
**Code scanning / Bandit** (note, test code): Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
```python
response = main._retry_request(request_func, max_retries=5, delay=1)

# Should have made 3 requests total (2 failures + 1 success)
assert request_func.call_count == 3

# Should have slept twice (after first two failures)
assert mock_sleep.call_count == 2

# Should return the successful response
assert response.status_code == 200
```
**Code scanning / Bandit** (note, test code): Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
```python
main._retry_request(request_func, max_retries=4, delay=1)

# Should attempt exactly max_retries times
assert request_func.call_count == 4

# Should sleep max_retries-1 times (no sleep after final failure)
assert mock_sleep.call_count == 3
```
**Code scanning / Bandit** (note, test code): Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
Goal and Rationale
Performance target: Improve retry reliability under API failures by preventing thundering herd syndrome
Why it matters: When multiple clients experience simultaneous failures (e.g., API outage), synchronized retries create load spikes that can overwhelm a recovering server. This is a critical concern for rate-limited APIs like Control D.
Maintainer priority: Directly addresses feedback from discussion #219: "exponential backoff with jitter" and "API rate limits are non-negotiable"
Approach
Implemented ±50% jitter on retry delays using industry-standard formula:
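In code, matching the `main.py` diff shown earlier:

```python
wait = delay * (2 ** attempt) * (0.5 + random.random())  # exponential backoff x jitter in [0.5, 1.5)
```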
This spreads retries across a time window while maintaining exponential backoff behavior.
Example: A 4-second base delay becomes a 2-6 second range, preventing simultaneous retries from 100 clients at exactly t=4s.
Implementation
Changes made:
- Added `random` module import
- Modified `_retry_request()` to apply a jitter factor in [0.5, 1.5) to the exponential backoff
- Updated the retry log format to show the actual jittered delay with 2 decimal places

Code quality:
Impact Measurement
Synthetic Benchmark Results
Run `python3 benchmark_retry_jitter.py` to see the demonstration:

Without jitter (old): every client retries at exactly t=4s.
With jitter (new): retries are spread across the 2-6s window.
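A sketch of what the script demonstrates, based on the diff shown earlier (`num_clients`, the 4s base delay, and the printed summary are assumptions):

```python
import random


def simulate_first_retries(num_clients=100, base_delay=4.0, jitter=True):
    """Return the time at which each client fires its first retry."""
    retry_times = []
    for _ in range(num_clients):
        factor = (0.5 + random.random()) if jitter else 1.0
        retry_times.append(base_delay * factor)
    return retry_times


without = simulate_first_retries(jitter=False)
with_jitter = simulate_first_retries(jitter=True)
print(f"no jitter : all {len(without)} clients retry at t={without[0]:.2f}s")
print(f"jitter    : retries spread over t={min(with_jitter):.2f}s..{max(with_jitter):.2f}s")
```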
Performance Overhead
- Zero overhead on successful requests; jitter only touches the retry path (a `random.random()` call costs ~1µs)

Reliability Impact
Before: synchronized retries hit the recovering server at exactly t=4s, then t=8s, then t=16s.
After: each retry lands in a jittered window (2-6s for the first), spreading load across clients.
Trade-offs
Complexity: Minimal - added one random multiplication
Maintainability: Improved - added comprehensive documentation and tests
Determinism: Retries are now randomized, but within predictable bounds
Validation
Test Coverage
Added `tests/test_retry_jitter.py` with 7 test cases covering jitter bounds, exponential growth, 4xx fail-fast behavior, 429 retry handling, and max-retry enforcement.

Testing approach: `time.sleep` is mocked so the suite runs without real delays, and the recorded sleep arguments are asserted against the expected jitter bounds.
Reproducibility
Quick validation:
Integration test:
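Plausible invocations, assuming a standard pytest setup (the flags and paths below are assumptions, not captured from the PR):

```bash
# Quick validation: run the jitter test suite
python3 -m pytest tests/test_retry_jitter.py -v

# Integration demo: compare deterministic vs. jittered backoff
python3 benchmark_retry_jitter.py
```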
Future Work
Based on the API retry strategy guide (`.github/copilot/instructions/api-retry-strategy.md`):
- Honor `Retry-After` headers from 429 responses

Files Changed
- `main.py`: Added jitter to retry logic (11 lines changed)
- `tests/test_retry_jitter.py`: 7 comprehensive test cases (new file)
- `.github/copilot/instructions/api-retry-strategy.md`: Performance guide (new file)
- `benchmark_retry_jitter.py`: Interactive demonstration tool (new file)

Addresses: Performance target from discussion #219
Risk level: Low (only affects error path, extensively tested)
Performance gain: Prevents thundering herd, improves API reliability under load