feat: Add retry option for fetching rulesets #38

Open
stephenreid wants to merge 4 commits into statsig-io:main from stephenreid:Retry-ruleset-downloads

@stephenreid stephenreid commented Feb 26, 2026

Pull Request: Default Network Timeout and Ruleset Retry Logic

Summary

This PR introduces a default network_timeout and a configurable ruleset_id_list_retry_limit to the Ruby SDK to improve reliability during transient network instability.

  1. Default Network Timeout: Set to 30 seconds. Previously, this defaulted to nil, which could lead to hung requests in certain environments depending on the underlying HTTP client's behavior.
  2. Ruleset & ID List Retries: Introduced ruleset_id_list_retry_limit (defaulting to 3) to ensure that fetching configuration specs and ID lists is resilient to intermittent failures.
  3. Alphabetical Organization: Reordered StatsigOptions and the corresponding documentation table alphabetically for better maintainability.
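
As a rough illustration of the two defaults described above, here is a minimal sketch. `SketchOptions` is a hypothetical stand-in, not the SDK's StatsigOptions class; only the option names network_timeout and ruleset_id_list_retry_limit and their default values come from this PR.

```ruby
# Hypothetical stand-in for the SDK's options class. Only the two options
# named in this PR are modeled; the class name is illustrative.
class SketchOptions
  attr_reader :network_timeout, :ruleset_id_list_retry_limit

  # Defaults mirror the values stated in the PR summary:
  # a 30 second timeout and 3 retries.
  def initialize(network_timeout: 30, ruleset_id_list_retry_limit: 3)
    @network_timeout = network_timeout
    @ruleset_id_list_retry_limit = ruleset_id_list_retry_limit
  end
end

# Callers who pass nothing get the new defaults.
defaults = SketchOptions.new

# Callers can still override both values explicitly.
custom = SketchOptions.new(network_timeout: 5, ruleset_id_list_retry_limit: 0)
```

Keeping both values overridable means callers who relied on the old behavior can still opt out of retries by passing 0.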

Resilience to Cloudflare 520 Errors

Cloudflare 520 ("Web Server Returned an Unknown Error") is a catch-all for unexpected responses from the origin. These are often transient and can occur during brief periods of high latency or socket hangs.

  • Timeout Protection: With a 30s network_timeout enforced, if a Cloudflare edge or the origin hangs (common in 520 scenarios), the SDK closes the connection instead of waiting indefinitely.
  • Automatic Recovery: Since 520 errors are frequently intermittent, the new retry logic lets the SDK automatically attempt a fresh request. This raises the chance of a successful initialization or sync without manual intervention or app restarts.
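
The recovery behavior can be sketched as a bounded retry loop. This is a simplified illustration, not the SDK's Network#request: `fetch_with_retries` is a hypothetical helper, the block stands in for the HTTP call, and treating the limit as "retries after the first attempt" is an assumption. Sleeping between attempts is omitted here.

```ruby
# Illustrative only. The block is called with the 1-based attempt number and
# must return an HTTP status code. Assumption: retry_limit counts retries
# made after the first attempt, so at most retry_limit + 1 calls happen.
def fetch_with_retries(retry_limit:, retryable: [520])
  attempts = 0
  loop do
    attempts += 1
    status = yield(attempts)
    # Stop on any status that is not considered transient.
    return { status: status, attempts: attempts } unless retryable.include?(status)
    # Stop once the retry budget is used up, reporting the last failure.
    return { status: status, attempts: attempts } if attempts > retry_limit
  end
end

# Simulated origin: two transient 520 responses, then success.
responses = [520, 520, 200]
result = fetch_with_retries(retry_limit: 3) { |n| responses[n - 1] }

# Simulated origin that never recovers.
exhausted = fetch_with_retries(retry_limit: 3) { |_n| 520 }
```

With a limit of 0 the loop degenerates to a single attempt, which matches the behavior of having no retries.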

Stability via Backoff and Jitter

The SDK utilizes an exponential backoff strategy with added jitter for these retries:

  • Exponential Backoff: When a request fails, the SDK waits before retrying, and the wait grows after each failure (backoff * @backoff_multiplier). This helps prevent "thundering herd" issues where a recovering service is immediately overwhelmed by a flood of simultaneous retries from all SDK instances.
  • Jitter: By incorporating randomness into the sleep interval (seen in Network#request), we ensure that multiple distributed Ruby processes don't synchronize their retry attempts. This spreads the load over time, providing the network layer and Statsig's infrastructure a better window to stabilize and process requests successfully.
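
The two ideas above can be combined in a short sketch. This is not the code in Network#request: `backoff_delay`, the base delay of 1 second, the multiplier of 2, and the jitter range of [0.5, 1.0) are all assumptions chosen for illustration.

```ruby
# Delay in seconds before retry number `attempt` (1-based).
# Assumptions: the delay ceiling is multiplied by `multiplier` after every
# failure, and jitter scales it by a random factor in [0.5, 1.0).
def backoff_delay(attempt, base: 1.0, multiplier: 2.0, rng: Random.new)
  ceiling = base * (multiplier**(attempt - 1))
  ceiling * (0.5 + rng.rand * 0.5)
end

# A seeded generator keeps the example reproducible.
rng = Random.new(42)
delays = (1..4).map { |attempt| backoff_delay(attempt, rng: rng) }

# Each delay falls in the upper half of its window: [0.5, 1), [1, 2),
# [2, 4), [4, 8). A real caller would sleep for the computed delay.
delays.each_with_index do |delay, index|
  puts format("retry %d: wait %.2fs", index + 1, delay)
end
```

Because each process draws its own random factor, two processes that fail at the same moment are unlikely to retry at the same moment.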

Test Plan

  • Verified that StatsigOptions defaults network_timeout to 30 and ruleset_id_list_retry_limit to 3.
  • Added assertions to test/test_network_timeout.rb to confirm default timeout behavior.
  • Added test_ruleset_id_list_retries to test/test_network.rb to verify that the SDK correctly retries failed config fetches up to the specified limit.
  • Ran existing test suite to ensure no regressions in network handling.


stephenreid commented Feb 26, 2026

@lfoster-statsig I explored the SDK a bit. The default for ruleset download retries is currently 0; raising it can help recover from failed initialization.
