Clock Skew and Stoplight Reliability
Clock skew refers to the difference in time between different systems or servers. In distributed circuit breaker implementations like Stoplight, clock skew can significantly impact the reliability and consistency of circuit breaker decisions across multiple application instances.
Stoplight uses a distributed, leader-less architecture where multiple application instances coordinate circuit breaker state through Redis. This coordination relies heavily on time-based mechanisms:
- Time-Bucketed Metrics: Success/failure counts are stored in Redis ZSETs using timestamps as scores
- Window-Based Calculations: Error rates are calculated over sliding time windows
- State Transitions: Circuit breaker state changes (green → red → yellow) are time-coordinated
When application servers have different system times, several issues can arise:
# Server A (time: 10:00:00) records a failure
redis.zadd("circuit:failures", 1609459200, "failure_1")
# Server B (time: 9:59:50, 10 seconds behind) queries for the last 60 seconds.
# It computes the window from its own clock (1609459190), so the failure that
# Server A scored at 1609459200 lies 10 seconds in Server B's future and can be missed.

State transitions drift in the same way:
- Server A might transition to the yellow (half-open) state at 10:05:00
- Server B (running 10 seconds behind) transitions when its own clock reads 10:05:00, which is 10:05:10 on Server A's clock
- This creates a 10-second window where servers disagree on state
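The missed-failure case above can be reproduced without a Redis server. The following is a minimal sketch, assuming a reader bounds its query by its own clock at both ends of the window. `SortedSetModel` and `recent_failures` are hypothetical names used only for illustration; they are not part of Stoplight's API, and Stoplight's actual queries may differ.

```ruby
# In-memory stand-in for a Redis ZSET: members paired with numeric scores.
# Hypothetical helper for illustration only; not part of Stoplight.
class SortedSetModel
  def initialize
    @entries = []
  end

  # Mirrors ZADD key score member.
  def zadd(score, member)
    @entries << [score, member]
  end

  # Mirrors ZRANGEBYSCORE key min max (inclusive bounds).
  def zrangebyscore(min, max)
    @entries.select { |score, _| score >= min && score <= max }
            .sort_by(&:first)
            .map(&:last)
  end
end

# Returns the failures recorded in the last `window` seconds,
# as seen from a server whose clock reads `local_now`.
def recent_failures(set, local_now, window: 60)
  set.zrangebyscore(local_now - window, local_now)
end

failures = SortedSetModel.new
server_a_now = 1_609_459_200      # Server A's clock
server_b_now = server_a_now - 10  # Server B runs 10 seconds behind

failures.zadd(server_a_now, "failure_1") # recorded by Server A

seen_by_a = recent_failures(failures, server_a_now)
seen_by_b = recent_failures(failures, server_b_now)

puts "Server A sees: #{seen_by_a.inspect}"
puts "Server B sees: #{seen_by_b.inspect}"
```

In this model Server B only picks up the failure once its own clock catches up to the score that Server A wrote.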
Timeline with 5-second skew:
- T+0: Server A sees high error rate, transitions to RED
- T+3: Server B (lagging) still sees circuit as GREEN, allows traffic
- T+5: Server B finally sees the same error rate, transitions to RED

Result: 3-5 seconds of inconsistent behavior
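The length of the disagreement window follows directly from the clock offsets. As a rough sketch, assume every server schedules the same transition at the same local clock reading; the offsets and the `scheduled_local` value below are hypothetical numbers chosen to match the 5-second example.

```ruby
# Clock offsets in seconds relative to true time (negative = running behind).
# Hypothetical values for illustration.
offsets = { "server_a" => 0, "server_b" => -5 }

# Local clock reading at which every server schedules the transition.
scheduled_local = 100

# A server whose clock reads (true_time + offset) reaches `scheduled_local`
# when true_time == scheduled_local - offset.
true_transition_times = offsets.transform_values { |offset| scheduled_local - offset }

# The servers disagree from the earliest transition until the latest one.
disagreement = true_transition_times.values.max - true_transition_times.values.min

true_transition_times.each do |name, true_time|
  puts "#{name} transitions at T+#{true_time - scheduled_local}"
end
puts "Servers disagree for #{disagreement} seconds"
```

Under this assumption the disagreement window equals the spread between the largest and smallest offset in the fleet, regardless of how many servers sit in between.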
Stoplight is designed to handle minor clock differences gracefully. Small amounts of clock skew (typically under 2-5 seconds) usually don't cause significant issues because:
- Statistical Smoothing: Error rate calculations over larger windows (60+ seconds) naturally smooth out small timing inconsistencies
- Probabilistic Nature: Circuit breaker decisions are based on statistical thresholds rather than exact counts
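To illustrate the smoothing effect, the sketch below compares the error rate seen by an in-sync server with the rate seen by a server whose clock lags by 3 seconds, for a 60-second window and a 5-second window. The traffic pattern (one request per second, every fourth one failing) and the `error_rate` helper are contrived assumptions for illustration, not Stoplight code. With evenly spaced requests, shifting a window of length W by s seconds replaces at most s/W of its samples, so a longer window bounds the distortion more tightly.

```ruby
# One request per second for two minutes; every fourth request fails.
# Each event is [timestamp, failed?]. Contrived data for illustration.
events = (0...120).map { |t| [t, (t % 4).zero?] }

# Error rate over the half-open window (now - window, now].
def error_rate(events, now, window)
  in_window = events.select { |t, _| t > now - window && t <= now }
  return 0.0 if in_window.empty?

  in_window.count { |_, failed| failed } / in_window.size.to_f
end

true_now   = 119
skewed_now = true_now - 3 # a server whose clock runs 3 seconds behind

long_gap  = (error_rate(events, true_now, 60) - error_rate(events, skewed_now, 60)).abs
short_gap = (error_rate(events, true_now, 5) - error_rate(events, skewed_now, 5)).abs

puts "60-second window: rates differ by #{long_gap.round(3)}"
puts "5-second window:  rates differ by #{short_gap.round(3)}"
```

In this example the two servers agree on the 60-second rate, while their 5-second rates diverge enough that they could land on opposite sides of a threshold.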
However, larger clock skew can cause problems that affect system reliability and consistency:
- Some servers may continue sending traffic to failing services while others have already opened the circuit
- Servers attempt recovery at different times, potentially overwhelming the downstream service
- Different servers report different error rates for the same time period
Stoplight's Redis data store ships with automatic clock skew detection that helps identify misconfigured servers. The detection runs periodically and compares the local time with the Redis server time. If significant clock skew is detected (default: more than 5 seconds), it logs a warning:
Detected clock skew between Redis and the application server. Redis time: 1609459200, Application time: 1609459195
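The comparison behind such a warning can be sketched as follows. This is an illustrative approximation, not Stoplight's internal implementation: `clock_skew_seconds`, `clock_skew_warning`, and `SKEW_THRESHOLD_SECONDS` are hypothetical names. The only external fact it relies on is that the Redis `TIME` command (exposed by redis-rb as `redis.time`) returns a pair of seconds and microseconds.

```ruby
# Threshold taken from the default described above: more than 5 seconds.
SKEW_THRESHOLD_SECONDS = 5

# `redis_time` is the [seconds, microseconds] pair returned by Redis TIME;
# `local_time` is a Ruby Time taken on the application server.
# Hypothetical helper; Stoplight's own check may differ.
def clock_skew_seconds(redis_time, local_time)
  seconds, microseconds = redis_time
  redis_epoch = seconds + microseconds / 1_000_000.0
  (redis_epoch - local_time.to_f).abs
end

# Returns a warning string when the skew exceeds the threshold, otherwise nil.
def clock_skew_warning(redis_time, local_time, threshold: SKEW_THRESHOLD_SECONDS)
  return nil if clock_skew_seconds(redis_time, local_time) <= threshold

  "Detected clock skew between Redis and the application server. " \
    "Redis time: #{redis_time.first}, Application time: #{local_time.to_i}"
end

# With a live connection, `redis.time` would supply the first argument.
redis_time = [1_609_459_200, 0]
puts clock_skew_warning(redis_time, Time.at(1_609_459_198)).inspect # 2 seconds apart
puts clock_skew_warning(redis_time, Time.at(1_609_459_193)).inspect # 7 seconds apart
```

Taking the absolute difference means the check fires whether the application server runs ahead of Redis or behind it.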
The recommended fix for clock skew is to configure NTP or a similar time synchronization service on all servers in your deployment:
- Application servers running Stoplight
- Redis servers storing circuit breaker data
For most applications, small clock skew is tolerable, but proper time sync is still recommended for consistency and to avoid edge cases. Most modern Linux distributions include time synchronization by default. Verify it's enabled and working properly. See the Further Reading section below for detailed NTP configuration instructions.
If you have a specific deployment where clock skew warnings are not useful (e.g., development or testing environments with known skew), you can disable them:
# Initialize Redis data store without clock skew warnings
Stoplight.configure do |config|
  config.data_store = Stoplight::DataStore::Redis.new(redis, warn_on_clock_skew: false)
end