Clock Skew and Stoplight Reliability

Clock skew refers to the difference in time between different systems or servers. In distributed circuit breaker implementations like Stoplight, clock skew can significantly impact the reliability and consistency of circuit breaker decisions across multiple application instances.

Stoplight uses a distributed, leader-less architecture where multiple application instances coordinate circuit breaker state through Redis. This coordination relies heavily on time-based mechanisms:

Time-Bucketed Metrics: Success/failure counts are stored in Redis ZSETs using timestamps as scores
Window-Based Calculations: Error rates are calculated over sliding time windows
State Transitions: Circuit breaker state changes (green → red → yellow) are time-coordinated
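
To make the time dependence concrete, here is a minimal sketch of a sliding-window error rate computed from timestamp-scored entries. It does not reproduce Stoplight's internals: FakeSortedSet is a hypothetical in-memory stand-in for a Redis ZSET so the example runs without a server, and the 60-second window is an illustrative value.

```ruby
# Hypothetical in-memory stand-in for a Redis sorted set (not part of Stoplight).
# Each entry is a [score, member] pair, with a Unix timestamp as the score.
class FakeSortedSet
  def initialize
    @entries = []
  end

  # Mirrors ZADD key score member
  def add(score, member)
    @entries << [score, member]
  end

  # Mirrors ZCOUNT key min max (both bounds inclusive)
  def count(min, max)
    @entries.count { |score, _member| score >= min && score <= max }
  end
end

# Error rate over the window [now - window, now], as seen by one server.
def error_rate(successes, failures, now:, window: 60)
  failed = failures.count(now - window, now)
  succeeded = successes.count(now - window, now)
  total = failed + succeeded
  total.zero? ? 0.0 : failed.to_f / total
end

successes = FakeSortedSet.new
failures = FakeSortedSet.new
now = 1_609_459_200

successes.add(now - 30, "s1")
successes.add(now - 20, "s2")
successes.add(now - 10, "s3")
failures.add(now - 5, "f1")
failures.add(now - 90, "f_old") # older than the 60-second window, so ignored

puts error_rate(successes, failures, now: now) # 1 failure out of 4 calls in the window
```

Note that both the scores written and the window bounds read depend on whichever clock supplies `now`, which is why skew between servers matters.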

When application servers have different system times, several issues can arise:

1. Inconsistent Error Count Windows

# Server A (time: 10:00:00) records a failure
redis.zadd("circuit:failures", 1609459200, "failure_1")

# Server B (time: 9:59:50, 10 seconds behind) queries for the last 60 seconds
# Its window is computed from its own clock: (1609459190 - 60)..1609459190
# Server A's failure (score 1609459200) is newer than that upper bound,
# so Server B does not count it until its own clock catches up
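
The effect can be reproduced without Redis. The sketch below assumes each server counts failures in a window bounded on both sides by its own clock; the helper name, the 60-second window, and the timestamps are illustrative rather than taken from Stoplight.

```ruby
WINDOW = 60 # seconds; illustrative value

# Count the failure timestamps a server sees in [local_now - WINDOW, local_now].
def failures_in_window(failure_timestamps, local_now)
  failure_timestamps.count { |ts| ts >= local_now - WINDOW && ts <= local_now }
end

true_now = 1_609_459_200
# Three failures recorded by Server A within the last two seconds
failure_timestamps = [true_now - 2, true_now - 1, true_now]

server_a_count = failures_in_window(failure_timestamps, true_now)      # in sync
server_b_count = failures_in_window(failure_timestamps, true_now - 10) # 10 s behind

puts "Server A sees #{server_a_count} failures" # Server A sees 3 failures
puts "Server B sees #{server_b_count} failures" # Server B sees 0 failures
```

Server B only starts counting those failures once its own clock passes their timestamps, roughly 10 seconds later.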

2. Desynchronized Recovery Timing

Server A might transition to the yellow (half-open) state at 10:05:00
Server B (running 10 seconds behind) transitions when its own clock reads 10:05:00, which is 10:05:10 by Server A's clock
This creates a 10-second window where the servers disagree on the circuit's state
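
The size of that window follows directly from the clock offsets. As a rough sketch, assume each server leaves the red state once its local clock passes a shared deadline (the time the circuit opened plus a cool-off period). COOL_OFF and the helper below are hypothetical names chosen for illustration.

```ruby
COOL_OFF = 300 # seconds; hypothetical cool-off before a recovery attempt

# Reference-clock time at which a server's local clock reaches the deadline.
# offset is the server's local clock minus the reference clock, in seconds.
def recovery_time_in_reference(opened_at, offset)
  opened_at + COOL_OFF - offset
end

opened_at = 1_609_459_200
server_a = recovery_time_in_reference(opened_at, 0)   # clock in sync
server_b = recovery_time_in_reference(opened_at, -10) # clock 10 s behind

disagreement = (server_b - server_a).abs
puts "Servers disagree on state for #{disagreement} seconds" # 10 seconds
```

Under this assumption the disagreement window equals the difference between the two servers' offsets, regardless of the cool-off length.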

3. Race Conditions in State Management

Timeline with 5-second skew:

T+0: Server A sees high error rate, transitions to RED
T+3: Server B (lagging) still sees circuit as GREEN, allows traffic
T+5: Server B finally sees the same error rate, transitions to RED

Result: up to 5 seconds (the size of the skew) during which the two servers act on different circuit states

Impact on Stoplight Behavior

Stoplight is designed to handle minor clock differences gracefully. Small amounts of clock skew (typically under 2-5 seconds) usually don't cause significant issues because:

Statistical Smoothing: Error rate calculations over larger windows (60+ seconds) naturally smooth out small timing inconsistencies
Probabilistic Nature: Circuit breaker decisions are based on statistical thresholds rather than exact counts
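
A small experiment illustrates the smoothing effect. The sketch below uses synthetic traffic (one call per second, every tenth call failing) and compares the error rate seen by an in-sync server with one whose clock is 3 seconds behind, for a short and a long window. All names and values are illustrative assumptions.

```ruby
# Synthetic traffic: one call per second, every 10th call fails.
def failure?(timestamp)
  (timestamp % 10).zero?
end

# Error rate over the inclusive window [now - window, now].
def error_rate(now, window)
  timestamps = ((now - window)..now).to_a
  timestamps.count { |ts| failure?(ts) }.to_f / timestamps.size
end

now = 1_000
skew = 3 # seconds

# Absolute difference in observed error rate, keyed by window size.
diffs = [5, 60].to_h do |window|
  in_sync = error_rate(now, window)
  skewed = error_rate(now - skew, window)
  [window, (in_sync - skewed).abs]
end

diffs.each do |window, diff|
  puts format("window=%2ds difference in error rate=%.3f", window, diff)
end
```

With these inputs the 5-second window disagrees by roughly 0.17, while the 60-second window disagrees by less than 0.02, so the same skew is far less likely to push a longer window across a threshold.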

However, larger clock skews can still cause problems that affect system reliability and consistency:

Some servers may continue sending traffic to failing services while others have already opened the circuit
Servers attempt recovery at different times, potentially overwhelming the downstream service
Different servers report different error rates for the same time period

Detection and Monitoring

Stoplight's Redis data store ships with automatic clock skew detection that helps identify misconfigured servers. The detection runs periodically and compares the application server's local time with the Redis server's time. If a significant clock skew is detected (default: more than 5 seconds), it produces a warning:

Detected clock skew between Redis and the application server. Redis time: 1609459200, Application time: 1609459195
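
The comparison itself is simple to sketch. The code below is not Stoplight's implementation. It assumes the server time comes from the Redis TIME command, which replies with a [unix_seconds, microseconds] pair (available as `redis.time` in the redis-rb gem), and takes that pair as an argument so the example runs without a Redis server. The constant and method names are hypothetical.

```ruby
SKEW_TOLERANCE = 5 # seconds; mirrors the default threshold described above

# redis_time is the two-element reply of the Redis TIME command:
# [unix_seconds, microseconds]. Returns a warning string, or nil when the
# skew is within tolerance.
def clock_skew_warning(redis_time, local_time = Time.now)
  redis_seconds = redis_time[0].to_i
  local_seconds = local_time.to_i
  skew = (redis_seconds - local_seconds).abs
  return nil unless skew > SKEW_TOLERANCE

  "Detected clock skew between Redis and the application server. " \
    "Redis time: #{redis_seconds}, Application time: #{local_seconds}"
end

# 10 seconds apart: produces a warning
puts clock_skew_warning([1_609_459_200, 0], Time.at(1_609_459_190)).inspect
# 2 seconds apart: within tolerance, prints nil
puts clock_skew_warning([1_609_459_200, 0], Time.at(1_609_459_198)).inspect
```

In a real deployment the round trip to Redis adds some latency to the measurement, which is one reason to keep the tolerance at a few seconds rather than zero.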

Solutions and Best Practices

The recommended approach for solving the clock skew issue is to configure NTP or a similar time synchronization service on all servers in your deployment:

Application servers running Stoplight
Redis servers storing circuit breaker data

For most applications, small clock skew is tolerable, but proper time sync is still recommended for consistency and to avoid edge cases. Most modern Linux distributions include time synchronization by default. Verify it's enabled and working properly. See the Further Reading section below for detailed NTP configuration instructions.

If you have a specific deployment where clock skew warnings are not useful (e.g., development or testing environments with known skew), you can disable them:

# Initialize Redis data store without clock skew warnings
Stoplight.configure do |config|
  config.data_store = Stoplight::DataStore::Redis.new(redis, warn_on_clock_skew: false)
end
