Clock Skew and Stoplight Reliability

Tëma Bolshakov edited this page May 26, 2025 · 1 revision

Clock skew refers to the difference in time between different systems or servers. In distributed circuit breaker implementations like Stoplight, clock skew can significantly impact the reliability and consistency of circuit breaker decisions across multiple application instances.

Stoplight uses a distributed, leader-less architecture where multiple application instances coordinate circuit breaker state through Redis. This coordination relies heavily on time-based mechanisms:

  • Time-Bucketed Metrics: Success/failure counts are stored in Redis ZSETs using timestamps as scores
  • Window-Based Calculations: Error rates are calculated over sliding time windows
  • State Transitions: Circuit breaker state changes (green → red → yellow) are time-coordinated

When application servers have different system times, several issues can arise:

1. Inconsistent Error Count Windows

# Server A (time: 10:00:00) records a failure with score 1609459200
redis.zadd("circuit:failures", 1609459200, "failure_1")

# Server B (time: 9:59:50, 10 seconds behind) queries for the last 60 seconds.
# Its window is computed from its own clock: (1609459190 - 60) up to 1609459190.
# If the query is bounded above by Server B's current time, the failure scored
# 1609459200 falls past the end of the window and is missed until B's clock catches up.
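
The effect in the example above can be reproduced without Redis. The following is a minimal sketch, assuming each server counts events whose timestamp lies between `now - window` and `now` on its own clock. The `failures_in_window` helper and this exact windowing rule are illustrative assumptions, not Stoplight's actual query.

```ruby
# Hypothetical helper: counts failure timestamps (epoch seconds) that fall
# inside the window [now - window, now] as seen by one server's local clock.
def failures_in_window(timestamps, now:, window: 60)
  timestamps.count { |ts| ts >= now - window && ts <= now }
end

reference_now = 1_609_459_200 # Server A's clock, treated as the reference

# Failure scores as written by Server A: 30, 5, and 0 seconds ago.
failures = [reference_now - 30, reference_now - 5, reference_now]

server_a_now = reference_now      # in sync
server_b_now = reference_now - 10 # 10 seconds behind

count_a = failures_in_window(failures, now: server_a_now)
count_b = failures_in_window(failures, now: server_b_now)

puts "Server A counts #{count_a} failures"
puts "Server B counts #{count_b} failures"
```

Under these assumptions Server B undercounts because the two most recent failures appear to lie in its future.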

2. Desynchronized Recovery Timing

  • Server A might transition to the yellow (half-open) state at 10:05:00
  • Server B (running 10 seconds behind) transitions when its own clock reads 10:05:00, which is 10:05:10 by Server A's clock
  • This creates a 10-second window where servers disagree on state
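
The size of the disagreement window follows directly from the clock offset. A small sketch, assuming each server enters the yellow state once its own clock reaches the same target time; the `ServerClock` struct and its `offset` field are hypothetical names used only for illustration:

```ruby
# Hypothetical model of a server whose clock differs from a reference clock.
# A negative offset means the server's clock runs behind the reference.
ServerClock = Struct.new(:name, :offset) do
  # Reference time at which this server's local clock shows local_target.
  def reference_time_for(local_target)
    local_target - offset
  end
end

recover_at = Time.utc(2021, 1, 1, 10, 5, 0) # local time at which a server goes yellow

server_a = ServerClock.new("A", 0)   # in sync
server_b = ServerClock.new("B", -10) # 10 seconds behind

a_switch = server_a.reference_time_for(recover_at)
b_switch = server_b.reference_time_for(recover_at)
disagreement = (b_switch - a_switch).abs

puts "A goes yellow at #{a_switch.strftime('%H:%M:%S')} reference time"
puts "B goes yellow at #{b_switch.strftime('%H:%M:%S')} reference time"
puts "Servers disagree for #{disagreement.to_i} seconds"
```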

3. Race Conditions in State Management

Timeline with 5-second skew:

  • T+0: Server A sees high error rate, transitions to RED
  • T+3: Server B (lagging) still sees circuit as GREEN, allows traffic
  • T+5: Server B finally sees the same error rate, transitions to RED

Result: up to 5 seconds of inconsistent behavior

Impact on Stoplight Behavior

Stoplight is designed to handle minor clock differences gracefully. A small amount of clock skew (a few seconds, typically no more than 2-5) usually doesn't cause significant issues because:

  • Statistical Smoothing: Error rate calculations over larger windows (60+ seconds) naturally smooth out small timing inconsistencies
  • Probabilistic Nature: Circuit breaker decisions are based on statistical thresholds rather than exact counts
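
The smoothing effect can be illustrated with a toy calculation. This is a sketch under stated assumptions: one request per second, an outage starting at a known instant, and a hypothetical `error_rate` helper that evaluates the window `(now - window, now]` on a skewed local clock. It does not reproduce Stoplight's internal bookkeeping.

```ruby
# Hypothetical helper: error rate over the window (now - window, now]
# as seen by a server whose clock is `skew` seconds behind the reference.
def error_rate(events, reference_now:, skew:, window: 60)
  local_now = reference_now - skew
  in_window = events.select { |ts, _| ts > local_now - window && ts <= local_now }
  return 0.0 if in_window.empty?

  in_window.count { |_, failed| failed }.fdiv(in_window.size)
end

# One request per second for two minutes; the dependency starts failing at t = 90.
events = (0...120).map { |t| [t, t >= 90] }

[0, 2, 30].each do |skew|
  rate = error_rate(events, reference_now: 119, skew: skew)
  puts format("skew %2ds -> error rate %.2f", skew, rate)
end
```

In this toy data set a 2-second skew moves the computed rate from 0.50 to about 0.47, which rarely changes a threshold decision, while a 30-second skew hides the outage entirely.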

However, larger clock skew can still cause problems that affect reliability and consistency:

  • Some servers may continue sending traffic to failing services while others have already opened the circuit
  • Servers attempt recovery at different times, potentially overwhelming the downstream service
  • Different servers report different error rates for the same time period

Detection and Monitoring

Stoplight's Redis data store ships with automatic clock skew detection that helps identify misconfigured servers. The detection runs periodically and compares the application server's local time with the Redis server time. If the difference exceeds a threshold (default: more than 5 seconds), it produces a warning:

Detected clock skew between Redis and the application server. Redis time: 1609459200, Application time: 1609459195
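
The comparison behind such a warning can be sketched as follows. This is an illustrative approximation, not Stoplight's implementation: the `clock_skew_warning` method and its `threshold` parameter are hypothetical, and both times are passed in as epoch seconds so that the example runs without a Redis connection.

```ruby
# Hypothetical check: returns a warning string when the two clocks differ
# by more than `threshold` seconds, otherwise nil.
def clock_skew_warning(redis_time:, app_time:, threshold: 5)
  skew = (redis_time - app_time).abs
  return nil unless skew > threshold

  "Detected clock skew between Redis and the application server. " \
    "Redis time: #{redis_time}, Application time: #{app_time}"
end

# With a real client, redis_time could come from the Redis TIME command,
# for example `redis.time.first` in the redis-rb gem (assumed, not required here).
puts clock_skew_warning(redis_time: 1_609_459_200, app_time: 1_609_459_190).inspect
puts clock_skew_warning(redis_time: 1_609_459_200, app_time: 1_609_459_198).inspect
```

Reading the time from Redis rather than from a third party keeps the check aligned with the clock that every instance already shares.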

Solutions and Best Practices

The recommended way to address clock skew is to configure NTP or a similar time synchronization service on all servers in your deployment:

  • Application servers running Stoplight
  • Redis servers storing circuit breaker data

For most applications, small clock skew is tolerable, but proper time sync is still recommended for consistency and to avoid edge cases. Most modern Linux distributions include time synchronization by default. Verify it's enabled and working properly. See the Further Reading section below for detailed NTP configuration instructions.

If you have a specific deployment where clock skew warnings are not useful (e.g., development or testing environments with known skew), you can disable them:

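require "redis"
require "stoplight"

redis = Redis.new # defaults to localhost:6379; adjust for your deployment

# Initialize Redis data store without clock skew warnings
Stoplight.configure do |config|
  config.data_store = Stoplight::DataStore::Redis.new(redis, warn_on_clock_skew: false)
end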

Further Reading