feat: improve breaker by combining error rate and latency #5366

Open
kevwan wants to merge 1 commit into zeromicro:master from kevwan:feat/instance-breaker
Conversation


@kevwan kevwan commented Jan 8, 2026

Overview

This PR enhances the Google SRE circuit breaker with latency-aware rejection, combining traditional error-based breaking with adaptive latency thresholds to protect services from performance degradation.

Motivation

Traditional circuit breakers only track error rates, missing scenarios where services become slow but don't fail. This can lead to:

  • Cascading slowdowns as slow instances consume resources
  • Timeout exhaustion before circuit opens
  • Resource starvation from hanging connections

Technical Implementation

1. Latency Tracking Architecture

The circuit breaker now tracks two latency metrics using exponential moving averages (EMA):

Baseline Latency (No-Load Latency)

  • Represents the service's optimal performance when healthy
  • Fast Decay (factor: 4): Quickly adapts to performance improvements
  • Slow Rise (factor: 100): Resistant to temporary spikes
  • Formula:
    • Decay: baseline = (latency + 3 * baseline) / 4 when latency < baseline
    • Rise: baseline = (latency + 99 * baseline) / 100 when latency > baseline

Current Latency

  • Tracks recent average latency using EMA
  • Factor: 4 - balances responsiveness with stability
  • Formula: current = (latency + 3 * current) / 4
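As a sketch, the two EMA updates above can be written in Go using the factor values from this PR's description (function and variable names here are illustrative, not the PR's actual code):

```go
package main

import "fmt"

const (
	baselineDecayFactor = 4   // fast decay when latency drops below baseline
	baselineRiseFactor  = 100 // slow rise when latency climbs above baseline
	currentFactor       = 4   // smoothing factor for recent latency
)

// updateBaseline applies the asymmetric EMA: quick to trust
// improvements, slow to accept degradation as the new normal.
func updateBaseline(baseline, latencyUs int64) int64 {
	if latencyUs < baseline {
		return (latencyUs + (baselineDecayFactor-1)*baseline) / baselineDecayFactor
	}
	return (latencyUs + (baselineRiseFactor-1)*baseline) / baselineRiseFactor
}

// updateCurrent smooths the recent latency with a single factor.
func updateCurrent(current, latencyUs int64) int64 {
	return (latencyUs + (currentFactor-1)*current) / currentFactor
}

func main() {
	baseline, current := int64(5000), int64(5000) // 5ms in microseconds
	for i := 0; i < 5; i++ {
		spike := int64(140000) // sudden 140ms latency
		baseline = updateBaseline(baseline, spike)
		current = updateCurrent(current, spike)
	}
	// baseline barely moves; current converges toward the spike quickly
	fmt.Println(baseline, current)
}
```

After five spiked samples the baseline has risen only slightly while the current latency has moved most of the way to the spike, which is exactly the asymmetry the description claims.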

2. Rejection Decision Logic

// Calculate both error and latency ratios
errorRatio := calcK(history)
latencyRatio := calcLatencyRatio()
dropRatio := max(errorRatio, latencyRatio)

// Latency ratio calculation
threshold := baseline * 3              // Activation at 3x baseline
ceiling := timeout * 0.95              // Cap at 95% of timeout
ratio := (current - threshold) / (ceiling - threshold) * 0.3  // Max 30% drop

Key Parameters:

  • Activation Multiplier: 3x baseline (triggered when current latency significantly exceeds normal)
  • Ceiling Ratio: 0.95 (95% of timeout value)
  • Max Drop Ratio: 0.3 (latency can contribute up to 30% rejection probability)
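A hedged sketch of how these three parameters combine, including the clamping to the 30% cap that the parameters imply (names are illustrative; the actual implementation is in core/breaker/googlebreaker.go):

```go
package main

import "fmt"

const (
	activationMultiplier = 3.0  // latency breaking kicks in at 3x baseline
	ceilingRatio         = 0.95 // scale tops out at 95% of the timeout
	maxDropRatio         = 0.3  // latency contributes at most 30% rejection
)

// latencyDropRatio maps current latency onto [0, maxDropRatio],
// scaling linearly between 3x baseline and 95% of the timeout.
// All inputs are in microseconds.
func latencyDropRatio(baselineUs, currentUs, timeoutUs float64) float64 {
	if timeoutUs <= 0 || baselineUs <= 0 {
		return 0 // latency tracking disabled, or no baseline yet
	}
	threshold := baselineUs * activationMultiplier
	ceiling := timeoutUs * ceilingRatio
	if ceiling <= threshold || currentUs <= threshold {
		return 0 // degenerate range or latency still within normal bounds
	}
	ratio := (currentUs - threshold) / (ceiling - threshold) * maxDropRatio
	if ratio > maxDropRatio {
		return maxDropRatio
	}
	return ratio
}

func main() {
	// baseline 1ms, current 50ms, timeout 100ms:
	// threshold = 3000us, ceiling = 95000us
	fmt.Println(latencyDropRatio(1000, 50000, 100000))
}
```

The final drop probability would then be the max of this value and the error-based ratio, so latency alone can never reject more than 30% of traffic.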

3. Comprehensive Testing

Added 10 new test scenarios covering:

  • ✅ Latency tracking disabled when timeout=0
  • ✅ Baseline initialization and decay/rise behavior
  • ✅ Current latency EMA calculations
  • ✅ Latency ratio calculations with various thresholds
  • ✅ Edge cases (negative values, ceiling < threshold)
  • ✅ End-to-end latency tracking behavior
  • ✅ Rejection behavior under high latency

Test Coverage: Improved from 98.1% → 99.0%

4. Real-World Simulation Results

Created comprehensive simulation (adhoc/breaker-sim/) demonstrating behavior:

Scenario: Sudden Latency Spike (5ms → 140ms)

  • Phase 1 (Normal): 50 requests, 0 rejected
  • Phase 2 (Spike): 100 requests, 10 rejected (10%), peak 30% in some windows
  • Phase 3 (Recovery): 50 requests, 0 rejected

Key Finding: The breaker effectively detects sudden performance degradation with 10-30% rejection rate, while allowing gradual changes to adapt (0% rejection for slow ramps).

Benefits

  1. Proactive Protection: Rejects requests before timeouts occur
  2. Resource Efficiency: Prevents slow instances from hogging connections
  3. Graceful Degradation: Partial rejection (max 30%) allows system recovery
  4. No False Positives: Adaptive baseline prevents breaking during normal load increases
  5. Complements Error Breaking: Combined error + latency provides comprehensive protection

Configuration

Latency-aware breaking is opt-in via timeout parameter:

// Enable latency-aware breaking with 100ms timeout
breaker := breaker.NewBreaker(breaker.WithTimeout(100 * time.Millisecond))

// Traditional error-only breaking (default)
breaker := breaker.NewBreaker()

Backward Compatibility

✅ Fully backward compatible:

  • Default behavior unchanged (no timeout = no latency tracking)
  • Existing breakers continue working without modification
  • Opt-in activation via WithTimeout option

Related

  • Simulation code: adhoc/breaker-sim/
  • Detailed results: adhoc/breaker-sim/RESULTS.md
  • Technical constants defined in core/breaker/googlebreaker.go:15-27

Copilot AI review requested due to automatic review settings January 8, 2026 15:48

codecov bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



Copilot AI left a comment

Pull request overview

This pull request enhances the Google Breaker circuit breaker implementation by adding latency-aware request rejection alongside the existing error-rate-based mechanism. The breaker now tracks both baseline (no-load) and current latencies using exponential moving averages and combines error ratio with latency ratio to determine the final drop probability.

Key changes:

  • Adds latency tracking with baseline and current latency metrics updated on successful requests
  • Combines error-based and latency-based drop ratios using max() to determine rejection probability
  • Introduces a WithTimeout option to enable latency-aware circuit breaking

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
core/breaker/googlebreaker.go Implements latency tracking fields and algorithms; adds calcLatencyRatio, updateBaselineLatency, updateCurrentLatency methods; modifies markSuccess to accept latency parameter
core/breaker/breaker.go Adds timeout field to circuitBreaker struct and WithTimeout option function; passes timeout to newGoogleBreaker constructor
core/breaker/googlebreaker_test.go Adds comprehensive test coverage for latency tracking, ratio calculations, and combined error/latency scenarios; updates existing tests to pass 0 latency parameter
core/breaker/breaker_test.go Adds tests for WithTimeout option and nopPromise coverage

Comment on lines +206 to +210
func WithTimeout(timeout time.Duration) Option {
return func(b *circuitBreaker) {
b.timeout = timeout
}
}

Copilot AI Jan 8, 2026


The WithTimeout option function lacks a documentation comment. According to go-zero guidelines, exported functions should be documented. This option is part of the public API and users need to understand that it enables latency-aware circuit breaking.

Comment on lines 199 to 227
func (b *googleBreaker) updateBaselineLatency(latencyUs int64) {
noLoadLatency := atomic.LoadInt64(&b.noLoadLatencyUs)
if noLoadLatency <= 0 {
atomic.StoreInt64(&b.noLoadLatencyUs, latencyUs)
return
}

var newBaseline int64
if latencyUs < noLoadLatency {
// Fast decay when latency decreases
newBaseline = (latencyUs + (latencyBaselineDecayFactor-1)*noLoadLatency) / latencyBaselineDecayFactor
} else {
// Slow rise when latency increases
newBaseline = (latencyUs + (latencyBaselineRiseFactor-1)*noLoadLatency) / latencyBaselineRiseFactor
}

atomic.StoreInt64(&b.noLoadLatencyUs, newBaseline)
}

func (b *googleBreaker) updateCurrentLatency(latencyUs int64) {
currentLatency := atomic.LoadInt64(&b.currentLatencyUs)
if currentLatency <= 0 {
atomic.StoreInt64(&b.currentLatencyUs, latencyUs)
return
}

// Exponential moving average: newCurrent = (latencyUs + (factor-1) * oldCurrent) / factor
newCurrent := (latencyUs + (latencyCurrentFactor-1)*currentLatency) / latencyCurrentFactor
atomic.StoreInt64(&b.currentLatencyUs, newCurrent)
}

Copilot AI Jan 8, 2026


There's a potential race condition in updateBaselineLatency and updateCurrentLatency. Between the Load and Store operations, another goroutine could modify the value, causing the calculation to be based on stale data. This is a classic read-modify-write race condition. Consider using atomic.CompareAndSwapInt64 in a loop to ensure atomic read-modify-write, or use a mutex to protect the entire operation.
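A minimal sketch of the CAS-loop approach the reviewer suggests (hypothetical helper, not the PR's code): the loop re-reads the value and retries whenever another goroutine wrote between the load and the swap, so no update is computed from stale data.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// updateEMA performs a race-free read-modify-write of an EMA value
// stored at addr, retrying with CompareAndSwap if another goroutine
// modified the value after our Load.
func updateEMA(addr *int64, latencyUs, factor int64) {
	for {
		old := atomic.LoadInt64(addr)
		if old <= 0 {
			// First sample: seed the EMA with the observed latency.
			if atomic.CompareAndSwapInt64(addr, old, latencyUs) {
				return
			}
			continue // lost the race; retry with the fresh value
		}
		updated := (latencyUs + (factor-1)*old) / factor
		if atomic.CompareAndSwapInt64(addr, old, updated) {
			return
		}
	}
}

func main() {
	var current int64
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				updateEMA(&current, 1000, 4)
			}
		}()
	}
	wg.Wait()
	fmt.Println(atomic.LoadInt64(&current)) // prints 1000
}
```

In practice the Load/Store race here only skews the smoothed estimate slightly rather than corrupting state, so a mutex or CAS loop is a correctness refinement, not a crash fix.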


// Now simulate high latency that exceeds threshold
// threshold = 1000 * 3 = 3000us
// ceiling = 100000 * 0.8 = 80000us

Copilot AI Jan 8, 2026


The comment states "ceiling = 100000 * 0.8 = 80000us" but the actual constant used in the code is latencyCeilingRatio = 0.95, not 0.8. This means the ceiling should be "100000 * 0.95 = 95000us". The comment should be corrected to match the actual implementation.

Suggested change
// ceiling = 100000 * 0.8 = 80000us
// ceiling = 100000 * 0.95 = 95000us

Comment on lines +38 to +39
timeoutUs int64
noLoadLatencyUs int64

Copilot AI Jan 8, 2026


The new fields in the googleBreaker struct (timeoutUs, noLoadLatencyUs, currentLatencyUs) lack documentation comments explaining their purpose. According to go-zero coding guidelines, exported and important struct fields should be documented. These fields track latency metrics for the latency-aware circuit breaking feature and should be documented to help developers understand their purpose.

Suggested change
timeoutUs int64
noLoadLatencyUs int64
// timeoutUs is the configured timeout in microseconds used as an upper bound when evaluating latency.
timeoutUs int64
// noLoadLatencyUs records the baseline request latency in microseconds under low-load conditions.
noLoadLatencyUs int64
// currentLatencyUs records the recent observed request latency in microseconds for latency-aware breaking.
currentLatencyUs int64

Comment on lines 16 to 27
window = time.Second * 10
buckets = 40
forcePassDuration = time.Second
k = 1.5
minK = 1.1
protection = 5
latencyActivationMultiplier = 3
latencyCeilingRatio = 0.95
latencyBaselineDecayFactor = 4
latencyBaselineRiseFactor = 100
latencyCurrentFactor = 4
latencyMaxDropRatio = 0.3

Copilot AI Jan 8, 2026


The new constants related to latency tracking (latencyActivationMultiplier, latencyCeilingRatio, latencyBaselineDecayFactor, latencyBaselineRiseFactor, latencyCurrentFactor, latencyMaxDropRatio) lack inline comments explaining their purpose and how they affect the latency-aware circuit breaking behavior. These are magic numbers that significantly impact the breaker's behavior. Adding comments would improve maintainability and help developers understand the algorithm's tuning parameters.

Suggested change
window = time.Second * 10
buckets = 40
forcePassDuration = time.Second
k = 1.5
minK = 1.1
protection = 5
latencyActivationMultiplier = 3
latencyCeilingRatio = 0.95
latencyBaselineDecayFactor = 4
latencyBaselineRiseFactor = 100
latencyCurrentFactor = 4
latencyMaxDropRatio = 0.3
window = time.Second * 10
buckets = 40
forcePassDuration = time.Second
k = 1.5
minK = 1.1
protection = 5
// latencyActivationMultiplier controls how many times higher than the
// no-load latency the observed latency must be before latency-aware
// circuit breaking starts to take effect.
latencyActivationMultiplier = 3
// latencyCeilingRatio is the upper bound ratio applied when updating the
// baseline latency estimate to prevent it from growing without limit.
latencyCeilingRatio = 0.95
// latencyBaselineDecayFactor controls how fast the baseline latency
// decreases when the system recovers; larger values mean slower decay.
latencyBaselineDecayFactor = 4
// latencyBaselineRiseFactor controls how fast the baseline latency
// increases when sustained higher latencies are observed; larger values
// mean faster convergence to a new, higher baseline.
latencyBaselineRiseFactor = 100
// latencyCurrentFactor is the weight of the most recent latency
// measurement when smoothing the current latency estimate.
latencyCurrentFactor = 4
// latencyMaxDropRatio limits how much the allowed latency can drop in a
// single adjustment step, avoiding overly aggressive reductions.
latencyMaxDropRatio = 0.3

Signed-off-by: kevin <wanjunfeng@gmail.com>
@kevwan kevwan force-pushed the feat/instance-breaker branch from 47df2e4 to 63693d6 Compare January 9, 2026 15:55