feat: improve breaker by combining error rate and latency #5366
kevwan wants to merge 1 commit into zeromicro:master
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.
Pull request overview
This pull request enhances the Google Breaker circuit breaker implementation by adding latency-aware request rejection alongside the existing error-rate-based mechanism. The breaker now tracks both baseline (no-load) and current latencies using exponential moving averages and combines error ratio with latency ratio to determine the final drop probability.
Key changes:
- Adds latency tracking with baseline and current latency metrics updated on successful requests
- Combines error-based and latency-based drop ratios using max() to determine rejection probability
- Introduces a WithTimeout option to enable latency-aware circuit breaking
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| core/breaker/googlebreaker.go | Implements latency tracking fields and algorithms; adds calcLatencyRatio, updateBaselineLatency, updateCurrentLatency methods; modifies markSuccess to accept latency parameter |
| core/breaker/breaker.go | Adds timeout field to circuitBreaker struct and WithTimeout option function; passes timeout to newGoogleBreaker constructor |
| core/breaker/googlebreaker_test.go | Adds comprehensive test coverage for latency tracking, ratio calculations, and combined error/latency scenarios; updates existing tests to pass 0 latency parameter |
| core/breaker/breaker_test.go | Adds tests for WithTimeout option and nopPromise coverage |
```go
func WithTimeout(timeout time.Duration) Option {
	return func(b *circuitBreaker) {
		b.timeout = timeout
	}
}
```
The WithTimeout option function lacks a documentation comment. According to go-zero guidelines, exported functions should be documented. This option is part of the public API and users need to understand that it enables latency-aware circuit breaking.
```go
b.stat.Reduce(func(b *bucket) {
	result.accepts += b.Success
	result.total += b.Sum
	if b.Failure > 0 {
		result.workingBuckets = 0
	} else if b.Success > 0 {
		result.workingBuckets++
	}
	if b.Success > 0 {
		result.failingBuckets = 0
	} else if b.Failure > 0 {
		result.failingBuckets++
	}
})

return result
```

```go
func (b *googleBreaker) updateBaselineLatency(latencyUs int64) {
	noLoadLatency := atomic.LoadInt64(&b.noLoadLatencyUs)
	if noLoadLatency <= 0 {
		atomic.StoreInt64(&b.noLoadLatencyUs, latencyUs)
		return
	}

	var newBaseline int64
	if latencyUs < noLoadLatency {
		// Fast decay when latency decreases
		newBaseline = (latencyUs + (latencyBaselineDecayFactor-1)*noLoadLatency) / latencyBaselineDecayFactor
	} else {
		// Slow rise when latency increases
		newBaseline = (latencyUs + (latencyBaselineRiseFactor-1)*noLoadLatency) / latencyBaselineRiseFactor
	}

	atomic.StoreInt64(&b.noLoadLatencyUs, newBaseline)
}

func (b *googleBreaker) updateCurrentLatency(latencyUs int64) {
	currentLatency := atomic.LoadInt64(&b.currentLatencyUs)
	if currentLatency <= 0 {
		atomic.StoreInt64(&b.currentLatencyUs, latencyUs)
		return
	}

	// Exponential moving average: newCurrent = (latencyUs + (factor-1) * oldCurrent) / factor
	newCurrent := (latencyUs + (latencyCurrentFactor-1)*currentLatency) / latencyCurrentFactor
	atomic.StoreInt64(&b.currentLatencyUs, newCurrent)
}
```
There's a potential race condition in updateBaselineLatency and updateCurrentLatency. Between the Load and Store operations, another goroutine could modify the value, causing the calculation to be based on stale data. This is a classic read-modify-write race condition. Consider using atomic.CompareAndSwapInt64 in a loop to ensure atomic read-modify-write, or use a mutex to protect the entire operation.
```go
// Now simulate high latency that exceeds threshold
// threshold = 1000 * 3 = 3000us
// ceiling = 100000 * 0.8 = 80000us
```
The comment states "ceiling = 100000 * 0.8 = 80000us" but the actual constant used in the code is latencyCeilingRatio = 0.95, not 0.8. This means the ceiling should be "100000 * 0.95 = 95000us". The comment should be corrected to match the actual implementation.
```diff
- // ceiling = 100000 * 0.8 = 80000us
+ // ceiling = 100000 * 0.95 = 95000us
```
```go
timeoutUs       int64
noLoadLatencyUs int64
```
The new fields in the googleBreaker struct (timeoutUs, noLoadLatencyUs, currentLatencyUs) lack documentation comments explaining their purpose. According to go-zero coding guidelines, exported and important struct fields should be documented. These fields track latency metrics for the latency-aware circuit breaking feature and should be documented to help developers understand their purpose.
```diff
- timeoutUs       int64
- noLoadLatencyUs int64
+ // timeoutUs is the configured timeout in microseconds used as an upper bound when evaluating latency.
+ timeoutUs int64
+ // noLoadLatencyUs records the baseline request latency in microseconds under low-load conditions.
+ noLoadLatencyUs int64
+ // currentLatencyUs records the recent observed request latency in microseconds for latency-aware breaking.
+ currentLatencyUs int64
```
```go
window                      = time.Second * 10
buckets                     = 40
forcePassDuration           = time.Second
k                           = 1.5
minK                        = 1.1
protection                  = 5
latencyActivationMultiplier = 3
latencyCeilingRatio         = 0.95
latencyBaselineDecayFactor  = 4
latencyBaselineRiseFactor   = 100
latencyCurrentFactor        = 4
latencyMaxDropRatio         = 0.3
```
The new constants related to latency tracking (latencyActivationMultiplier, latencyCeilingRatio, latencyBaselineDecayFactor, latencyBaselineRiseFactor, latencyCurrentFactor, latencyMaxDropRatio) lack inline comments explaining their purpose and how they affect the latency-aware circuit breaking behavior. These are magic numbers that significantly impact the breaker's behavior. Adding comments would improve maintainability and help developers understand the algorithm's tuning parameters.
```go
window            = time.Second * 10
buckets           = 40
forcePassDuration = time.Second
k                 = 1.5
minK              = 1.1
protection        = 5
// latencyActivationMultiplier controls how many times higher than the
// no-load latency the observed latency must be before latency-aware
// circuit breaking starts to take effect.
latencyActivationMultiplier = 3
// latencyCeilingRatio is the upper bound ratio applied when updating the
// baseline latency estimate to prevent it from growing without limit.
latencyCeilingRatio = 0.95
// latencyBaselineDecayFactor controls how fast the baseline latency
// decreases when the system recovers; larger values mean slower decay.
latencyBaselineDecayFactor = 4
// latencyBaselineRiseFactor controls how fast the baseline latency
// increases when sustained higher latencies are observed; larger values
// mean faster convergence to a new, higher baseline.
latencyBaselineRiseFactor = 100
// latencyCurrentFactor is the weight of the most recent latency
// measurement when smoothing the current latency estimate.
latencyCurrentFactor = 4
// latencyMaxDropRatio limits how much the allowed latency can drop in a
// single adjustment step, avoiding overly aggressive reductions.
latencyMaxDropRatio = 0.3
```
Signed-off-by: kevin <wanjunfeng@gmail.com>
Overview
This PR enhances the Google SRE circuit breaker with latency-aware rejection, combining traditional error-based breaking with adaptive latency thresholds to protect services from performance degradation.
Motivation
Traditional circuit breakers only track error rates, missing scenarios where services become slow but don't fail, so degraded services keep receiving full traffic.
Technical Implementation
1. Latency Tracking Architecture
The circuit breaker now tracks two latency metrics using exponential moving averages (EMA):
Baseline Latency (No-Load Latency)
- `baseline = (latency + 3 * baseline) / 4` when `latency < baseline` (fast decay)
- `baseline = (latency + 99 * baseline) / 100` when `latency > baseline` (slow rise)

Current Latency
- `current = (latency + 3 * current) / 4`

2. Rejection Decision Logic
Key Parameters:
3. Comprehensive Testing
Added 10 new test scenarios covering latency tracking, ratio calculations, and combined error/latency behavior.
Test Coverage: Improved from 98.1% → 99.0%
4. Real-World Simulation Results
Created a comprehensive simulation (`adhoc/breaker-sim/`) demonstrating behavior.

Scenario: Sudden Latency Spike (5ms → 140ms)
Key Finding: The breaker effectively detects sudden performance degradation with 10-30% rejection rate, while allowing gradual changes to adapt (0% rejection for slow ramps).
Benefits
Configuration
Latency-aware breaking is opt-in via the `WithTimeout` option:
Backward Compatibility
✅ Fully backward compatible: latency-aware breaking only activates when the `WithTimeout` option is set.

Related
- adhoc/breaker-sim/
- adhoc/breaker-sim/RESULTS.md
- core/breaker/googlebreaker.go:15-27