feat: improve breaker by combining error rate and latency #5366

Open
kevwan wants to merge 1 commit into zeromicro:master from kevwan:feat/instance-breaker
Conversation


@kevwan kevwan commented Jan 8, 2026

Overview

This PR enhances the Google SRE circuit breaker with latency-aware rejection, combining traditional error-based breaking with adaptive latency thresholds to protect services from performance degradation.

Motivation

Traditional circuit breakers only track error rates, missing scenarios where services become slow but don't fail. This can lead to:

  • Cascading slowdowns as slow instances consume resources
  • Timeout exhaustion before circuit opens
  • Resource starvation from hanging connections

Technical Implementation

1. Latency Tracking Architecture

The circuit breaker now tracks two latency metrics using exponential moving averages (EMA):

Baseline Latency (No-Load Latency)

  • Represents the service's optimal performance when healthy
  • Fast Decay (factor: 4): Quickly adapts to performance improvements
  • Slow Rise (factor: 100): Resistant to temporary spikes
  • Formula:
    • Decay: baseline = (latency + 3 * baseline) / 4 when latency < baseline
    • Rise: baseline = (latency + 99 * baseline) / 100 when latency > baseline

Current Latency

  • Tracks recent average latency using EMA
  • Factor: 4 - balances responsiveness with stability
  • Formula: current = (latency + 3 * current) / 4
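As a sketch, the two EMA updates above can be written in Go using the factor values from this PR's description (function and variable names here are illustrative, not the PR's actual code):

```go
package main

import "fmt"

const (
	baselineDecayFactor = 4   // fast decay when latency drops below baseline
	baselineRiseFactor  = 100 // slow rise when latency climbs above baseline
	currentFactor       = 4   // smoothing factor for recent latency
)

// updateBaseline applies the asymmetric EMA: quick to trust
// improvements, slow to accept degradation as the new normal.
func updateBaseline(baseline, latencyUs int64) int64 {
	if latencyUs < baseline {
		return (latencyUs + (baselineDecayFactor-1)*baseline) / baselineDecayFactor
	}
	return (latencyUs + (baselineRiseFactor-1)*baseline) / baselineRiseFactor
}

// updateCurrent smooths the recent latency with a single factor.
func updateCurrent(current, latencyUs int64) int64 {
	return (latencyUs + (currentFactor-1)*current) / currentFactor
}

func main() {
	baseline, current := int64(5000), int64(5000) // 5ms in microseconds
	for i := 0; i < 5; i++ {
		spike := int64(140000) // sudden 140ms latency
		baseline = updateBaseline(baseline, spike)
		current = updateCurrent(current, spike)
	}
	// baseline barely moves; current converges toward the spike quickly
	fmt.Println(baseline, current)
}
```

After five spiked samples the baseline has risen only slightly while the current latency has moved most of the way to the spike, which is exactly the asymmetry the description claims.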

2. Rejection Decision Logic

// Calculate both error and latency ratios
errorRatio := calcK(history)
latencyRatio := calcLatencyRatio()
dropRatio := max(errorRatio, latencyRatio)

// Latency ratio calculation
threshold := baseline * 3              // Activation at 3x baseline
ceiling := timeout * 0.95              // Cap at 95% of timeout
ratio := (current - threshold) / (ceiling - threshold) * 0.3  // Max 30% drop

Key Parameters:

  • Activation Multiplier: 3x baseline (triggered when current latency significantly exceeds normal)
  • Ceiling Ratio: 0.95 (95% of timeout value)
  • Max Drop Ratio: 0.3 (latency can contribute up to 30% rejection probability)
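A hedged sketch of how these three parameters combine, including the clamping to the 30% cap that the parameters imply (names are illustrative; the actual implementation is in core/breaker/googlebreaker.go):

```go
package main

import "fmt"

const (
	activationMultiplier = 3.0  // latency breaking kicks in at 3x baseline
	ceilingRatio         = 0.95 // scale tops out at 95% of the timeout
	maxDropRatio         = 0.3  // latency contributes at most 30% rejection
)

// latencyDropRatio maps current latency onto [0, maxDropRatio],
// scaling linearly between 3x baseline and 95% of the timeout.
// All inputs are in microseconds.
func latencyDropRatio(baselineUs, currentUs, timeoutUs float64) float64 {
	if timeoutUs <= 0 || baselineUs <= 0 {
		return 0 // latency tracking disabled, or no baseline yet
	}
	threshold := baselineUs * activationMultiplier
	ceiling := timeoutUs * ceilingRatio
	if ceiling <= threshold || currentUs <= threshold {
		return 0 // degenerate range or latency still within normal bounds
	}
	ratio := (currentUs - threshold) / (ceiling - threshold) * maxDropRatio
	if ratio > maxDropRatio {
		return maxDropRatio
	}
	return ratio
}

func main() {
	// baseline 1ms, current 50ms, timeout 100ms:
	// threshold = 3000us, ceiling = 95000us
	fmt.Println(latencyDropRatio(1000, 50000, 100000))
}
```

The final drop probability would then be the max of this value and the error-based ratio, so latency alone can never reject more than 30% of traffic.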

3. Comprehensive Testing

Added 10 new test scenarios covering:

  • ✅ Latency tracking disabled when timeout=0
  • ✅ Baseline initialization and decay/rise behavior
  • ✅ Current latency EMA calculations
  • ✅ Latency ratio calculations with various thresholds
  • ✅ Edge cases (negative values, ceiling < threshold)
  • ✅ End-to-end latency tracking behavior
  • ✅ Rejection behavior under high latency

Test Coverage: Improved from 98.1% → 99.0%

4. Real-World Simulation Results

Created comprehensive simulation (adhoc/breaker-sim/) demonstrating behavior:

Scenario: Sudden Latency Spike (5ms → 140ms)

  • Phase 1 (Normal): 50 requests, 0 rejected
  • Phase 2 (Spike): 100 requests, 10 rejected (10%), peak 30% in some windows
  • Phase 3 (Recovery): 50 requests, 0 rejected

Key Finding: The breaker effectively detects sudden performance degradation with 10-30% rejection rate, while allowing gradual changes to adapt (0% rejection for slow ramps).

Benefits

  1. Proactive Protection: Rejects requests before timeouts occur
  2. Resource Efficiency: Prevents slow instances from hogging connections
  3. Graceful Degradation: Partial rejection (max 30%) allows system recovery
  4. No False Positives: Adaptive baseline prevents breaking during normal load increases
  5. Complements Error Breaking: Combined error + latency provides comprehensive protection

Configuration

Latency-aware breaking is opt-in via timeout parameter:

// Enable latency-aware breaking with 100ms timeout
breaker := breaker.NewBreaker(breaker.WithTimeout(100 * time.Millisecond))

// Traditional error-only breaking (default)
breaker := breaker.NewBreaker()

Backward Compatibility

✅ Fully backward compatible:

  • Default behavior unchanged (no timeout = no latency tracking)
  • Existing breakers continue working without modification
  • Opt-in activation via WithTimeout option

Related

  • Simulation code: adhoc/breaker-sim/
  • Detailed results: adhoc/breaker-sim/RESULTS.md
  • Technical constants defined in core/breaker/googlebreaker.go:15-27

Copilot AI review requested due to automatic review settings January 8, 2026 15:48

codecov bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



Copilot AI left a comment

Pull request overview

This pull request enhances the Google Breaker circuit breaker implementation by adding latency-aware request rejection alongside the existing error-rate-based mechanism. The breaker now tracks both baseline (no-load) and current latencies using exponential moving averages and combines error ratio with latency ratio to determine the final drop probability.

Key changes:

  • Adds latency tracking with baseline and current latency metrics updated on successful requests
  • Combines error-based and latency-based drop ratios using max() to determine rejection probability
  • Introduces a WithTimeout option to enable latency-aware circuit breaking

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
core/breaker/googlebreaker.go Implements latency tracking fields and algorithms; adds calcLatencyRatio, updateBaselineLatency, updateCurrentLatency methods; modifies markSuccess to accept latency parameter
core/breaker/breaker.go Adds timeout field to circuitBreaker struct and WithTimeout option function; passes timeout to newGoogleBreaker constructor
core/breaker/googlebreaker_test.go Adds comprehensive test coverage for latency tracking, ratio calculations, and combined error/latency scenarios; updates existing tests to pass 0 latency parameter
core/breaker/breaker_test.go Adds tests for WithTimeout option and nopPromise coverage

Comment on lines +206 to +210
func WithTimeout(timeout time.Duration) Option {
return func(b *circuitBreaker) {
b.timeout = timeout
}
}

Copilot AI Jan 8, 2026


The WithTimeout option function lacks a documentation comment. According to go-zero guidelines, exported functions should be documented. This option is part of the public API and users need to understand that it enables latency-aware circuit breaking.

Comment on lines 199 to 227
func (b *googleBreaker) updateBaselineLatency(latencyUs int64) {
noLoadLatency := atomic.LoadInt64(&b.noLoadLatencyUs)
if noLoadLatency <= 0 {
atomic.StoreInt64(&b.noLoadLatencyUs, latencyUs)
return
}

var newBaseline int64
if latencyUs < noLoadLatency {
// Fast decay when latency decreases
newBaseline = (latencyUs + (latencyBaselineDecayFactor-1)*noLoadLatency) / latencyBaselineDecayFactor
} else {
// Slow rise when latency increases
newBaseline = (latencyUs + (latencyBaselineRiseFactor-1)*noLoadLatency) / latencyBaselineRiseFactor
}

atomic.StoreInt64(&b.noLoadLatencyUs, newBaseline)
}

func (b *googleBreaker) updateCurrentLatency(latencyUs int64) {
currentLatency := atomic.LoadInt64(&b.currentLatencyUs)
if currentLatency <= 0 {
atomic.StoreInt64(&b.currentLatencyUs, latencyUs)
return
}

// Exponential moving average: newCurrent = (latencyUs + (factor-1) * oldCurrent) / factor
newCurrent := (latencyUs + (latencyCurrentFactor-1)*currentLatency) / latencyCurrentFactor
atomic.StoreInt64(&b.currentLatencyUs, newCurrent)
}

Copilot AI Jan 8, 2026


There's a potential race condition in updateBaselineLatency and updateCurrentLatency. Between the Load and Store operations, another goroutine could modify the value, causing the calculation to be based on stale data. This is a classic read-modify-write race condition. Consider using atomic.CompareAndSwapInt64 in a loop to ensure atomic read-modify-write, or use a mutex to protect the entire operation.
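A minimal sketch of the CAS-loop approach the reviewer suggests (hypothetical helper, not the PR's code): the loop re-reads the value and retries whenever another goroutine wrote between the load and the swap, so no update is computed from stale data.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// updateEMA performs a race-free read-modify-write of an EMA value
// stored at addr, retrying with CompareAndSwap if another goroutine
// modified the value after our Load.
func updateEMA(addr *int64, latencyUs, factor int64) {
	for {
		old := atomic.LoadInt64(addr)
		if old <= 0 {
			// First sample: seed the EMA with the observed latency.
			if atomic.CompareAndSwapInt64(addr, old, latencyUs) {
				return
			}
			continue // lost the race; retry with the fresh value
		}
		updated := (latencyUs + (factor-1)*old) / factor
		if atomic.CompareAndSwapInt64(addr, old, updated) {
			return
		}
	}
}

func main() {
	var current int64
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				updateEMA(&current, 1000, 4)
			}
		}()
	}
	wg.Wait()
	fmt.Println(atomic.LoadInt64(&current)) // prints 1000
}
```

In practice the Load/Store race here only skews the smoothed estimate slightly rather than corrupting state, so a mutex or CAS loop is a correctness refinement, not a crash fix.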


// Now simulate high latency that exceeds threshold
// threshold = 1000 * 3 = 3000us
// ceiling = 100000 * 0.8 = 80000us

Copilot AI Jan 8, 2026


The comment states "ceiling = 100000 * 0.8 = 80000us" but the actual constant used in the code is latencyCeilingRatio = 0.95, not 0.8. This means the ceiling should be "100000 * 0.95 = 95000us". The comment should be corrected to match the actual implementation.

Suggested change
// ceiling = 100000 * 0.8 = 80000us
// ceiling = 100000 * 0.95 = 95000us

Comment on lines +38 to +39
timeoutUs int64
noLoadLatencyUs int64

Copilot AI Jan 8, 2026


The new fields in the googleBreaker struct (timeoutUs, noLoadLatencyUs, currentLatencyUs) lack documentation comments explaining their purpose. According to go-zero coding guidelines, exported and important struct fields should be documented. These fields track latency metrics for the latency-aware circuit breaking feature and should be documented to help developers understand their purpose.

Suggested change
timeoutUs int64
noLoadLatencyUs int64
// timeoutUs is the configured timeout in microseconds used as an upper bound when evaluating latency.
timeoutUs int64
// noLoadLatencyUs records the baseline request latency in microseconds under low-load conditions.
noLoadLatencyUs int64
// currentLatencyUs records the recent observed request latency in microseconds for latency-aware breaking.
currentLatencyUs int64

Comment on lines 16 to 27
window = time.Second * 10
buckets = 40
forcePassDuration = time.Second
k = 1.5
minK = 1.1
protection = 5
latencyActivationMultiplier = 3
latencyCeilingRatio = 0.95
latencyBaselineDecayFactor = 4
latencyBaselineRiseFactor = 100
latencyCurrentFactor = 4
latencyMaxDropRatio = 0.3

Copilot AI Jan 8, 2026


The new constants related to latency tracking (latencyActivationMultiplier, latencyCeilingRatio, latencyBaselineDecayFactor, latencyBaselineRiseFactor, latencyCurrentFactor, latencyMaxDropRatio) lack inline comments explaining their purpose and how they affect the latency-aware circuit breaking behavior. These are magic numbers that significantly impact the breaker's behavior. Adding comments would improve maintainability and help developers understand the algorithm's tuning parameters.

Suggested change
window = time.Second * 10
buckets = 40
forcePassDuration = time.Second
k = 1.5
minK = 1.1
protection = 5
latencyActivationMultiplier = 3
latencyCeilingRatio = 0.95
latencyBaselineDecayFactor = 4
latencyBaselineRiseFactor = 100
latencyCurrentFactor = 4
latencyMaxDropRatio = 0.3
window = time.Second * 10
buckets = 40
forcePassDuration = time.Second
k = 1.5
minK = 1.1
protection = 5
// latencyActivationMultiplier controls how many times higher than the
// no-load latency the observed latency must be before latency-aware
// circuit breaking starts to take effect.
latencyActivationMultiplier = 3
// latencyCeilingRatio is the upper bound ratio applied when updating the
// baseline latency estimate to prevent it from growing without limit.
latencyCeilingRatio = 0.95
// latencyBaselineDecayFactor controls how fast the baseline latency
// decreases when the system recovers; larger values mean slower decay.
latencyBaselineDecayFactor = 4
// latencyBaselineRiseFactor controls how fast the baseline latency
// increases when sustained higher latencies are observed; larger values
// mean faster convergence to a new, higher baseline.
latencyBaselineRiseFactor = 100
// latencyCurrentFactor is the weight of the most recent latency
// measurement when smoothing the current latency estimate.
latencyCurrentFactor = 4
// latencyMaxDropRatio limits how much the allowed latency can drop in a
// single adjustment step, avoiding overly aggressive reductions.
latencyMaxDropRatio = 0.3

Signed-off-by: kevin <wanjunfeng@gmail.com>
@kevwan kevwan force-pushed the feat/instance-breaker branch from 47df2e4 to 63693d6 Compare January 9, 2026 15:55