Conversation
The alert store workload is read-heavy, so use RWMutex to allow reads to proceed in parallel.

Store usage analysis:

provider/mem/mem.go:
  Reads:
  - List() x 4 (Subscribe, SlurpAndSubscribe, GetPending, countByState)
  - Get() x 2
  Writes:
  - Set() x 1
  - GC() via gcLoop

dispatch/dispatch.go (aggrGroup):
  Reads:
  - List() x 2
  - Get() x 1
  - Empty() x 1
  Writes:
  - Set() x 1
  - DeleteIfNotModified() x 1

inhibit/inhibit.go:
  Reads:
  - Get() x 2
  Writes:
  - Set() x 1

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
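A minimal sketch of the kind of change this describes, using a simplified stand-in for the alert store (the type, field names, and return values here are illustrative and may differ from the real store): read paths take the shared read lock, write paths keep the exclusive lock.

// Illustrative sketch only: a simplified alert store switching from
// sync.Mutex to sync.RWMutex. Read paths use RLock so they can run in
// parallel; write paths still take the exclusive lock.
package store

import (
	"sync"

	"github.com/prometheus/alertmanager/types"
	"github.com/prometheus/common/model"
)

type alertStore struct {
	mtx    sync.RWMutex // previously sync.Mutex
	alerts map[model.Fingerprint]*types.Alert
}

// Get is a read path: RLock allows many concurrent readers.
func (s *alertStore) Get(fp model.Fingerprint) (*types.Alert, bool) {
	s.mtx.RLock()
	defer s.mtx.RUnlock()
	a, ok := s.alerts[fp]
	return a, ok
}

// List is also a read path: it snapshots the alerts under RLock.
func (s *alertStore) List() []*types.Alert {
	s.mtx.RLock()
	defer s.mtx.RUnlock()
	out := make([]*types.Alert, 0, len(s.alerts))
	for _, a := range s.alerts {
		out = append(out, a)
	}
	return out
}

// Set is a write path: it still requires the exclusive lock.
func (s *alertStore) Set(a *types.Alert) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.alerts[a.Fingerprint()] = a
}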
Spaceman1701 left a comment
My intuition is that this will improve performance, but I'm not sure. RWMutex can be slower when there's a lot of write contention, which could happen in the alert ingestion flow.
Do you have any benchmarks or profiles of this change?
Generated some benchmarks:

goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/store
cpu: Apple M3 Pro
│ mutex.txt │ rwmutex.txt │
│ sec/op │ sec/op vs base │
Get-12 5.865n ± 1% 5.565n ± 1% -5.12% (p=0.001 n=7)
Set-12 102.2n ± 2% 103.6n ± 1% +1.37% (p=0.026 n=7)
List/100_alerts-12 172.0n ± 3% 169.6n ± 4% ~ (p=0.090 n=7)
List/1000_alerts-12 174.2n ± 5% 170.5n ± 5% ~ (p=0.710 n=7)
List/10000_alerts-12 173.1n ± 7% 172.6n ± 5% ~ (p=0.557 n=7)
ConcurrentReads-12 76.64n ± 1% 130.30n ± 40% +70.02% (p=0.001 n=7)
ConcurrentWrites-12 258.9n ± 4% 251.6n ± 10% -2.82% (p=0.026 n=7)
ConcurrentList/100_alerts-12 286.3n ± 1% 146.0n ± 27% -49.00% (p=0.001 n=7)
ConcurrentList/1000_alerts-12 286.0n ± 1% 199.4n ± 28% -30.28% (p=0.001 n=7)
MixedReadWrite/50pct_reads-12 169.6n ± 2% 147.9n ± 3% -12.79% (p=0.001 n=7)
MixedReadWrite/90pct_reads-12 92.64n ± 1% 57.49n ± 2% -37.94% (p=0.001 n=7)
MixedReadWrite/99pct_reads-12 78.78n ± 1% 37.83n ± 11% -51.98% (p=0.001 n=7)
geomean 116.5n 98.97n -15.05%

Interestingly, concurrent reads are 70% slower on a read-only workload.

Benchmark code:

// Benchmarks for comparing Mutex vs RWMutex performance.
package store

import (
	"fmt"
	"testing"
	"time"

	"github.com/prometheus/alertmanager/types"
	"github.com/prometheus/common/model"
)

func newBenchmarkAlert(i int) *types.Alert {
return &types.Alert{
Alert: model.Alert{
Labels: model.LabelSet{
"alertname": model.LabelValue("test"),
"instance": model.LabelValue(string(rune('a' + i%26))),
},
},
UpdatedAt: time.Now(),
}
}
// BenchmarkGet measures single-threaded Get performance.
func BenchmarkGet(b *testing.B) {
s := NewAlerts()
// Pre-populate with alerts.
for i := range 1000 {
_ = s.Set(newBenchmarkAlert(i))
}
alert := newBenchmarkAlert(500)
fp := alert.Fingerprint()
b.ResetTimer()
for range b.N {
_, _ = s.Get(fp)
}
}
// BenchmarkSet measures single-threaded Set performance.
func BenchmarkSet(b *testing.B) {
s := NewAlerts()
alert := newBenchmarkAlert(0)
b.ResetTimer()
for range b.N {
_ = s.Set(alert)
}
}
// BenchmarkList measures single-threaded List performance.
func BenchmarkList(b *testing.B) {
for _, size := range []int{100, 1000, 10000} {
b.Run(fmt.Sprintf("%d_alerts", size), func(b *testing.B) {
s := NewAlerts()
for i := range size {
_ = s.Set(newBenchmarkAlert(i))
}
b.ResetTimer()
for range b.N {
_ = s.List()
}
})
}
}
// BenchmarkConcurrentReads measures concurrent read performance.
// This is where RWMutex should outperform Mutex.
func BenchmarkConcurrentReads(b *testing.B) {
s := NewAlerts()
for i := range 1000 {
_ = s.Set(newBenchmarkAlert(i))
}
alert := newBenchmarkAlert(500)
fp := alert.Fingerprint()
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
_, _ = s.Get(fp)
}
})
}
// BenchmarkConcurrentWrites measures concurrent write performance.
func BenchmarkConcurrentWrites(b *testing.B) {
s := NewAlerts()
alerts := make([]*types.Alert, 100)
for i := range alerts {
alerts[i] = newBenchmarkAlert(i)
}
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
i := 0
for pb.Next() {
_ = s.Set(alerts[i%len(alerts)])
i++
}
})
}
// BenchmarkConcurrentList measures concurrent List performance.
func BenchmarkConcurrentList(b *testing.B) {
for _, size := range []int{100, 1000} {
b.Run(fmt.Sprintf("%d_alerts", size), func(b *testing.B) {
s := NewAlerts()
for i := range size {
_ = s.Set(newBenchmarkAlert(i))
}
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
_ = s.List()
}
})
})
}
}
// BenchmarkMixedReadWrite simulates realistic workloads with different read/write ratios.
func BenchmarkMixedReadWrite(b *testing.B) {
for _, readPct := range []int{50, 90, 99} {
b.Run(fmt.Sprintf("%dpct_reads", readPct), func(b *testing.B) {
s := NewAlerts()
for i := range 1000 {
_ = s.Set(newBenchmarkAlert(i))
}
alert := newBenchmarkAlert(500)
fp := alert.Fingerprint()
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
i := 0
for pb.Next() {
if i%100 < readPct {
_, _ = s.Get(fp)
} else {
_ = s.Set(alert)
}
i++
}
})
})
}
}
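For reference, a comparison table like the one above is the kind of output benchstat produces. Assuming the benchmarks above live in the store package (matching the pkg: line in the output), it could be reproduced roughly as follows; the -count=7 flag corresponds to the n=7 column:

# on the baseline (sync.Mutex) code
go test -bench=. -count=7 ./store > mutex.txt
# on the branch (sync.RWMutex)
go test -bench=. -count=7 ./store > rwmutex.txt
# compare the two runs
benchstat mutex.txt rwmutex.txt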
After further analysis, it seems we can switch.
That's an interesting result!

As far as how to move forward here, I don't know for sure. My concern is that we could accidentally introduce a performance regression without realizing it. I think we need some "real world" data before we can be confident that this helps more than it hurts.
I can test this in a local lab environment next week, and in production with the canary instance next year, and compare performance.
That'd make me feel a lot more comfortable with the change. Thanks!