JetStream: DeliverLastPerSubject with MaxMsgsPerSubject causes stream lock-up on sparse sequence ranges #7821

@antonbarchukov

Description

Observed behavior

When creating a DeliverLastPerSubject consumer on a stream with MaxMsgsPerSubject set, if the stream has accumulated a large sequence range relative to its actual message count (common over time, since overwrites leave sequence gaps), the following happens:

  • Consumer creation pins an entire vCPU core for the duration of the initial delivery
  • All writes to the stream are blocked while the consumer is scanning
  • nats stream info and other CLI commands time out
  • The stream is effectively unusable until the consumer finishes initializing

In our case we have ~400K live messages across a ~1.5M sequence range (first seq ~20K, last seq ~1.5M). Creating a consumer takes the stream down for the entire time it takes to linearly scan that gap.

pprof confirms loadNextMsgLocked at ~50% cumulative CPU, with mapaccess2_fast64 (the map lookups inside the linear scan) right behind it at ~48%.

This happens because selectStartingSeqNo has a fast path for MaxMsgsPerSubject == 1 that skips building the lastSeqSkipList. Without the skip list, every message delivery goes through LoadNextMsg → loadNextMsgLocked, which iterates for nseq := fseq; nseq <= lseq; nseq++, doing a map lookup for every sequence number in the sparse range.
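For illustration, the scan pattern costs one map lookup per sequence number in the gap, not per live message. This is a simplified standalone sketch of that pattern (the map, struct, and function names only mirror the issue's description, not the actual memstore code):

```go
package main

import "fmt"

type msg struct {
	subj string
	seq  uint64
}

// loadNextMsg mimics the O(lseq - fseq) scan: it probes every sequence
// number in [fseq, lseq], even though most sequences were overwritten
// and no longer exist in the map.
func loadNextMsg(msgs map[uint64]*msg, fseq, lseq uint64) (*msg, int) {
	lookups := 0
	for nseq := fseq; nseq <= lseq; nseq++ {
		lookups++
		if m, ok := msgs[nseq]; ok {
			return m, lookups
		}
	}
	return nil, lookups
}

func main() {
	// One live message at the far end of a 1M-wide sequence range.
	msgs := map[uint64]*msg{1_000_000: {"orders.1", 1_000_000}}
	_, lookups := loadNextMsg(msgs, 1, 1_000_000)
	fmt.Println(lookups) // one map lookup per sequence in the gap
}
```

With a skip list of per-subject last sequences, the same delivery would touch only live sequences, which is why the cost should scale with message count rather than sequence range.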

We also found a secondary issue: when we patched the above locally to always build the skip list, delivery is fast (~140K msg/s) until the consumer reaches entries that were overwritten by concurrent producers during delivery.

Each stale entry causes getNextMsg to return ErrStoreMsgNotFound, which sends the consumer to waitForMsgs where it blocks on mch. It only wakes up when the next live message arrives in the stream, processes one skip list entry, and if that's also stale, goes back to sleep.

Delivery rate drops from ~140K msg/s to ~500 msg/s for the tail end of the initial delivery:
0-5s: ~140K msg/s (valid skip list entries)
5-19s: ~500 msg/s (stale entries, consumer sleeping between live publish signals)
19-20s: resumes and finishes

Without concurrent producers, delivery is smooth after building the skip list.
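The behavioral difference can be sketched as follows. This is a hypothetical, self-contained illustration (the helper name nextFromSkipList and the plain map store are assumptions, not the server's actual getNextMsg): instead of returning a not-found error on the first stale skip-list entry and sleeping until the next publish, the loop advances to the next entry.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("no message found")

// nextFromSkipList walks the skip list and skips entries whose sequence
// was overwritten during delivery (no longer in the store), rather than
// returning errNotFound on the first stale entry and putting the
// consumer to sleep in waitForMsgs.
func nextFromSkipList(store map[uint64]string, skipList []uint64) (string, []uint64, error) {
	for len(skipList) > 0 {
		seq := skipList[0]
		skipList = skipList[1:]
		if m, ok := store[seq]; ok {
			return m, skipList, nil
		}
		// Stale entry: continue to the next one immediately.
	}
	return "", nil, errNotFound
}

func main() {
	// Sequences 10 and 20 were overwritten after the skip list was built.
	store := map[uint64]string{30: "c"}
	msg, rest, err := nextFromSkipList(store, []uint64{10, 20, 30})
	fmt.Println(msg, len(rest), err) // c 0 <nil>
}
```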

server/consumer.go

  • Line ~5873 — selectStartingSeqNo: the mmp == 1 fast path that skips building the skip list
  • Lines 4581-4599 — getNextMsg: skip list delivery that returns ErrStoreMsgNotFound on stale entries instead of looping to next
  • Lines 5032-5048 — loopAndGatherMsgs: error handling that sends consumer to waitForMsgs on ErrStoreMsgNotFound
  • Line 5187 — waitForMsgs select block where the consumer sleeps

server/memstore.go

Expected behavior

Consumer creation should not block writes to the stream or cause CLI timeouts. Delivery of initial state should scale with message count, not sequence range. Stale skip list entries should be skipped over immediately rather than causing the consumer to sleep.

Server and client version

nats-server built from main branch

Host environment

8-core ARM64 (AWS c7g.2xlarge, Graviton 3)
In-memory store (memstore)
Stream config: MaxMsgsPerSubject=1, ~400K subjects, ~1-2M sequence range

Tested with both raw streams using MaxMsgsPerSubject and KV buckets with History=1 — same behavior

Steps to reproduce

  1. Create a stream with MaxMsgsPerSubject set (or a KV bucket with any history value)
  2. Publish to many subjects repeatedly over time so the sequence range grows well beyond the live message count. We used a script that publishes to 400K unique subjects (though, as this proved later, the subject count matters less than the sheer number of sequence numbers) to inflate the sequence range to ~1-2M while retaining ~400K live messages; the same state also arises naturally over hours/days of production use.
  3. Create a DeliverLastPerSubject consumer (or run nats kv watch) — stream locks up, writes block, CLI times out
  4. pprof CPU profile shows loadNextMsgLocked pinning a core

Related: #6687 (fixed the write path in removeSeqPerSubject but the consumer read path still has the same class of issue)

Let me know if the pprof profiles are useful; I can share them in a thread or DM.

Metadata

Labels: defect (Suspected defect such as a bug or regression)
