
[KEP-5866] Server-side sharded list and watch #5867

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from Jefftree:shard
Feb 11, 2026

Conversation

Member

@Jefftree Jefftree commented Feb 1, 2026

  • One-line PR description: Shardable Watch
  • Other comments:

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery label Feb 1, 2026
@k8s-ci-robot k8s-ci-robot requested review from jpbetz and sttts February 1, 2026 22:41
@k8s-ci-robot k8s-ci-robot added kind/kep, cncf-cla: yes, and size/XL labels Feb 1, 2026
Member

xigang commented Feb 2, 2026

/cc

@k8s-ci-robot k8s-ci-robot requested a review from xigang February 2, 2026 04:10
Member

wojtek-t commented Feb 2, 2026

/assign

@Jefftree Jefftree changed the title [KEP-5866] Shardable Watch [KEP-5867] Shardable Watch Feb 3, 2026
@Jefftree Jefftree changed the title [KEP-5867] Shardable Watch [KEP-5866] Shardable Watch Feb 3, 2026
Contributor

jpbetz commented Feb 3, 2026

/approve
For PRR (alpha) only. We still need to do KEP review.

@k8s-ci-robot k8s-ci-robot added the approved label Feb 3, 2026
Contributor

jpbetz commented Feb 3, 2026

@deads2k @sttts

@Jefftree Jefftree changed the title [KEP-5866] Shardable Watch [KEP-5866] Server-side sharded watch Feb 4, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jefftree, jpbetz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Jefftree Jefftree force-pushed the shard branch 4 times, most recently from 14bc5ff to 2ea540b on February 4, 2026 18:51
Comment on lines 536 to 537
sharding. Given the motivation is optimization, receiving the full stream is a safe fallback
(client-side filtering should remain in place or be conditionally disabled).
Member

As a sharded client, getting things outside the shard I requested is a really sharp edge. Is there a more deterministic way to recognize immediately that a server doesn't support sharding and client-side filtering is needed? Are we expecting to bake in a uid and namespace hash-based filter implementation that matches the server into the client library?

Member Author

Hm, the client could query the OpenAPI endpoint to see if the new parameter exists. They can gate based on server version, but there is always the risk of the feature being turned off. Many projects already do client-side filtering, and I'd imagine just exposing our hash algorithm and comparison logic in a common library (e.g., apimachinery) is enough, without being attached to a specific field.
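A minimal sketch of such a client-side fallback, purely for illustration; the actual hash inputs, the algorithm, and the shape of any shared apimachinery helper are still open questions in this thread. It assumes a 32-bit FNV-1a hash over metadata.uid and a client-chosen half-open range on the ring.

package shardfallback

import "hash/fnv"

// ringPosition maps an object's metadata.uid onto a 32-bit hash ring.
// FNV-1a and uid-as-input are assumptions for this sketch, not something
// decided by this KEP discussion.
func ringPosition(uid string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(uid))
    return h.Sum32()
}

// inShard reports whether an object falls into the half-open range [lo, hi)
// this replica has claimed. Keeping this predicate client-side means a server
// that ignores the sharding parameter degrades to today's behavior instead of
// silently delivering objects outside the requested shard.
func inShard(uid string, lo, hi uint32) bool {
    p := ringPosition(uid)
    return p >= lo && p < hi
}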


- Wasted network bandwidth (N replicas * Full Stream).
- Wasted CPU and Memory on clients for processing irrelevant events.
- Increased lock contention on the API server side from serializing the same events for multiple watchers.
Member

nit: When the same object is sent to multiple watchers, it actually is serialized once (per serialization format) - ref https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1152-less-object-serializations
Also, I'm not sure we have lock-contention issues related to it.

Instead, the places where we actually have problems are:

  • resource consumption related to the fact that we actually need to send all that data
  • throughput at the watchcache layer ("first line" event dispatching is single-threaded and we iterate over all watchers interested in a given event to put it into the appropriate channel); so the throughput we can achieve directly depends on the number of those watchers

Member Author

Oops, I was trying to say informer lock contention and mixed up the thoughts. Sending the data is captured in the wasted network bandwidth bullet, but I guess it also includes some extra overhead on the apiserver side.

Throughput at the watch cache layer is not changed by this KEP since it's more correlated to the number of watchers right? Filtering out more events for each watcher doesn't change the total number of watchers and so that's still an unsolved problem outside the scope of this KEP?

Member

Throughput at the watch cache layer is not changed by this KEP since it's more correlated to the number of watchers right? Filtering out more events for each watcher doesn't change the total number of watchers and so that's still an unsolved problem outside the scope of this KEP?

Actually - the throughput of the watch cache layer may be slightly improved too (though I don't expect non-marginal changes unless the feature is hugely adopted).
Or, maybe it does change it - it is possible to also improve the throughput of the watchcache if we add additional indexing based on shard (which we can relatively easily index on). We have this kind of indexing for "pod.spec.nodeName", as well as for "metadata.namespace/metadata.name", and those were game changers for throughput/contention of the watchcache.

So not out of the box, but it has the potential to improve it too.

Contributor

Exactly. The goal is to define our filter operations in such a way that they could be indexed, but leave the indexing as future work that we can do on an as-needed basis.
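A rough sketch of that indexing idea, not the actual cacher code: if an object's shard can be derived from the object alone, the watchcache could bucket watchers by shard and dispatch each event only to the matching bucket, the way the existing "pod.spec.nodeName" and "metadata.namespace/metadata.name" indexes narrow dispatch today. The bucket count and the hash over the UID below are assumptions made only for illustration.

package shardindex

import "hash/fnv"

type event struct {
    uid     string
    payload []byte
}

type watcher chan event

// bucketOf derives an index key from the object's UID. totalBuckets is an
// internal fan-out knob for the index, not the client-visible shard count.
func bucketOf(uid string, totalBuckets uint32) uint32 {
    h := fnv.New32a()
    h.Write([]byte(uid))
    return h.Sum32() % totalBuckets
}

// dispatch delivers the event only to watchers registered for its bucket,
// instead of iterating over every watcher as the unindexed path must.
func dispatch(ev event, byBucket map[uint32][]watcher, totalBuckets uint32) {
    for _, w := range byBucket[bucketOf(ev.uid, totalBuckets)] {
        w <- ev
    }
}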

proposal will be implemented, this is the place to discuss them.
-->

### API Extensibility: The `selector` Parameter
Member

We can extend ListOptions to support hash-based sharding parameters.

type ListOptions struct {
    // ... existing fields
    ShardingIndex    int64   // [0, ShardingCount)
    ShardingCount    int64   // Total shards
    ShardingLabelKey string  // Label/field to hash
    HashFunc         string  // "FNV32"
}

Filtering logic:

hash(object.label[ShardingLabelKey]) % ShardingCount == ShardingIndex

Applied in SelectionPredicate at the storage layer, working for both LIST and WATCH.

Member Author

This is captured in the alternatives section. Here's the snippet for the con of this approach:
Cons: Increases the surface area of meta/v1.ListOptions with fields that are specific only to watch sharding. This introduces new combinations of parameters and increases the risk of test coverage gaps. It is also less flexible for future evolution.

IMO, fixing the partitioning to modulo (%) is also not ideal for controllers that want resharding not to move every key around, which is why we're deciding to use a consistent hash ring so clients can specify their own ranges.
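A toy comparison of the two approaches (not the KEP's actual algorithm or hash choices): growing from 3 to 4 shards reassigns roughly three quarters of the keys under modulo partitioning, while on a consistent hash ring only the keys in the arc claimed by the new member change owner. A production ring would add virtual points so each member's arc stays close to 1/N of the keyspace.

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

func h32(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}

// ringOwner returns the first member clockwise from the key's ring position.
// One point per member keeps the sketch short.
func ringOwner(key string, members []string) string {
    sorted := append([]string(nil), members...)
    sort.Slice(sorted, func(i, j int) bool { return h32(sorted[i]) < h32(sorted[j]) })
    k := h32(key)
    for _, m := range sorted {
        if h32(m) >= k {
            return m
        }
    }
    return sorted[0] // wrap around past the top of the ring
}

func main() {
    keys := make([]string, 10000)
    for i := range keys {
        keys[i] = fmt.Sprintf("uid-%d", i)
    }
    old := []string{"shard-a", "shard-b", "shard-c"}
    grown := []string{"shard-a", "shard-b", "shard-c", "shard-d"}

    movedModulo, movedRing := 0, 0
    for _, k := range keys {
        if h32(k)%3 != h32(k)%4 { // shard index changes when the modulus grows
            movedModulo++
        }
        if ringOwner(k, old) != ringOwner(k, grown) { // only shard-d's arc moves
            movedRing++
        }
    }
    fmt.Printf("modulo: %d/%d keys moved, ring: %d/%d keys moved\n",
        movedModulo, len(keys), movedRing, len(keys))
}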

However, client-side sharding has a critical limitation: it does not reduce the incoming event
volume per replica. Every replica still receives the full stream of events, paying the CPU and
network cost to deserialize everything, only to discard items not belonging to their shard.
Functionally, this makes horizontal scaling of the watch stream impossible. This results in:
Contributor

There are attempts to have sharding labels and filter by them, so it is not impossible. But of course, labels need admission or yet another controller setting them, and they leak the controller topology into the object state.

Member Author

Persisting the sharding information into the object state seems like a slippery slope because of the admission/controller requirement, and it also makes resharding much harder later; having replicas added or removed would be very disruptive. That approach is listed in the alternatives section (label-based sharding); for now we're going with the stateless approach of generating hash rings, and clients can choose their own ranges.

@Jefftree Jefftree changed the title [KEP-5866] Server-side sharded watch [KEP-5866] Server-side sharded list and watch Feb 10, 2026
Contributor

jpbetz commented Feb 11, 2026

/lgtm

It looks like all outstanding concerns have been addressed.

  • We'll decide HOW to represent the client request during API review, but we have two concrete options (multiple simple query params or a 'selector' param with a grammar).
  • Clients can check whether the server sharded the request by inspecting ListMeta.
  • Clients can check whether the server supports sharding via discovery.

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 11, 2026
@k8s-ci-robot k8s-ci-robot merged commit 8b87c71 into kubernetes:master Feb 11, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 11, 2026
significantly due to high churn. Many controllers need to scale to handle this load.

Historically, most controllers choose to scale vertically (e.g., `kube-controller-manager`), as
there is no native support for sharding or partitioning the watch stream. Some specialized controllers (e.g.,
Contributor

Note, kube-state-metrics is not a controller. It's like saying Prometheus is a controller because it watches pods and calculates things based on that.


However, client-side sharding has a critical limitation: it does not reduce the incoming event
volume per replica. Every replica still receives the full stream of events, paying the CPU and
network cost to deserialize everything, only to discard items not belonging to their shard.
Contributor

What's the estimated cost benefit we expect? It's not a question of if, but of how much you are expecting. There are many other features that are easier to implement and can provide the same network/CPU savings without complicating the architecture.


### Server Design

Currently, the Cacher broadcasts events to all watchers that match a simple Label/Field selector.
Contributor

How will sharding affect APF cost calculation? Do we assume an optimistic uniform distribution or the worst case?
