
[KEP-5866] Server-side sharded list and watch #5867

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from Jefftree:shard
Feb 11, 2026

Conversation

Member

@Jefftree Jefftree commented Feb 1, 2026

  • One-line PR description: Shardable Watch
  • Other comments:

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery label Feb 1, 2026
@k8s-ci-robot k8s-ci-robot requested review from jpbetz and sttts February 1, 2026 22:41
@k8s-ci-robot k8s-ci-robot added kind/kep, cncf-cla: yes, and size/XL labels Feb 1, 2026
Member

xigang commented Feb 2, 2026

/cc

@k8s-ci-robot k8s-ci-robot requested a review from xigang February 2, 2026 04:10
Member

wojtek-t commented Feb 2, 2026

/assign

@Jefftree Jefftree changed the title [KEP-5866] Shardable Watch [KEP-5867] Shardable Watch Feb 3, 2026
@Jefftree Jefftree changed the title [KEP-5867] Shardable Watch [KEP-5866] Shardable Watch Feb 3, 2026
Contributor

jpbetz commented Feb 3, 2026

/approve
For PRR (alpha) only. We still need to do KEP review.

@k8s-ci-robot k8s-ci-robot added the approved label Feb 3, 2026
Contributor

jpbetz commented Feb 3, 2026

@deads2k @sttts

@Jefftree Jefftree changed the title [KEP-5866] Shardable Watch [KEP-5866] Server-side sharded watch Feb 4, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jefftree, jpbetz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Jefftree Jefftree force-pushed the shard branch 4 times, most recently from 14bc5ff to 2ea540b on February 4, 2026 18:51
Comment on lines 536 to 537
sharding. Given the motivation is optimization, receiving the full stream is a safe fallback
(client-side filtering should remain in place or be conditionally disabled).
Member

As a sharded client, getting things outside the shard I requested is a really sharp edge. Is there a more deterministic way to recognize immediately that a server doesn't support sharding and client-side filtering is needed? Are we expecting to bake in a uid and namespace hash-based filter implementation that matches the server into the client library?

Member Author

Hm, the client could query the OpenAPI endpoint to see if the new parameter exists. They can gate based on server version, but there is always the risk of the feature being turned off. Many projects already do client-side filtering, and I'd imagine just exposing our hash algorithm and comparison logic in a common library (e.g., apimachinery) is enough, without being attached to a specific field.
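A minimal sketch of such a client-side fallback, purely for illustration; the actual hash inputs, the algorithm, and the shape of any shared apimachinery helper are still open questions in this thread. It assumes a 32-bit FNV-1a hash over metadata.uid and a client-chosen half-open range on the ring.

package shardfallback

import "hash/fnv"

// ringPosition maps an object's metadata.uid onto a 32-bit hash ring.
// FNV-1a and uid-as-input are assumptions for this sketch, not something
// decided by this KEP discussion.
func ringPosition(uid string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(uid))
    return h.Sum32()
}

// inShard reports whether an object falls into the half-open range [lo, hi)
// this replica has claimed. Keeping this predicate client-side means a server
// that ignores the sharding parameter degrades to today's behavior instead of
// silently delivering objects outside the requested shard.
func inShard(uid string, lo, hi uint32) bool {
    p := ringPosition(uid)
    return p >= lo && p < hi
}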


- Wasted network bandwidth (N replicas * Full Stream).
- Wasted CPU and Memory on clients for processing irrelevant events.
- Increased lock contention on the API server side from serializing the same events for multiple watchers.
Member

nit: When the same object is sent to multiple watchers, it actually is serialized once (per serialization format) - ref https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1152-less-object-serializations
Also, I'm not sure we have lock-contention issues related to it.

Instead, the places where we actually have problems are:

  • resource consumption related to the fact that we actually need to send all that data
  • throughput at the watchcache layer ("first line" event dispatching is single-threaded and we iterate over all watchers interested in a given event to put it into the appropriate channel); so the throughput we can achieve directly depends on the number of those watchers

Member Author

Oops, I was trying to say informer lock contention and mixed up the thoughts. Sending the data is captured in the wasted network bandwidth bullet, but I guess it also includes some extra overhead on the apiserver side.

Throughput at the watch cache layer is not changed by this KEP since it's more correlated to the number of watchers right? Filtering out more events for each watcher doesn't change the total number of watchers and so that's still an unsolved problem outside the scope of this KEP?

Member

Throughput at the watch cache layer is not changed by this KEP since it's more correlated to the number of watchers right? Filtering out more events for each watcher doesn't change the total number of watchers and so that's still an unsolved problem outside the scope of this KEP?

Actually - the throughput of the watch cache layer may be slightly improved too (though I don't expect non-marginal changes unless the feature is hugely adopted).
Or, maybe it does change it - it is possible to also improve the throughput of the watchcache if we add additional indexing based on shard (which we can relatively easily index on). We have this kind of indexing for "pod.spec.nodeName", as well as for "metadata.namespace/metadata.name", and those were game changers for throughput/contention of the watchcache.

So not out of the box, but it has the potential to improve it too.

Contributor

Exactly. The goal is to define our filter operations in such a way that they could be indexed, but leave the indexing as future work that we can do on an as-needed basis.
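A rough sketch of that indexing idea, not the actual cacher code: if an object's shard can be derived from the object alone, the watchcache could bucket watchers by shard and dispatch each event only to the matching bucket, the way the existing "pod.spec.nodeName" and "metadata.namespace/metadata.name" indexes narrow dispatch today. The bucket count and the hash over the UID below are assumptions made only for illustration.

package shardindex

import "hash/fnv"

type event struct {
    uid     string
    payload []byte
}

type watcher chan event

// bucketOf derives an index key from the object's UID. totalBuckets is an
// internal fan-out knob for the index, not the client-visible shard count.
func bucketOf(uid string, totalBuckets uint32) uint32 {
    h := fnv.New32a()
    h.Write([]byte(uid))
    return h.Sum32() % totalBuckets
}

// dispatch delivers the event only to watchers registered for its bucket,
// instead of iterating over every watcher as the unindexed path must.
func dispatch(ev event, byBucket map[uint32][]watcher, totalBuckets uint32) {
    for _, w := range byBucket[bucketOf(ev.uid, totalBuckets)] {
        w <- ev
    }
}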

proposal will be implemented, this is the place to discuss them.
-->

### API Extensibility: The `selector` Parameter
Member

We can extend ListOptions to support hash-based sharding parameters.

type ListOptions struct {
    // ... existing fields
    ShardingIndex    int64   // [0, ShardingCount)
    ShardingCount    int64   // Total shards
    ShardingLabelKey string  // Label/field to hash
    HashFunc         string  // "FNV32"
}

Filtering logic:

hash(object.label[ShardingLabelKey]) % ShardingCount == ShardingIndex

Applied in SelectionPredicate at the storage layer, working for both LIST and WATCH.

Member Author

This is captured in the alternatives section. Here's the snippet for the con of this approach:
Cons: Increases the surface area of meta/v1.ListOptions with fields that are specific only to watch sharding. This introduces new combinations of parameters and increases the risk of test coverage gaps. It is also less flexible for future evolution.

IMO, fixing the partitioning to modulo (%) is also not ideal for controllers that want resharding not to move every key around, which is why we're deciding to use a consistent hash ring so clients can specify their own ranges.
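A toy comparison of the two approaches (not the KEP's actual algorithm or hash choices): growing from 3 to 4 shards reassigns roughly three quarters of the keys under modulo partitioning, while on a consistent hash ring only the keys in the arc claimed by the new member change owner. A production ring would add virtual points so each member's arc stays close to 1/N of the keyspace.

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

func h32(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}

// ringOwner returns the first member clockwise from the key's ring position.
// One point per member keeps the sketch short.
func ringOwner(key string, members []string) string {
    sorted := append([]string(nil), members...)
    sort.Slice(sorted, func(i, j int) bool { return h32(sorted[i]) < h32(sorted[j]) })
    k := h32(key)
    for _, m := range sorted {
        if h32(m) >= k {
            return m
        }
    }
    return sorted[0] // wrap around past the top of the ring
}

func main() {
    keys := make([]string, 10000)
    for i := range keys {
        keys[i] = fmt.Sprintf("uid-%d", i)
    }
    old := []string{"shard-a", "shard-b", "shard-c"}
    grown := []string{"shard-a", "shard-b", "shard-c", "shard-d"}

    movedModulo, movedRing := 0, 0
    for _, k := range keys {
        if h32(k)%3 != h32(k)%4 { // shard index changes when the modulus grows
            movedModulo++
        }
        if ringOwner(k, old) != ringOwner(k, grown) { // only shard-d's arc moves
            movedRing++
        }
    }
    fmt.Printf("modulo: %d/%d keys moved, ring: %d/%d keys moved\n",
        movedModulo, len(keys), movedRing, len(keys))
}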

However, client-side sharding has a critical limitation: it does not reduce the incoming event
volume per replica. Every replica still receives the full stream of events, paying the CPU and
network cost to deserialize everything, only to discard items not belonging to their shard.
Functionally, this makes horizontal scaling of the watch stream impossible. This results in:
Contributor

There are attempts to have sharding labels and filter by them, so it is not impossible. But of course, labels need admission or yet another controller setting them, and they leak the controller topology into the object state.

Member Author

Persisting the sharding information into the object state seems like a slippery slope because of the admission/controller requirement, and it also makes resharding much harder later; having replicas added or removed would be very disruptive. That approach is listed in the alternatives section (label-based sharding); for now we're going with the stateless approach of generating hash rings, and clients can choose their own ranges.

@Jefftree Jefftree changed the title [KEP-5866] Server-side sharded watch [KEP-5866] Server-side sharded list and watch Feb 10, 2026
Contributor

jpbetz commented Feb 11, 2026

/lgtm

It looks like all outstanding concerns have been addressed.

  • We'll decide HOW to represent the client request during API review, but we have two concrete options (multiple simple query params or a 'selector' param with a grammar).
  • Clients can check whether the server sharded the request by inspecting ListMeta.
  • Clients can check whether the server supports sharding via discovery.

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 11, 2026
@k8s-ci-robot k8s-ci-robot merged commit 8b87c71 into kubernetes:master Feb 11, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 11, 2026
significantly due to high churn. Many controllers need to scale to handle this load.

Historically, most controllers choose to scale vertically (e.g., `kube-controller-manager`), as
there is no native support for sharding or partitioning the watch stream. Some specialized controllers (e.g.,
Contributor

Note, kube-state-metrics is not a controller. It's like saying Prometheus is a controller because it watches pods and calculates things based on that.


However, client-side sharding has a critical limitation: it does not reduce the incoming event
volume per replica. Every replica still receives the full stream of events, paying the CPU and
network cost to deserialize everything, only to discard items not belonging to their shard.
Contributor

What's the estimated cost benefit we expect? It's not a question of if, but of how much you are expecting. There are many other features that are easier to implement and can provide the same network/CPU savings without complicating the architecture.


### Server Design

Currently, the Cacher broadcasts events to all watchers that match a simple Label/Field selector.
Contributor

How will sharding affect APF cost calculation? Do we assume an optimistic uniform distribution or the worst case?
