[KEP-5866] Server-side sharded list and watch#5867
k8s-ci-robot merged 1 commit into kubernetes:master from
Conversation
/cc

/assign

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jefftree, jpbetz. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
14bc5ff to 2ea540b
| sharding. Given the motivation is optimization, receiving the full stream is a safe fallback |
| (client-side filtering should remain in place or be conditionally disabled). |
As a sharded client, getting things outside the shard I requested is a really sharp edge. Is there a more deterministic way to recognize immediately that a server doesn't support sharding and client-side filtering is needed? Are we expecting to bake in a uid and namespace hash-based filter implementation that matches the server into the client library?
Hm, the client could query the OpenAPI endpoint to see if the new parameter exists. They can gate based on server version, but there is always the risk of the feature being turned off. Many projects already do client-side filtering, and I'd imagine just exposing our hash algorithm and comparison logic in a common library (e.g., apimachinery) is enough without being attached to a specific field.
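To make the fallback concrete, here is a minimal sketch of the kind of helper this comment imagines living in a shared library such as apimachinery: a client-side filter that reproduces the server's hash so an informer can keep filtering locally whenever the server ignores the sharding parameter. The package name, function name, and the choice of FNV-32a over the object UID are illustrative assumptions, not part of the KEP.

```go
// Hypothetical helper; nothing here is an existing apimachinery API.
package shardfilter

import (
	"hash/fnv"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// BelongsToShard reports whether obj falls into bucket shardIndex out of
// shardCount buckets, hashing the object's UID with FNV-32a. A client that
// detects (e.g. via OpenAPI) that the server ignored its sharding request
// can keep this predicate in its informer's filter as a safe fallback.
func BelongsToShard(obj metav1.Object, shardIndex, shardCount uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(obj.GetUID())) // types.UID is a string type, so this conversion is valid
	return h.Sum32()%shardCount == shardIndex
}
```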
| - Wasted network bandwidth (N replicas * Full Stream). |
| - Wasted CPU and Memory on clients for processing irrelevant events. |
| - Increased lock contention on the API Server side serialized same events for multiple watchers. |
nit: When the same object is sent to multiple watchers, it is actually serialized once (per serialization format) - ref https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1152-less-object-serializations
Also, I'm not sure we have lock-contention issues related to it.
Instead, where we actually have problems is:
- resource consumption related to the fact that we actually need to send all that data
- throughput at the watchcache layer ("first line" event dispatching is single-threaded, and we iterate over all watchers interested in a given event to put it into the appropriate channel); so the throughput that we can achieve depends directly on the number of those watchers
oops, I was trying to say informer lock contention and mixed up my thoughts. Sending the data is captured in the wasted-network-bandwidth bullet, but I guess it also includes some extra overhead on the apiserver side.
Throughput at the watch cache layer is not changed by this KEP since it's more correlated to the number of watchers, right? Filtering out more events for each watcher doesn't change the total number of watchers, so that's still an unsolved problem outside the scope of this KEP?
Actually, the throughput of the watch cache layer may be slightly improved too (though I don't expect non-marginal changes unless adoption is huge).
Or maybe it does change it: it is possible to also improve watchcache throughput if we add additional indexing based on the shard (which we can relatively easily index on). We have this kind of indexing for "pod.spec.nodeName", as well as for "metadata.namespace/metadata.name", and those were game changers for watchcache throughput/contention.
So not out of the box, but it has the potential to improve throughput too.
Exactly. The goal is to define our filter operations in such a way that they could be indexed, but leave the indexing as future work that we can do on an as-needed basis.
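For readers unfamiliar with the watchcache indexing being referenced, here is a rough illustration (not kube-apiserver code) of why a shard-based index would help: instead of iterating over every watcher for every event, the dispatcher only touches the bucket of watchers whose shard the event's key hashes into. All type names and the shardOf helper are made up for the sketch.

```go
package main

import "hash/fnv"

type event struct {
	key string // e.g. "namespace/name"
}

type watcher struct {
	result chan event
}

// shardedDispatcher buckets watchers by the shard their filter selects,
// analogous in spirit to the existing pod.spec.nodeName and namespace/name indexes.
type shardedDispatcher struct {
	shardCount uint32
	byShard    map[uint32][]*watcher // watchers that asked for a specific shard
	unsharded  []*watcher            // watchers without a shard selector still see everything
}

func shardOf(key string, shardCount uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % shardCount
}

// dispatch delivers an event only to the watchers that could match it,
// instead of iterating over all watchers as the unindexed path does.
func (d *shardedDispatcher) dispatch(ev event) {
	for _, w := range d.byShard[shardOf(ev.key, d.shardCount)] {
		w.result <- ev
	}
	for _, w := range d.unsharded {
		w.result <- ev
	}
}

func main() {} // compile-only sketch
```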
| proposal will be implemented, this is the place to discuss them. |
| --> |
| ### API Extensibility: The `selector` Parameter |
We can extend ListOptions to support hash-based sharding parameters:

```go
type ListOptions struct {
	// ... existing fields

	ShardingIndex    int64  // [0, ShardingCount)
	ShardingCount    int64  // Total shards
	ShardingLabelKey string // Label/field to hash
	HashFunc         string // "FNV32"
}
```

Filtering logic:

```
hash(object.label[ShardingLabelKey]) % ShardingCount == ShardingIndex
```

Applied in SelectionPredicate at the storage layer, working for both LIST and WATCH.
This is captured in the alternatives section. Here's the snippet for the con of this approach:
Cons: Increases the surface area of meta/v1.ListOptions with fields that are specific only to watch sharding. This introduces new combinations of parameters and increases the risk of test coverage gaps. It is also less flexible for future evolution.
IMO, fixing the partitioning to modulo (%) is also not ideal for controllers that want reshards to avoid moving every key around, which is why we're deciding to use a consistent hash ring so clients can specify their own ranges.
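To illustrate the resharding argument, here is a small sketch (the names and range layout are hypothetical, not the KEP's API) of the difference: with `hash % N`, going from N to N+1 replicas remaps almost every key, whereas with a hash ring each replica watches a contiguous slice of the 32-bit hash space, so adding a replica only splits one slice and leaves the other replicas' keys untouched.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// hashRange is one replica's slice of the 32-bit hash ring (inclusive bounds).
type hashRange struct {
	start, end uint32
}

// contains reports whether a key (e.g. "namespace/name") hashes into this slice.
func (r hashRange) contains(key string) bool {
	h := fnv.New32a()
	h.Write([]byte(key))
	v := h.Sum32()
	return v >= r.start && v <= r.end
}

func main() {
	// Two replicas splitting the ring in half. A third replica would split
	// just one of these ranges; with modulo bucketing, most keys would move.
	ranges := []hashRange{
		{start: 0, end: 1<<31 - 1},
		{start: 1 << 31, end: math.MaxUint32},
	}
	for i, r := range ranges {
		fmt.Printf("replica %d owns kube-system/coredns: %v\n", i, r.contains("kube-system/coredns"))
	}
}
```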
| However, client-side sharding has a critical limitation: it does not reduce the incoming event |
| volume per replica. Every replica still receives the full stream of events, paying the CPU and |
| network cost to deserialize everything, only to discard items not belonging to their shard. |
| Functionally, this makes horizontal scaling of the watch stream impossible. This results in: |
There are attempts to have sharding labels and filter by them, so it is not that it is impossible. But of course, labels need admission or yet another controller setting them, and they leak the controller topology into the object state.
Persisting the sharding information into the object state seems like a slippery slope because of the admission/controller requirement, and it would also be very difficult to reshard later: having replicas added or removed seems very disruptive. That approach is listed in the alternatives section (label-based sharding); for now we're going with the stateless approach of generating hash rings, where clients can choose their own ranges.
/lgtm

It looks like all outstanding concerns have been addressed.
| significantly due to high churn. Many controllers need to scale to handle this load. |
| Historically, most controllers choose to scale vertically (e.g., `kube-controller-manager`), as |
| there is no native support for sharding or partitioning the watch stream. Some specialized controllers (e.g., |
Note: kube-state-metrics is not a controller. That's like saying Prometheus is a controller because it watches pods and calculates things based on them.
| However, client-side sharding has a critical limitation: it does not reduce the incoming event |
| volume per replica. Every replica still receives the full stream of events, paying the CPU and |
| network cost to deserialize everything, only to discard items not belonging to their shard. |
What's the estimated cost benefit we expect? It's not a question of if, but of how much you are expecting. There are many other features that are easier to implement and can provide the same network/CPU savings without complicating the architecture.
| ### Server Design |
| Currently, the Cacher broadcasts events to all watchers that match a simple Label/Field selector. |
How will sharding affect APF cost calculation? Do we assume an optimistic uniform distribution or the worst case?
/sig api-machinery