Description
In some setups every storage is also a router. An example of such a setup is TDG. The router's failover service pings every node in the cluster every second. The situation became much worse recently.
Before commit 17abd69, routers pinged the cluster in the following loop:
for _, replicaset in pairs(router.replicasets) do
    local replica = replicaset.replica
    -- ping that replica
end
There, if the number of replicasets was huge, we didn't ping them all at once, but did that sequentially, waiting for every replica to reply. It was rate-limited naturally. However, after the commit mentioned above we started to create a separate fiber for every replica and ping it there. This allowed us not to skip pings when some of the nodes are unavailable, but it started to produce many more pings.
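To see why the fiber-per-replica scheme multiplies the load, here is a hypothetical back-of-the-envelope model (not vshard code; the function names and the 5 ms RTT are illustrative assumptions): the old sequential loop is throttled by waiting for each reply, while N independent fibers each firing every second produce N pings per second per router.

```python
# Hypothetical model, not vshard code: compare aggregate ping rates of the
# old sequential loop vs. the new fiber-per-replica scheme.

def sequential_rate(n_replicas, rtt, interval=1.0):
    """Old scheme: one loop pings replicas one by one, waiting for each
    reply, then sleeps `interval` seconds before the next pass. A full
    pass takes n * rtt + interval seconds, which bounds the rate."""
    return n_replicas / (n_replicas * rtt + interval)

def per_fiber_rate(n_replicas, interval=1.0):
    """New scheme: every replica gets its own fiber pinging it every
    `interval` seconds, so the aggregate rate grows linearly with n."""
    return n_replicas / interval

# 1000 replicas, 5 ms RTT: the sequential loop tops out at ~167 pings/s,
# while the per-fiber scheme produces 1000 pings/s per router.
print(round(sequential_rate(1000, 0.005)), round(per_fiber_rate(1000)))
```

With thousands of such routers, the difference is multiplied again by the router count, which matches the CPU symptoms described below.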
The situation became even worse because that commit also started pinging all replicas in the cluster, not only the prioritized ones. That was needed to automatically skip dead replicas in callbro requests.
So in the end we have a router that pings every node in the cluster and does that with a lot of fibers. The number of fibers itself is not a problem, they don't affect performance too much, but the increased number of pings is. We should probably rate-limit the pings somehow, e.g. to 100 pings / second. Perhaps some modification of the SWIM algorithm can be used here: select a random subset of instances whose size is capped by some constant and ping only that subset, so that the ping load on the cluster does not grow with cluster size.
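The proposed rate limit could be sketched as below (a hypothetical Python illustration, not vshard code; the function name and the budget of 100 are assumptions taken from the suggestion above): each ping round, pick a bounded random subset of instances, SWIM-style, so the per-router ping rate stays constant regardless of cluster size.

```python
# Hypothetical sketch of the proposed rate limit, not vshard code: cap the
# number of instances pinged per round at a fixed budget, choosing the
# subset uniformly at random so every instance is still covered over time.
import random

MAX_PINGS_PER_ROUND = 100  # assumed budget, e.g. 100 pings / second

def pick_ping_targets(instances, limit=MAX_PINGS_PER_ROUND, rng=random):
    """Return at most `limit` instances chosen uniformly at random."""
    if len(instances) <= limit:
        return list(instances)
    return rng.sample(instances, limit)

# With 5000 storages the router still sends only 100 pings per round.
targets = pick_ping_targets([f"storage_{i}" for i in range(5000)])
print(len(targets))
```

Random selection (rather than a fixed rotation) mirrors SWIM's failure-detection style: over many rounds every instance gets pinged with high probability, while the instantaneous load stays bounded.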
The indicator of this problem is constant high load on the cluster: fiber.top() shows that a lot of CPU time is spent in replica and net.box fibers while there is no external load on the cluster. To reproduce the issue, the number of routers must be very large (several hundred is not enough; thousands are needed to affect CPU).