Description
In some setups every storage is also a router. An example of such a setup is TDG. The router's failover service pings every node in the cluster every second. The situation became much worse recently.
Before commit 17abd69, routers pinged the cluster in the following loop:
for _, replicaset in pairs(router.replicasets) do
    local replica = replicaset.replica
    -- ping that replica
end
There, if the number of replicasets was huge, we didn't ping them all at once, but did that sequentially, waiting for every replica to reply. It was rate-limited naturally. However, after the commit mentioned above we started to create a separate fiber for every replica and ping it there. This allowed us not to skip pings when some of the nodes are unavailable, but it started to produce many more pings.
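To see why the fiber-per-replica scheme multiplies the load, here is a hypothetical back-of-the-envelope model (not vshard code; the function names and the 5 ms RTT are illustrative assumptions): the old sequential loop is throttled by waiting for each reply, while N independent fibers each firing every second produce N pings per second per router.

```python
# Hypothetical model, not vshard code: compare aggregate ping rates of the
# old sequential loop vs. the new fiber-per-replica scheme.

def sequential_rate(n_replicas, rtt, interval=1.0):
    """Old scheme: one loop pings replicas one by one, waiting for each
    reply, then sleeps `interval` seconds before the next pass. A full
    pass takes n * rtt + interval seconds, which bounds the rate."""
    return n_replicas / (n_replicas * rtt + interval)

def per_fiber_rate(n_replicas, interval=1.0):
    """New scheme: every replica gets its own fiber pinging it every
    `interval` seconds, so the aggregate rate grows linearly with n."""
    return n_replicas / interval

# 1000 replicas, 5 ms RTT: the sequential loop tops out at ~167 pings/s,
# while the per-fiber scheme produces 1000 pings/s per router.
print(round(sequential_rate(1000, 0.005)), round(per_fiber_rate(1000)))
```

With thousands of such routers, the difference is multiplied again by the router count, which matches the CPU symptoms described below.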
The situation became even worse because that commit also started pinging all replicas in the cluster, not only the prioritized ones. That was needed to automatically skip dead replicas in callbro requests.
So in the end we have a router that pings every node in the cluster and does that with a lot of fibers. The number of fibers itself is not a problem, they don't affect performance too much, but the increased number of pings is. We should probably rate-limit the pings somehow, e.g. to 100 pings / second. Perhaps some modification of the SWIM algorithm can be used here: select a random subset of instances whose size is capped by some constant and ping only that subset, so that the ping load on the cluster does not grow with cluster size.
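The proposed rate limit could be sketched as below (a hypothetical Python illustration, not vshard code; the function name and the budget of 100 are assumptions taken from the suggestion above): each ping round, pick a bounded random subset of instances, SWIM-style, so the per-router ping rate stays constant regardless of cluster size.

```python
# Hypothetical sketch of the proposed rate limit, not vshard code: cap the
# number of instances pinged per round at a fixed budget, choosing the
# subset uniformly at random so every instance is still covered over time.
import random

MAX_PINGS_PER_ROUND = 100  # assumed budget, e.g. 100 pings / second

def pick_ping_targets(instances, limit=MAX_PINGS_PER_ROUND, rng=random):
    """Return at most `limit` instances chosen uniformly at random."""
    if len(instances) <= limit:
        return list(instances)
    return rng.sample(instances, limit)

# With 5000 storages the router still sends only 100 pings per round.
targets = pick_ping_targets([f"storage_{i}" for i in range(5000)])
print(len(targets))
```

Random selection (rather than a fixed rotation) mirrors SWIM's failure-detection style: over many rounds every instance gets pinged with high probability, while the instantaneous load stays bounded.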
The indicator of this problem is constant high load on the cluster: fiber.top() shows that a lot of CPU time is spent in replica and net.box fibers while there is no external load on the cluster. To reproduce the issue, the number of routers must be very large (several hundred is not enough; thousands are needed to affect CPU).