Description
In my testing, I've observed that the cold start time for a new 3-node cluster can range from 1 to 2 minutes depending on the timing of instance startup. This seems to result from the unconditional, fixed etcd health checking interval (worst case ~30s given the retries and timeouts) and the absence of any assumption that would let the program distinguish an unreachable cluster from one that never existed (i.e. one not worth checking and safe to assume unhealthy). 2 minutes is the average I observe over repeated testing in a more real-world environment (a StatefulSet on Kubernetes with persistent volumes, etc.). In the usual case, each instance incurs multiple futile etcd health checks before the state machines converge on the startup conditions.
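To illustrate the pattern I mean (this is a hedged sketch, not the project's actual code, and the retry count, timeout, and endpoint are made-up values): a fixed retry/timeout health check pays the full worst-case delay on a brand-new cluster, because nothing tells it the check is futile.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	healthAttempts  = 3                // hypothetical retry count
	perAttemptLimit = 10 * time.Second // hypothetical per-attempt timeout
	// Worst case: 3 attempts x 10s = ~30s spent before concluding "unhealthy",
	// and each instance may repeat this several times before startup converges.
)

// checkEtcdHealth probes the standard etcd /health endpoint with a bounded
// number of attempts, each capped by a per-attempt timeout. On a cluster that
// has never existed, every attempt times out or is refused, so the loop
// always runs to completion before returning false.
func checkEtcdHealth(endpoint string) bool {
	client := &http.Client{Timeout: perAttemptLimit}
	for i := 0; i < healthAttempts; i++ {
		resp, err := client.Get(endpoint + "/health")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return true
			}
		}
	}
	return false
}

func main() {
	start := time.Now()
	healthy := checkEtcdHealth("http://etcd-0.etcd:2379") // hypothetical endpoint
	fmt.Printf("healthy=%v after %s\n", healthy, time.Since(start).Round(time.Second))
}
```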
I've been thinking about how to optimize initial startup time, but I haven't yet found a solution that doesn't introduce some kind of shared state or come with some other poor tradeoff (e.g. reducing the etcd health check retries/timeouts, which increases the possibility of false-negative health check results).
I suspect that the delay is basically a tradeoff that comes with the stateless architecture, but I wonder if this is something you've thought about before. Very curious to hear what you think!