Remember last execution time to avoid re-executing everything upon restart #1605

@nadiamoe

Currently, the SM Agent is entirely stateless, which makes it great for operational purposes. However, it does come with some downsides. One of these downsides is that, upon start, the agent has no way to tell how long ago (if ever) a check was last executed. Currently, the agent tackles this problem by assuming the check has never been executed, and schedules the next execution to happen within 2 minutes of starting time:

if offset == 0 {
	offset = randDuration(min(frequency, maxPublishInterval))
}

This has a number of small problems:

  1. Agents may run checks more often than expected if restarted. This can happen in both public and private probes.
  2. Agent restarts cause surges in load in public probes, because every check is rescheduled to run shortly after startup.

I propose that we add an optional, pluggable interface that will allow the agent to store the last execution time for each check in an external store, with the primary candidate being a key/value database such as Valkey or Redict.

Providing a URI to this kv store will be optional. If it is not supplied, the agent will behave as it does now: assuming each check has never been executed on startup, and scheduling it shortly after startup (with jitter).

Errors connecting to the kv store will be fatal for debuggability reasons. If a check has no last execution time recorded in the kv store, the agent will assume it has never been executed, which means the agent will degrade gracefully if the kv store is wiped or restarted.
