Skip to content

feat(agent): optimise getting health checks from consul server#354

Open
shashankNandigama wants to merge 4 commits intomainfrom
shashank/optimise-getHealthChecks
Open

feat(agent): optimise getting health checks from consul server#354
shashankNandigama wants to merge 4 commits intomainfrom
shashank/optimise-getHealthChecks

Conversation

@shashankNandigama
Copy link
Contributor

@shashankNandigama shashankNandigama commented Mar 4, 2026

Issue

Delay in fetching new service healthChecks registered in new namespaces

Root cause

In ESM getHealthChecks() method, it fetches the list of namespaces and then loops over all namespaces to fetch the health checks present in those namespaces. It makes a blocking query to the API /v1/health/state/any. By default consul server has a waitTime of 5 minutes for blocking queries.

When there are for example 100 namespaces with some services in them and there are no new service registrations or de-registrations, or state changes in these namespaces the query to fetch healthChecks gets blocked for 5 mins for each namespace.Now if a new service is registered in a new namespace, this new namespace is recognised by esm only after it finishes iterating over 100 namespaces leading to a long delay.

Change

Consul server supports sending health checks in all namespaces in a single API call, updated esm to use this instead of sequentially fetching all health checks in all namespaces.

Validation
Tested the delay in esm fetching new services registered by registering upto ~50k services and it is taking <20 seconds for esm to add new services to its monitoring list.

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

a.logger.Warn("Error querying for health check info", "error", err)
continue
a.namespaceWildcard = ""
a.logger.Warn("Failed to detect Consul edition after retries, defaulting to CE behavior", "error", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the log level to error

if err == nil {
break
}
a.logger.Warn("Error fetching agent info for enterprise detection, will retry",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the log level to error

// Enterprise builds include "+ent" in the version (e.g. "1.18.0+ent").
func (a *Agent) getNamespaceWildcard() string {
a.namespaceWildcardOnce.Do(func() {
maxRetries := 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe these constants should be declared at common place with proper details in the comment.

// Returns "*" for Enterprise Consul (wildcard across all namespaces) or "" for OSS/CE.
// Detection is done by checking the Consul version string from /v1/agent/self —
// Enterprise builds include "+ent" in the version (e.g. "1.18.0+ent").
func (a *Agent) getNamespaceWildcard() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should cache the previous value and run this only when it is nil.
Basically execute it once when successfully detects CE/ENT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants