feat(agent): optimise getting health checks from consul server#354
Open
shashankNandigama wants to merge 4 commits intomainfrom
Open
feat(agent): optimise getting health checks from consul server#354shashankNandigama wants to merge 4 commits intomainfrom
shashankNandigama wants to merge 4 commits intomainfrom
Conversation
3 tasks
| a.logger.Warn("Error querying for health check info", "error", err) | ||
| continue | ||
| a.namespaceWildcard = "" | ||
| a.logger.Warn("Failed to detect Consul edition after retries, defaulting to CE behavior", "error", err) |
Contributor
There was a problem hiding this comment.
Change the log level to error
| if err == nil { | ||
| break | ||
| } | ||
| a.logger.Warn("Error fetching agent info for enterprise detection, will retry", |
Contributor
There was a problem hiding this comment.
Change the log level to error
| // Enterprise builds include "+ent" in the version (e.g. "1.18.0+ent"). | ||
| func (a *Agent) getNamespaceWildcard() string { | ||
| a.namespaceWildcardOnce.Do(func() { | ||
| maxRetries := 3 |
Contributor
There was a problem hiding this comment.
i believe these constants should be declared at common place with proper details in the comment.
| // Returns "*" for Enterprise Consul (wildcard across all namespaces) or "" for OSS/CE. | ||
| // Detection is done by checking the Consul version string from /v1/agent/self — | ||
| // Enterprise builds include "+ent" in the version (e.g. "1.18.0+ent"). | ||
| func (a *Agent) getNamespaceWildcard() string { |
Contributor
There was a problem hiding this comment.
We should cache the previous value and run this only when it is nil.
Basically execute it once when successfully detects CE/ENT.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue
Delay in fetching new service healthChecks registered in new namespaces
Root cause
In ESM getHealthChecks() method, it fetches the list of namespaces and then loops over all namespaces to fetch the health checks present in those namespaces. It makes a blocking query to the API /v1/health/state/any. By default consul server has a waitTime of 5 minutes for blocking queries.
When there are for example 100 namespaces with some services in them and there are no new service registrations or de-registrations, or state changes in these namespaces the query to fetch healthChecks gets blocked for 5 mins for each namespace.Now if a new service is registered in a new namespace, this new namespace is recognised by esm only after it finishes iterating over 100 namespaces leading to a long delay.
Change
Consul server supports sending health checks in all namespaces in a single API call, updated esm to use this instead of sequentially fetching all health checks in all namespaces.
Validation
Tested the delay in esm fetching new services registered by registering upto ~50k services and it is taking <20 seconds for esm to add new services to its monitoring list.
PCI review checklist
I have documented a clear reason for, and description of, the change I am making.
If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.
If applicable, I've documented the impact of any changes to security controls.
Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.