NETOBSERV-2574 Reconcile race condition - API discovery #2348
NETOBSERV-2574 Reconcile race condition - API discovery #2348jpinsonneau merged 3 commits intonetobserv:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2348 +/- ##
==========================================
- Coverage 73.83% 72.35% -1.49%
==========================================
Files 90 93 +3
Lines 10083 10380 +297
==========================================
+ Hits 7445 7510 +65
- Misses 2182 2402 +220
- Partials 456 468 +12
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| for gv, err := range discErr.Groups { | ||
| for gv, gvErr := range discErr.Groups { | ||
| if strings.Contains(apiName, gv.String()) { | ||
| log.Error(err, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String()) |
There was a problem hiding this comment.
@jpinsonneau was that message displayed in the logs or not, in the support case? I believe it should have?
There was a problem hiding this comment.
I haven't seen it but I'm having only partial logs here.
That would either confirm the API discovery completely failing and the code never reaching that log or the failure happend earlier and the discovery never been refreshed.
internal/pkg/cluster/cluster.go
Outdated
| log.Error(err, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String()) | ||
| log.Error(gvErr, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String(), "api", apiName) | ||
|
|
||
| // OCP Security API is critical - we MUST know if we're on OpenShift |
There was a problem hiding this comment.
I think the logic works fine for detecting openshift, but this case uncovers other issues that we may have with other APIs - even if that's not as critical, it's also a bug if, for instance, the servicemonitors API is temporarily down at startup, then later comes up: we would always consider it down.
So I think we should include the discovering process in the refresh mechanism
There was a problem hiding this comment.
That should make it: 0cb952a
Also took the opportunity to implement an exponential backoff retry mechanism from 30s to 10min max when failing.
internal/pkg/cluster/cluster.go
Outdated
| // Track which critical APIs failed discovery | ||
| criticalAPIFailed := false | ||
|
|
||
| for apiName := range newApisMap { | ||
| if hasAPI(apiName, resources) { | ||
| apisMap[apiName] = true | ||
| } else if discErr != nil { | ||
| newApisMap[apiName] = true | ||
| } else if hasDiscoveryError { | ||
| // Check if the wanted API is in error | ||
| for gv, err := range discErr.Groups { | ||
| for gv, gvErr := range discErr.Groups { | ||
| if strings.Contains(apiName, gv.String()) { | ||
| log.Error(err, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String()) | ||
| log.Error(gvErr, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String(), "api", apiName) | ||
|
|
||
| // OCP Security API is critical - we MUST know if we're on OpenShift | ||
| // to avoid wrong security context configurations | ||
| if apiName == ocpSecurity { | ||
| criticalAPIFailed = true | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Check if any APIs transitioned from unavailable to available (recovery scenario) | ||
| c.apisMapLock.Lock() | ||
| defer c.apisMapLock.Unlock() | ||
| c.apisMap = apisMap | ||
| oldApisMap := c.apisMap | ||
| apisRecovered := false | ||
| if oldApisMap != nil { | ||
| for apiName, newAvailable := range newApisMap { | ||
| oldAvailable := oldApisMap[apiName] | ||
| if !oldAvailable && newAvailable { | ||
| log.Info("API recovered and is now available", "api", apiName) | ||
| apisRecovered = true | ||
| } | ||
| } | ||
| } | ||
| c.apisMap = newApisMap | ||
|
|
||
| log.Info("API detection finished", "apis", newApisMap) |
There was a problem hiding this comment.
I may have another concern, which I'm not sure of, it's how sounded is the API discovery/availability, as we've seen in the past API that could become "unavailable", which doesn't mean it's not supposed to be there. What to do if an API becomes unavailable, but still installed?
An option is to only update newly available APIs, not the newly unavailable ones. The downside is that we never remove APIs, but should we expect it to happen really?
Anyway, let me know your thoughts; I've tested the code below against your unit tests and that still passes
| // Track which critical APIs failed discovery | |
| criticalAPIFailed := false | |
| for apiName := range newApisMap { | |
| if hasAPI(apiName, resources) { | |
| apisMap[apiName] = true | |
| } else if discErr != nil { | |
| newApisMap[apiName] = true | |
| } else if hasDiscoveryError { | |
| // Check if the wanted API is in error | |
| for gv, err := range discErr.Groups { | |
| for gv, gvErr := range discErr.Groups { | |
| if strings.Contains(apiName, gv.String()) { | |
| log.Error(err, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String()) | |
| log.Error(gvErr, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String(), "api", apiName) | |
| // OCP Security API is critical - we MUST know if we're on OpenShift | |
| // to avoid wrong security context configurations | |
| if apiName == ocpSecurity { | |
| criticalAPIFailed = true | |
| } | |
| } | |
| } | |
| } | |
| } | |
| // Check if any APIs transitioned from unavailable to available (recovery scenario) | |
| c.apisMapLock.Lock() | |
| defer c.apisMapLock.Unlock() | |
| c.apisMap = apisMap | |
| oldApisMap := c.apisMap | |
| apisRecovered := false | |
| if oldApisMap != nil { | |
| for apiName, newAvailable := range newApisMap { | |
| oldAvailable := oldApisMap[apiName] | |
| if !oldAvailable && newAvailable { | |
| log.Info("API recovered and is now available", "api", apiName) | |
| apisRecovered = true | |
| } | |
| } | |
| } | |
| c.apisMap = newApisMap | |
| log.Info("API detection finished", "apis", newApisMap) | |
| // Track which critical APIs failed discovery | |
| criticalAPIFailed := false | |
| apisRecovered := false | |
| firstRun := false | |
| c.apisMapLock.Lock() | |
| defer c.apisMapLock.Unlock() | |
| if c.apisMap == nil { | |
| c.apisMap = map[string]bool{ | |
| consolePlugin: false, | |
| cno: false, | |
| svcMonitor: false, | |
| promRule: false, | |
| ocpSecurity: false, | |
| endpointSlices: false, | |
| } | |
| firstRun = true | |
| } | |
| for apiName, discovered := range c.apisMap { | |
| // Never remove a discovered API, to avoid transient staleness issues triggering changes continuously | |
| if !discovered { | |
| if hasAPI(apiName, resources) { | |
| c.apisMap[apiName] = true | |
| if !firstRun { | |
| apisRecovered = true | |
| log.Info("API recovered and is now available", "api", apiName) | |
| } | |
| } else if hasDiscoveryError { | |
| // Check if the wanted API is in error | |
| for gv, gvErr := range discErr.Groups { | |
| if strings.Contains(apiName, gv.String()) { | |
| log.Error(gvErr, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String(), "api", apiName) | |
| // OCP Security API is critical - we MUST know if we're on OpenShift | |
| // to avoid wrong security context configurations | |
| if apiName == ocpSecurity { | |
| criticalAPIFailed = true | |
| } | |
| } | |
| } | |
| } | |
| } | |
| } | |
| if firstRun { | |
| log.Info("API detection finished", "apis", c.apisMap) | |
| } |
There was a problem hiding this comment.
Sounds valid to me 👌 We can improve later if we find a corner case
There was a problem hiding this comment.
I can't merge your suggestion directly: Applying suggestions on deleted lines is currently not supported.
Applied the changes here with a comment: e039f00
|
/ok-to-test |
|
New images:
They will expire after two weeks. To deploy this build: # Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:b61ca10 make deploy
# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-b61ca10Or as a Catalog Source: apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
name: netobserv-dev
namespace: openshift-marketplace
spec:
sourceType: grpc
image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-b61ca10
displayName: NetObserv development catalog
publisher: Me
updateStrategy:
registryPoll:
interval: 1m |
|
/label qe-approved |
Description
The operator was silently failing to create the
netobserv-hostnetwork-flpClusterRoleBinding on OpenShift clusters when the API server was slow or timing out during startup. This caused flowlogs-pipeline pods to fail with SCC (Security Context Constraint) violations.Root Cause
When the Kubernetes API server was slow during operator startup, the
fetchAvailableAPIs()function would receive partial discovery results with the OpenShift SCC API (security.openshift.io) marked as failed. The code treated this as success, settingIsOpenShift()tofalse, which caused the operator to skip creating required ClusterRoleBindings.Solution
Fail fast when critical APIs fail to be discovered:
Impact
Before
After
Dependencies
n/a
Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.