
NETOBSERV-2574 Reconcile race condition - API discovery #2348

Merged
jpinsonneau merged 3 commits into netobserv:main from jpinsonneau:2574
Jan 26, 2026

Conversation


@jpinsonneau jpinsonneau commented Jan 20, 2026

Description

The operator was silently failing to create the netobserv-hostnetwork-flp ClusterRoleBinding on OpenShift clusters when the API server was slow or timing out during startup. This caused flowlogs-pipeline pods to fail with SCC (Security Context Constraint) violations.

Root Cause

When the Kubernetes API server was slow during operator startup, the fetchAvailableAPIs() function would receive partial discovery results with the OpenShift SCC API (security.openshift.io) marked as failed. The code treated this as success, setting IsOpenShift() to false, which caused the operator to skip creating required ClusterRoleBindings.

Solution

Fail fast when critical APIs fail to be discovered:

  • Added validation to detect when the OpenShift SCC API is unavailable
  • Return an error instead of silently continuing with incorrect cluster detection
  • OLM will retry operator startup until the API server is responsive

Impact

Before

  • API discovery could fail silently during slow API server conditions
  • Operator would start with incorrect cluster type detection
  • Required ClusterRoleBindings were never created
  • Pods failed to start with SCC violations
  • Issue persisted for 10+ days until manual restart

After

  • Critical API discovery failures are detected immediately
  • Operator refuses to start with clear error message
  • OLM automatically retries until API server is healthy

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci

openshift-ci bot commented Jan 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jpinsonneau for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jpinsonneau jpinsonneau requested a review from jotak January 20, 2026 17:36
@codecov

codecov bot commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.35%. Comparing base (3d5432d) to head (e039f00).
⚠️ Report is 15 commits behind head on main.

Files with missing lines | Patch % | Lines
internal/pkg/cluster/refresher.go | 21.05% | 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2348      +/-   ##
==========================================
- Coverage   73.83%   72.35%   -1.49%     
==========================================
  Files          90       93       +3     
  Lines       10083    10380     +297     
==========================================
+ Hits         7445     7510      +65     
- Misses       2182     2402     +220     
- Partials      456      468      +12     
Flag | Coverage Δ
unittests | 72.35% <75.00%> (-1.49%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines | Coverage Δ
internal/pkg/cluster/cluster.go | 85.58% <100.00%> (+5.16%) ⬆️
internal/pkg/cluster/refresher.go | 39.39% <21.05%> (-5.61%) ⬇️

... and 16 files with indirect coverage changes


-for gv, err := range discErr.Groups {
+for gv, gvErr := range discErr.Groups {
 	if strings.Contains(apiName, gv.String()) {
-		log.Error(err, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String())
Member:
@jpinsonneau was that message displayed in the logs or not, in the support case? I believe it should have?

jpinsonneau (Member, Author):

I haven't seen it, but I only have partial logs here.

That would either confirm that the API discovery completely failed and the code never reached that log, or that the failure happened earlier and the discovery was never refreshed.

-log.Error(err, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String())
+log.Error(gvErr, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String(), "api", apiName)

// OCP Security API is critical - we MUST know if we're on OpenShift
Member:

I think the logic works fine for detecting OpenShift, but this case uncovers other issues we may have with other APIs. Even if it's not as critical, it's also a bug if, for instance, the servicemonitors API is temporarily down at startup and later comes up: we would always consider it down.
So I think we should include the discovery process in the refresh mechanism.

jpinsonneau (Member, Author):

That should do it: 0cb952a

I also took the opportunity to implement an exponential backoff retry mechanism on failure, from 30s up to a 10min max.

Comment on lines +102 to +140
 // Track which critical APIs failed discovery
 criticalAPIFailed := false

 for apiName := range newApisMap {
 	if hasAPI(apiName, resources) {
-		apisMap[apiName] = true
-	} else if discErr != nil {
+		newApisMap[apiName] = true
+	} else if hasDiscoveryError {
 		// Check if the wanted API is in error
-		for gv, err := range discErr.Groups {
+		for gv, gvErr := range discErr.Groups {
 			if strings.Contains(apiName, gv.String()) {
-				log.Error(err, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String())
+				log.Error(gvErr, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String(), "api", apiName)

 				// OCP Security API is critical - we MUST know if we're on OpenShift
 				// to avoid wrong security context configurations
 				if apiName == ocpSecurity {
 					criticalAPIFailed = true
 				}
 			}
 		}
 	}
 }

 // Check if any APIs transitioned from unavailable to available (recovery scenario)
 c.apisMapLock.Lock()
 defer c.apisMapLock.Unlock()
-c.apisMap = apisMap
+oldApisMap := c.apisMap
+apisRecovered := false
+if oldApisMap != nil {
+	for apiName, newAvailable := range newApisMap {
+		oldAvailable := oldApisMap[apiName]
+		if !oldAvailable && newAvailable {
+			log.Info("API recovered and is now available", "api", apiName)
+			apisRecovered = true
+		}
+	}
+}
+c.apisMap = newApisMap

 log.Info("API detection finished", "apis", newApisMap)
Member:

I may have another concern, which I'm not sure of: how sound is the API discovery/availability check? We've seen in the past APIs that could become "unavailable", which doesn't mean they're not supposed to be there. What should we do if an API becomes unavailable but is still installed?

An option is to only update newly available APIs, not the newly unavailable ones. The downside is that we never remove APIs, but should we really expect that to happen?

Anyway, let me know your thoughts; I've tested the code below against your unit tests and it still passes.

Suggested change (current code):

// Track which critical APIs failed discovery
criticalAPIFailed := false
for apiName := range newApisMap {
	if hasAPI(apiName, resources) {
		newApisMap[apiName] = true
	} else if hasDiscoveryError {
		// Check if the wanted API is in error
		for gv, gvErr := range discErr.Groups {
			if strings.Contains(apiName, gv.String()) {
				log.Error(gvErr, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String(), "api", apiName)
				// OCP Security API is critical - we MUST know if we're on OpenShift
				// to avoid wrong security context configurations
				if apiName == ocpSecurity {
					criticalAPIFailed = true
				}
			}
		}
	}
}
// Check if any APIs transitioned from unavailable to available (recovery scenario)
c.apisMapLock.Lock()
defer c.apisMapLock.Unlock()
oldApisMap := c.apisMap
apisRecovered := false
if oldApisMap != nil {
	for apiName, newAvailable := range newApisMap {
		oldAvailable := oldApisMap[apiName]
		if !oldAvailable && newAvailable {
			log.Info("API recovered and is now available", "api", apiName)
			apisRecovered = true
		}
	}
}
c.apisMap = newApisMap
log.Info("API detection finished", "apis", newApisMap)

Suggested replacement:

// Track which critical APIs failed discovery
criticalAPIFailed := false
apisRecovered := false
firstRun := false
c.apisMapLock.Lock()
defer c.apisMapLock.Unlock()
if c.apisMap == nil {
	c.apisMap = map[string]bool{
		consolePlugin:  false,
		cno:            false,
		svcMonitor:     false,
		promRule:       false,
		ocpSecurity:    false,
		endpointSlices: false,
	}
	firstRun = true
}
for apiName, discovered := range c.apisMap {
	// Never remove a discovered API, to avoid transient staleness issues triggering changes continuously
	if !discovered {
		if hasAPI(apiName, resources) {
			c.apisMap[apiName] = true
			if !firstRun {
				apisRecovered = true
				log.Info("API recovered and is now available", "api", apiName)
			}
		} else if hasDiscoveryError {
			// Check if the wanted API is in error
			for gv, gvErr := range discErr.Groups {
				if strings.Contains(apiName, gv.String()) {
					log.Error(gvErr, "some API-related features are unavailable; you can check for stale APIs with 'kubectl get apiservice'", "GroupVersion", gv.String(), "api", apiName)
					// OCP Security API is critical - we MUST know if we're on OpenShift
					// to avoid wrong security context configurations
					if apiName == ocpSecurity {
						criticalAPIFailed = true
					}
				}
			}
		}
	}
}
if firstRun {
	log.Info("API detection finished", "apis", c.apisMap)
}

jpinsonneau (Member, Author):

Sounds valid to me 👌 We can improve later if we find a corner case

jpinsonneau (Member, Author):

I can't merge your suggestion directly ("Applying suggestions on deleted lines is currently not supported").
Applied the changes here with a comment: e039f00


@jotak jotak left a comment


LGTM, thanks!

@kapjain-rh
Member

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 23, 2026
@github-actions

New images:

  • quay.io/netobserv/network-observability-operator:b61ca10
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-b61ca10
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-b61ca10

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:b61ca10 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-b61ca10

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-b61ca10
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@kapjain-rh
Member

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved QE has approved this pull request label Jan 23, 2026
@jpinsonneau jpinsonneau merged commit d3a56da into netobserv:main Jan 26, 2026
14 of 15 checks passed

Labels

lgtm · ok-to-test (To set manually when a PR is safe to test. Triggers image build on PR.) · qe-approved (QE has approved this pull request)


3 participants