Skip to content

fix: init NNC from reconciler instead of directly#4199

Open
rbtr wants to merge 3 commits intomasterfrom
fix/init-from-reconcile
Open

fix: init NNC from reconciler instead of directly#4199
rbtr wants to merge 3 commits intomasterfrom
fix/init-from-reconcile

Conversation

@rbtr
Copy link
Collaborator

@rbtr rbtr commented Jan 22, 2026

Reason for Change:
There's no clear reason that CNS initializes with an NNC that it gets out of band from the typical NNC reconciler flow.
This separate path for fetching the NNC leads to bugs such as ingesting an NNC for a previous Node of the same name and state corruption once the reconciler (which is smart enough to wait for the actual NNC for its Node) starts.
Additionally, any caching, filtering, or other selection done to NNCs for efficiency is not applied or must be duplicated.

Instead, initialization can be run from the first NNC that the reconciler is pushed. This consolidates all CNS NNC reads to a single location where list-watch optimizations and filtering can be centralized.

@rbtr rbtr requested a review from a team as a code owner January 22, 2026 00:30
@rbtr rbtr requested review from Copilot and timraymond January 22, 2026 00:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request refactors the CNS (Container Networking Service) initialization to use the NNC (NodeNetworkConfig) reconciler flow instead of a separate out-of-band initialization path. This consolidates NNC reads to a single location, preventing bugs from ingesting stale NNCs and ensuring caching/filtering optimizations are consistently applied.

Changes:

  • Removed direct NNC client initialization and replaced it with an initializer callback passed to the reconciler
  • Modified the reconciler to invoke the initializer function on the first reconcile
  • Updated function signatures to accept the initializer as a parameter

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
cns/service/main.go Removed direct NNC client creation and retry logic; replaced with initialization closure passed to reconciler
cns/kubecontroller/nodenetworkconfig/reconciler.go Added initializer callback field and invocation logic in Reconcile method
cns/kubecontroller/nodenetworkconfig/reconciler_test.go Updated test calls to include no-op initializer function parameter

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@rbtr
Copy link
Collaborator Author

rbtr commented Jan 22, 2026

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr requested review from nddq and thatmattlong January 22, 2026 17:05
@rbtr rbtr force-pushed the fix/init-from-reconcile branch from cfab669 to 14b9b41 Compare January 23, 2026 17:48
@rbtr rbtr added bug cns Related to CNS. labels Jan 23, 2026
@rbtr
Copy link
Collaborator Author

rbtr commented Jan 23, 2026

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr added the fix Fixes something. label Jan 23, 2026
logger.Errorf("[cns-rc] initializer failed during reconcile: %v", err)
return reconcile.Result{}, errors.Wrap(err, "initializer failed during reconcile")
}
r.initializer = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the same behavior for this to be wedged until initialization succeeds, but do we want to eventually circuit break? It seems like that was the original intention of the existing logic as I read the comment above it... whether that was true in practice may be dubious though because of swallowing the retry error and the UntilSucceeded().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You read the prior intent correctly - we wanted to retry for a while and eventually fail out as a signal to the user that something was wrong. However, the retry was long enough that CNS wouldn't exit frequently enough to end up in CrashLoopBackoff and instead the Pod would just clock 5-6 restarts per hour.
Eventually someone might notice, but in practice this effectively was a retry-forever.

By tying this in to the Reconcile loop, we will be able to lean on the ctrlruntime machinery - it automatically does retry backoff and exposes metrics for failed reconciles which we can collect and use as a signal.

Copy link
Member

@nddq nddq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, but the changes make sense

// call initFunc on first reconcile and never again
if r.initializer != nil {
if err := r.initializer(nnc); err != nil {
logger.Errorf("[cns-rc] initializer failed during reconcile: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cns-rc? that's new 🙂

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

someone else started this actually, it's all over this file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug cns Related to CNS. fix Fixes something.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replacing a Node with the same name in Overlay can corrupt CNS state

3 participants