fix: init NNC from reconciler instead of directly #4199
Conversation
Pull request overview
This pull request refactors the CNS (Container Networking Service) initialization to use the NNC (NodeNetworkConfig) reconciler flow instead of a separate out-of-band initialization path. This consolidates NNC reads to a single location, preventing bugs from ingesting stale NNCs and ensuring caching/filtering optimizations are consistently applied.
Changes:
- Removed direct NNC client initialization and replaced it with an initializer callback passed to the reconciler
- Modified the reconciler to invoke the initializer function on the first reconcile (see the sketch after this list)
- Updated function signatures to accept the initializer as a parameter
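A minimal sketch of that callback pattern, using illustrative names rather than the actual CNS types (the real reconciler lives in cns/kubecontroller/nodenetworkconfig):

```go
package nncsketch

// Sketch only: an initializer callback that the reconciler runs exactly once,
// on the first NNC it receives. Names and types are illustrative, not the
// actual CNS implementation.

import (
	"context"
	"fmt"
)

// nodeNetworkConfig stands in for the NodeNetworkConfig CRD object.
type nodeNetworkConfig struct {
	Name string
}

// Reconciler carries an optional initializer that is cleared after it succeeds.
type Reconciler struct {
	initializer func(*nodeNetworkConfig) error
}

// Reconcile invokes the initializer on the first NNC it sees. On failure the
// error is returned so the caller (controller-runtime in the real code) can
// retry with backoff; on success the initializer is cleared and never runs again.
func (r *Reconciler) Reconcile(_ context.Context, nnc *nodeNetworkConfig) error {
	if r.initializer != nil {
		if err := r.initializer(nnc); err != nil {
			return fmt.Errorf("initializer failed during reconcile: %w", err)
		}
		r.initializer = nil
	}
	// ... normal reconcile logic follows here ...
	return nil
}
```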
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| cns/service/main.go | Removed direct NNC client creation and retry logic; replaced with initialization closure passed to reconciler |
| cns/kubecontroller/nodenetworkconfig/reconciler.go | Added initializer callback field and invocation logic in Reconcile method |
| cns/kubecontroller/nodenetworkconfig/reconciler_test.go | Updated test calls to include no-op initializer function parameter |
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
force-pushed from cfab669 to 14b9b41
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
```go
			logger.Errorf("[cns-rc] initializer failed during reconcile: %v", err)
			return reconcile.Result{}, errors.Wrap(err, "initializer failed during reconcile")
		}
		r.initializer = nil
```
I think it's the same behavior for this to stay wedged until initialization succeeds, but do we want to eventually circuit-break? That seems to have been the original intention of the existing logic, as I read the comment above it... though whether that held in practice is dubious, given the swallowed retry error and the UntilSucceeded().
You read the prior intent correctly - we wanted to retry for a while and eventually fail out as a signal to the user that something was wrong. However, the retry window was long enough that CNS didn't exit frequently enough to end up in CrashLoopBackOff; the Pod would just clock 5-6 restarts per hour.
Eventually someone might have noticed, but in practice this was effectively retry-forever.
By tying this into the Reconcile loop, we can lean on the ctrlruntime machinery - it automatically applies retry backoff and exposes metrics for failed reconciles, which we can collect and use as a signal.
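For context, a minimal sketch of the mechanism being leaned on here, assuming the standard controller-runtime Reconciler contract (illustrative, not this PR's code): returning a non-nil error from Reconcile causes controller-runtime to requeue the request with its default exponential backoff and to count the failure in its reconcile error metrics (e.g. controller_runtime_reconcile_errors_total).

```go
package nncsketch

// Sketch only: with controller-runtime, surfacing the error from Reconcile is
// enough to get retries with backoff and error metrics; no bespoke retry loop
// is needed in CNS.

import (
	"context"
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"
)

type demoReconciler struct{}

// Reconcile returns the error instead of swallowing it, handing retry/backoff
// and failure accounting to the controller-runtime workqueue.
func (demoReconciler) Reconcile(_ context.Context, _ ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, errors.New("initialization not ready yet")
}
```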
nddq left a comment
minor comments, but the changes make sense
```go
		// call initFunc on first reconcile and never again
		if r.initializer != nil {
			if err := r.initializer(nnc); err != nil {
				logger.Errorf("[cns-rc] initializer failed during reconcile: %v", err)
```
Someone else started this, actually - it's all over this file?
Reason for Change:
There's no clear reason that CNS initializes with an NNC that it gets out of band from the typical NNC reconciler flow.
This separate path for fetching the NNC leads to bugs such as ingesting an NNC that belonged to a previous Node of the same name, and to state corruption once the reconciler (which is smart enough to wait for the actual NNC for its Node) starts.
Additionally, any caching, filtering, or other selection done to NNCs for efficiency is not applied or must be duplicated.
Instead, initialization can run from the first NNC pushed to the reconciler. This consolidates all CNS NNC reads into a single location where list-watch optimizations and filtering can be centralized.
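A hedged sketch of the caller side of that flow, with hypothetical names for the constructor and closure (the real wiring in cns/service/main.go differs):

```go
package main

// Hypothetical sketch: CNS builds an initialization closure and hands it to
// the NNC reconciler, so the NNC is only ever read inside the reconciler flow.
// newReconciler and nodeNetworkConfig are stand-ins, not the actual CNS API.

import (
	"fmt"
	"log"
)

type nodeNetworkConfig struct {
	Name string
}

type reconciler struct {
	initializer func(*nodeNetworkConfig) error
}

func newReconciler(initializer func(*nodeNetworkConfig) error) *reconciler {
	return &reconciler{initializer: initializer}
}

func main() {
	// The closure captures whatever state CNS initialization needs; the first
	// NNC the reconciler is pushed is the one it runs against.
	initFromNNC := func(nnc *nodeNetworkConfig) error {
		if nnc == nil {
			return fmt.Errorf("nil NNC")
		}
		log.Printf("initializing CNS state from NNC %q", nnc.Name)
		// ... seed IPAM pool state from the NNC here ...
		return nil
	}

	_ = newReconciler(initFromNNC)
	// ... register the reconciler with the controller manager and start it ...
}
```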