ci: fix known cniv1 pipeline issue and improve log collection #4183
Conversation
If the downloaded cni log contains "Initializing HTTP client with connection timeout", we mark the stage as succeeded with warnings instead of failing. If there is any other error, we fail the pipeline as normal right after the regular e2e step template finishes.
This reverts commit c2ee459.
This reverts commit 476dc69.
Without the toleration, the privileged DaemonSet may sit at zero desired pods and will report as "successfully deployed".
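A minimal sketch of that toleration in the DaemonSet pod spec (standard Kubernetes semantics; the PR's actual manifests live under test/integration/manifests/load/):

```yaml
# Tolerate every taint, with any key and any effect, so the privileged
# DaemonSet schedules onto all nodes instead of sitting at zero desired.
tolerations:
  - operator: "Exists"
```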
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
This PR addresses a known CNI v1 pipeline issue during IP allocation and improves log collection infrastructure. The changes introduce automated detection of known issues, enhance pod scheduling reliability, and refactor log collection into reusable scripts.
Key changes:
- Adds tolerations to privileged DaemonSets to ensure scheduling on all nodes regardless of taints
- Creates standalone log collection scripts for Linux and Windows that can be run both in pipelines and locally
- Implements a warning handler job that checks for known error patterns in logs and marks stages as succeeded with issues when detected (see the sketch after this list)
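One plausible wiring for that handler as an Azure Pipelines step; the `##vso` logging command is real, but the script arguments and condition here are assumptions, not the PR's exact template:

```yaml
# Hypothetical warning-handler step: if the known phrase is in the collected
# logs, downgrade the result instead of failing the stage.
- bash: |
    if bash hack/scripts/check-cni-log-contents.sh ./logs "Initializing HTTP client with connection timeout"; then
      # Azure Pipelines logging command: mark this task SucceededWithIssues.
      echo "##vso[task.complete result=SucceededWithIssues;]known cniv1 issue detected"
    fi
  displayName: Handle known cniv1 issue
  condition: failed()   # run only when an earlier step in the job failed
```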
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 21 comments.
| File | Description |
|---|---|
| test/integration/manifests/load/privileged-daemonset.yaml | Adds broad toleration to ensure privileged pods schedule on all nodes |
| test/integration/manifests/load/privileged-daemonset-windows.yaml | Adds broad toleration to Windows privileged pods |
| hack/scripts/collect-windows-logs.sh | New reusable script for collecting Windows CNI/CNS logs |
| hack/scripts/collect-linux-logs.sh | New reusable script for collecting Linux CNI/CNS logs |
| hack/scripts/check-cni-log-contents.sh | New script to search logs for known issue patterns (sketched below this table) |
| .pipelines/templates/warning-handler-job-template.yaml | New template for handling warnings when known issues are detected |
| .pipelines/templates/log-template.yaml | Refactored to use new log collection scripts and added NNC description |
| .pipelines/singletenancy/aks/e2e-job-template.yaml | Integrates warning handler for CNI v1 Linux jobs |
| .pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-step-template.yaml | Adds verbose flag to datapath test |
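A minimal sketch of what the pattern check might look like, assuming a two-argument interface (log directory, phrase); the PR's actual script may differ:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of check-cni-log-contents.sh: search downloaded logs
# for a known phrase and report the result through the exit code.
set -euo pipefail
LOG_DIR="${1:?usage: $0 <log-dir> <phrase>}"
PHRASE="${2:?usage: $0 <log-dir> <phrase>}"

# Exit 0 when the phrase is found (caller downgrades the stage to a warning);
# exit non-zero otherwise (caller fails the pipeline as normal).
grep -rqF -- "$PHRASE" "$LOG_DIR"
```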
Approved; the comments were discussed offline. The issue has only occurred in the pipeline so far, so we will be skipping it, as discussed with @tamilmani1989 per @QxBytes.

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
* initial modification
* patch files
* revert this after testing
* update cat as powershell may start in C:\hpc folder
* add verbose to stateless test since it timed out last time
* add nnc to debug logs (see the sketch after this list)
* continue on error in cniv1 case if we hit a known issue: the downloaded cni log contains "Initializing HTTP client with connection timeout". If there is any other error, we fail the pipeline as normal right after the regular e2e step template finishes
* set cni variable to cniv1 for nodesubnet case
* revert after testing: force error
* fix cluster name
* add check cni log contents
* set message to look for in logs
* remove force fail
* move to template
* test unhappy path
* Revert "test unhappy path" (reverts commit c2ee459)
* Revert "revert this after testing" (reverts commit 476dc69)
* make both os privileged debug pods tolerate all taints: without the toleration the privileged ds may sit at zero desired and will report as "successfully deployed"
* add comment
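For the "add nnc to debug logs" item, a minimal local equivalent might look like the following, assuming the Azure CNI NodeNetworkConfig CRD is installed and exposes the short name `nnc` (an assumption, not taken from the PR):

```bash
# Dump NodeNetworkConfig custom resources across all namespaces for debugging.
# "nnc" as a short name for NodeNetworkConfig is assumed here.
kubectl get nnc -A -o wide
kubectl describe nnc -A > nnc-describe.txt
```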
This PR can't be merged until the E2E pipelines stop warning on every stage.
Reason for Change:
There is a known issue in the pipeline for cniv1 during IP allocation; a symptom is "Initializing HTTP client with connection timeout" showing up in the cni logs. This PR adds a script that checks the log contents for these known phrases and marks the stage as succeeded with warnings when one is found. If the phrase is not found but there is an error, we fail out as normal.
Additionally adds tolerations to the privileged pods so that they are always scheduled, even if cilium or other components add taints to the nodes.
Additionally moves the cni/cns log collection steps into Windows- and Linux-specific scripts. The goal is that anyone can point their kubectx at a cluster, run the collection scripts with the appropriate parameters, and have the logs downloaded automatically, even outside of pipeline environments.
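A plausible shape for the Linux collection flow, assuming a privileged debug DaemonSet with the host's /var/log mounted and the standard Azure CNI/CNS log paths; the pod label and file paths here are assumptions, not the PR's exact script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a local log-collection run (not the PR's exact script).
set -euo pipefail
OUT_DIR="${1:-./cni-logs}"
mkdir -p "$OUT_DIR"

# Assumes a privileged DaemonSet pod on each node with the host /var/log
# mounted; the label selector is illustrative.
for pod in $(kubectl -n kube-system get pods -l app=privileged-daemonset -o name); do
  name="${pod#pod/}"
  # Typical Azure CNI / CNS log locations on Linux nodes.
  kubectl -n kube-system cp "$name:/var/log/azure-vnet.log" "$OUT_DIR/$name-azure-vnet.log" || true
  kubectl -n kube-system cp "$name:/var/log/azure-cns.log" "$OUT_DIR/$name-azure-cns.log" || true
done
echo "Logs written to $OUT_DIR"
```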
In the future, the log checking script may also be used to detect other known issues in the pipeline.
Issue Fixed:
See above
Requirements:
Notes:
Green: https://msazure.visualstudio.com/One/_build/results?buildId=147727074&view=results
Detect: https://msazure.visualstudio.com/One/_build/results?buildId=147893558&view=results