
ci: fix known cniv1 pipeline issue and improve log collection #4183

Open
QxBytes wants to merge 19 commits into master from alew/fix-pipeline-2

@QxBytes QxBytes commented Dec 26, 2025

Reason for Change:

There is a known issue in the cniv1 pipeline during IP allocation. One symptom is the phrase "Initializing HTTP client with connection timeout" appearing in the CNI logs. This PR adds a script that checks the logs for such known phrases and, if one is found, marks the stage as succeeded with warnings. If no known phrase is found but there is an error, the pipeline fails as normal.
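A minimal sketch of the check described above, assuming the logs have already been downloaded to a local directory. The function names, arguments, and return codes are illustrative assumptions and are not taken from `hack/scripts/check-cni-log-contents.sh`; the `##vso[...]` lines are standard Azure Pipelines logging commands.

```shell
#!/usr/bin/env bash
# Sketch only: names and arguments here are hypothetical, not the PR's script.

# check_known_issue LOG_DIR PHRASE
# Returns 0 if PHRASE occurs as a fixed string in any file under LOG_DIR.
check_known_issue() {
  local log_dir="$1" phrase="$2"
  # -r: recurse, -F: literal match (no regex), -q: exit status only
  grep -rFq -- "$phrase" "$log_dir"
}

# report_stage_result LOG_DIR PHRASE STEP_FAILED
# STEP_FAILED is "true" when the preceding e2e step reported an error.
report_stage_result() {
  local log_dir="$1" phrase="$2" step_failed="$3"
  if [ "$step_failed" != "true" ]; then
    echo "e2e step succeeded; nothing to check"
    return 0
  fi
  if check_known_issue "$log_dir" "$phrase"; then
    # Known issue: surface a warning and finish the task as succeeded with issues.
    echo "##vso[task.logissue type=warning]Known issue detected: $phrase"
    echo "##vso[task.complete result=SucceededWithIssues;]Known issue detected"
    return 0
  fi
  # Any other error fails the stage as normal.
  echo "e2e step failed and no known issue was found in $log_dir"
  return 1
}
```

Matching with `grep -F` treats the phrase as a literal string, so punctuation in a log message is never interpreted as a regular expression.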

This PR also adds tolerations to the privileged pods so that they are always scheduled, even if Cilium or other components add taints to the nodes.

It also moves the CNI/CNS log collection steps into Windows-specific and Linux-specific scripts. The goal is that anyone can set their kubectx to a cluster, run the collection scripts with the appropriate parameters, and have the logs downloaded automatically, even outside of pipeline environments.
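As a rough illustration of what such a standalone collection step could look like, the sketch below copies one file out of every pod matching a label selector, using whatever cluster the current kube context points at. The function name, its parameters, and the values in the usage line are hypothetical; they are not the interface of `hack/scripts/collect-linux-logs.sh`.

```shell
#!/usr/bin/env bash
# Sketch only: function name, parameters, and defaults are assumptions.

# collect_pod_file NAMESPACE SELECTOR REMOTE_PATH OUT_DIR
# Writes REMOTE_PATH from each matching pod to OUT_DIR/<pod-name>.log.
collect_pod_file() {
  local ns="$1" selector="$2" remote_path="$3" out_dir="$4"
  local pod name
  mkdir -p "$out_dir"
  # "-o name" prints one "pod/<name>" entry per line.
  for pod in $(kubectl get pods -n "$ns" -l "$selector" -o name); do
    name="${pod#pod/}"
    # A failed read on one pod should not stop collection from the others.
    kubectl exec -n "$ns" "$name" -- cat "$remote_path" > "$out_dir/$name.log" \
      || echo "warning: could not read $remote_path from $name" >&2
  done
}
```

A call might look like `collect_pod_file kube-system "app=privileged-daemonset" /var/log/azure-vnet.log ./logs`, where the label selector and log path are placeholders to be replaced with the values for the cluster in question.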

In the future, the log checking script may also be used to detect other known issues in the pipeline.

Issue Fixed:

See above

Requirements:

Notes:
Green: https://msazure.visualstudio.com/One/_build/results?buildId=147727074&view=results
Detect: https://msazure.visualstudio.com/One/_build/results?buildId=147893558&view=results

@QxBytes QxBytes self-assigned this Dec 26, 2025
@QxBytes QxBytes added the cni Related to CNI. label Dec 26, 2025
@QxBytes QxBytes requested a review from a team as a code owner December 26, 2025 22:19
Copilot AI review requested due to automatic review settings December 26, 2025 22:19
@QxBytes QxBytes added the ci Infra or tooling. label Dec 26, 2025

QxBytes commented Dec 26, 2025

/azp run Azure Container Networking PR

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Pull request overview

This PR addresses a known CNI v1 pipeline issue during IP allocation and improves log collection infrastructure. The changes introduce automated detection of known issues, enhance pod scheduling reliability, and refactor log collection into reusable scripts.

Key changes:

  • Adds tolerations to privileged DaemonSets to ensure scheduling on all nodes regardless of taints
  • Creates standalone log collection scripts for Linux and Windows that can be run both in pipelines and locally
  • Implements a warning handler job that checks for known error patterns in logs and marks stages as succeeded with issues when detected

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 21 comments.

| File | Description |
| --- | --- |
| test/integration/manifests/load/privileged-daemonset.yaml | Adds broad toleration to ensure privileged pods schedule on all nodes |
| test/integration/manifests/load/privileged-daemonset-windows.yaml | Adds broad toleration to Windows privileged pods |
| hack/scripts/collect-windows-logs.sh | New reusable script for collecting Windows CNI/CNS logs |
| hack/scripts/collect-linux-logs.sh | New reusable script for collecting Linux CNI/CNS logs |
| hack/scripts/check-cni-log-contents.sh | New script to search logs for known issue patterns |
| .pipelines/templates/warning-handler-job-template.yaml | New template for handling warnings when known issues are detected |
| .pipelines/templates/log-template.yaml | Refactored to use new log collection scripts and added NNC description |
| .pipelines/singletenancy/aks/e2e-job-template.yaml | Integrates warning handler for CNI v1 Linux jobs |
| .pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-step-template.yaml | Adds verbose flag to datapath test |

@vipul-21

Approved; we discussed the comments offline. The issue has only occurred in the pipeline so far, so we will skip it, as discussed with @tamilmani1989 per @QxBytes.

@paulyufan2

/azp run Azure Container Networking PR

@paulyufan2 paulyufan2 enabled auto-merge January 22, 2026 18:43
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@paulyufan2 paulyufan2 added this pull request to the merge queue Jan 22, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 22, 2026
@paulyufan2 paulyufan2 added this pull request to the merge queue Jan 23, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 23, 2026
@paulyufan2 paulyufan2 added this pull request to the merge queue Jan 28, 2026
github-merge-queue bot pushed a commit that referenced this pull request Jan 29, 2026
* initial modification

* patch files

* revert this after testing

* update cat as powershell may start in C:\hpc folder

* add verbose to stateless test since it timed out last time

* add nnc to debug logs

* continue on error in cniv1 case if we hit known issue

If the downloaded cni log contains Initializing HTTP client with connection timeout
If there is any other error, we fail the pipeline as normal right after the regular e2e step template finishes

* set cni variable to cniv1 for nodesubnet case

* revert after testing: force error

* fix cluster name

* add check cni log contents

* set message to look for in logs

* remove force fail

* move to template

* test unhappy path

* Revert "test unhappy path"

This reverts commit c2ee459.

* Revert "revert this after testing"

This reverts commit 476dc69.

* make both os privileged debug pods tolerate all taints

without the toleration the privileged ds may sit at zero desired
and will report as "successfully deployed"

* add comment
@QxBytes QxBytes removed this pull request from the merge queue due to a manual request Jan 29, 2026

QxBytes commented Jan 29, 2026

This PR can't be merged until the E2E pipelines stop warning on every stage.

Labels

ci Infra or tooling. cni Related to CNI.

5 participants