ci: fix known cniv1 pipeline issue and improve log collection #4183
Conversation
If the downloaded cni log contains "Initializing HTTP client with connection timeout", we mark the stage as succeeded with warnings instead of failing. If there is any other error, we fail the pipeline as normal right after the regular e2e step template finishes.
This reverts commit c2ee459.
This reverts commit 476dc69.
Without the toleration, the privileged DaemonSet may sit at zero desired pods and will report as "successfully deployed".
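A minimal sketch of that toleration in the DaemonSet pod spec (standard Kubernetes semantics; the PR's actual manifests live under test/integration/manifests/load/):

```yaml
# Tolerate every taint, with any key and any effect, so the privileged
# DaemonSet schedules onto all nodes instead of sitting at zero desired.
tolerations:
  - operator: "Exists"
```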
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
This PR addresses a known CNI v1 pipeline issue during IP allocation and improves log collection infrastructure. The changes introduce automated detection of known issues, enhance pod scheduling reliability, and refactor log collection into reusable scripts.
Key changes:
- Adds tolerations to privileged DaemonSets to ensure scheduling on all nodes regardless of taints
- Creates standalone log collection scripts for Linux and Windows that can be run both in pipelines and locally
- Implements a warning handler job that checks for known error patterns in logs and marks stages as succeeded with issues when detected (see the sketch after this list)
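One plausible wiring for that handler as an Azure Pipelines step; the `##vso` logging command is real, but the script arguments and condition here are assumptions, not the PR's exact template:

```yaml
# Hypothetical warning-handler step: if the known phrase is in the collected
# logs, downgrade the result instead of failing the stage.
- bash: |
    if bash hack/scripts/check-cni-log-contents.sh ./logs "Initializing HTTP client with connection timeout"; then
      # Azure Pipelines logging command: mark this task SucceededWithIssues.
      echo "##vso[task.complete result=SucceededWithIssues;]known cniv1 issue detected"
    fi
  displayName: Handle known cniv1 issue
  condition: failed()   # run only when an earlier step in the job failed
```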
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 21 comments.
| File | Description |
|---|---|
| test/integration/manifests/load/privileged-daemonset.yaml | Adds broad toleration to ensure privileged pods schedule on all nodes |
| test/integration/manifests/load/privileged-daemonset-windows.yaml | Adds broad toleration to Windows privileged pods |
| hack/scripts/collect-windows-logs.sh | New reusable script for collecting Windows CNI/CNS logs |
| hack/scripts/collect-linux-logs.sh | New reusable script for collecting Linux CNI/CNS logs |
| hack/scripts/check-cni-log-contents.sh | New script to search logs for known issue patterns (sketched below this table) |
| .pipelines/templates/warning-handler-job-template.yaml | New template for handling warnings when known issues are detected |
| .pipelines/templates/log-template.yaml | Refactored to use new log collection scripts and added NNC description |
| .pipelines/singletenancy/aks/e2e-job-template.yaml | Integrates warning handler for CNI v1 Linux jobs |
| .pipelines/singletenancy/azure-cni-overlay-stateless/azure-cni-overlay-stateless-e2e-step-template.yaml | Adds verbose flag to datapath test |
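A minimal sketch of what the pattern check might look like, assuming a two-argument interface (log directory, phrase); the PR's actual script may differ:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of check-cni-log-contents.sh: search downloaded logs
# for a known phrase and report the result through the exit code.
set -euo pipefail
LOG_DIR="${1:?usage: $0 <log-dir> <phrase>}"
PHRASE="${2:?usage: $0 <log-dir> <phrase>}"

# Exit 0 when the phrase is found (caller downgrades the stage to a warning);
# exit non-zero otherwise (caller fails the pipeline as normal).
grep -rqF -- "$PHRASE" "$LOG_DIR"
```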
Approved; the comments were discussed offline. The issue has only occurred in the pipeline so far, so we will be skipping it, as discussed with @tamilmani1989 per @QxBytes.

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
* initial modification
* patch files
* revert this after testing
* update cat as powershell may start in C:\hpc folder
* add verbose to stateless test since it timed out last time
* add nnc to debug logs (see the sketch after this list)
* continue on error in cniv1 case if we hit a known issue: the downloaded cni log contains "Initializing HTTP client with connection timeout". If there is any other error, we fail the pipeline as normal right after the regular e2e step template finishes
* set cni variable to cniv1 for nodesubnet case
* revert after testing: force error
* fix cluster name
* add check cni log contents
* set message to look for in logs
* remove force fail
* move to template
* test unhappy path
* Revert "test unhappy path" (reverts commit c2ee459)
* Revert "revert this after testing" (reverts commit 476dc69)
* make both os privileged debug pods tolerate all taints: without the toleration the privileged ds may sit at zero desired and will report as "successfully deployed"
* add comment
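For the "add nnc to debug logs" item, a minimal local equivalent might look like the following, assuming the Azure CNI NodeNetworkConfig CRD is installed and exposes the short name `nnc` (an assumption, not taken from the PR):

```bash
# Dump NodeNetworkConfig custom resources across all namespaces for debugging.
# "nnc" as a short name for NodeNetworkConfig is assumed here.
kubectl get nnc -A -o wide
kubectl describe nnc -A > nnc-describe.txt
```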
This PR can't be merged until the E2E pipelines stop warning on every stage.
Reason for Change:
There is a known issue in the pipeline for cniv1 during IP allocation; a symptom is "Initializing HTTP client with connection timeout" showing up in the cni logs. This PR adds a script that checks the log contents for these known phrases and marks the stage as succeeded with warnings when one is found. If the phrase is not found but there is an error, we fail out as normal.
Additionally adds tolerations to the privileged pods so that they are always scheduled, even if cilium or other components add taints to the nodes.
Additionally moves the cni/cns log collection steps into Windows- and Linux-specific scripts. The goal is that anyone can point their kubectx at a cluster, run the collection scripts with the appropriate parameters, and have the logs downloaded automatically, even outside of pipeline environments.
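A plausible shape for the Linux collection flow, assuming a privileged debug DaemonSet with the host's /var/log mounted and the standard Azure CNI/CNS log paths; the pod label and file paths here are assumptions, not the PR's exact script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a local log-collection run (not the PR's exact script).
set -euo pipefail
OUT_DIR="${1:-./cni-logs}"
mkdir -p "$OUT_DIR"

# Assumes a privileged DaemonSet pod on each node with the host /var/log
# mounted; the label selector is illustrative.
for pod in $(kubectl -n kube-system get pods -l app=privileged-daemonset -o name); do
  name="${pod#pod/}"
  # Typical Azure CNI / CNS log locations on Linux nodes.
  kubectl -n kube-system cp "$name:/var/log/azure-vnet.log" "$OUT_DIR/$name-azure-vnet.log" || true
  kubectl -n kube-system cp "$name:/var/log/azure-cns.log" "$OUT_DIR/$name-azure-cns.log" || true
done
echo "Logs written to $OUT_DIR"
```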
In the future, the log checking script may also be used to detect other known issues in the pipeline.
Issue Fixed:
See above
Requirements:
Notes:
Green: https://msazure.visualstudio.com/One/_build/results?buildId=147727074&view=results
Detect: https://msazure.visualstudio.com/One/_build/results?buildId=147893558&view=results