Create critical-op PDB on-demand to avoid false monitoring alerts #3024

Open
a-thomas-22 wants to merge 3 commits into zalando:master from a-thomas-22:fix/critical-op-pdb-on-demand

@a-thomas-22 a-thomas-22 commented Jan 2, 2026

The critical-op PodDisruptionBudget was previously created permanently, but its selector (critical-operation=true) matched no pods during normal operation. This caused false alerts in monitoring systems like kube-prometheus-stack because the PDB expected healthy pods but none matched.

Changes:

  • Modified syncCriticalOpPodDisruptionBudget to check if any pods have the critical-operation label before creating/keeping the PDB
  • PDB is now created on-demand when pods are labeled (e.g., during major version upgrades) and deleted when labels are removed
  • Updated majorVersionUpgrade to explicitly create/delete the PDB around the critical operation for immediate protection
  • Removed automatic critical-op PDB creation from initial cluster setup
  • Added test to verify on-demand PDB creation and deletion behavior
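The on-demand rule in the list above can be sketched as a small decision function. This is a minimal, standard-library-only illustration and not the code from this PR: the `Pod` type, the `Action` values, and the `decide` helper are hypothetical stand-ins for the Kubernetes objects and for whatever `syncCriticalOpPodDisruptionBudget` actually does internally.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for a Kubernetes pod; only labels matter here.
type Pod struct {
	Name   string
	Labels map[string]string
}

// Action is what one sync pass should do with the critical-op PDB.
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create"
	ActionDelete Action = "delete"
)

// Label key taken from the PR description (selector critical-operation=true).
const criticalOpLabel = "critical-operation"

// anyCriticalOp reports whether at least one pod carries critical-operation=true.
func anyCriticalOp(pods []Pod) bool {
	for _, p := range pods {
		if p.Labels[criticalOpLabel] == "true" {
			return true
		}
	}
	return false
}

// decide keeps the PDB only while some pod is labeled, so the PDB never
// exists with a selector that matches nothing.
func decide(pods []Pod, pdbExists bool) Action {
	labeled := anyCriticalOp(pods)
	switch {
	case labeled && !pdbExists:
		return ActionCreate
	case !labeled && pdbExists:
		return ActionDelete
	default:
		return ActionNone
	}
}

func main() {
	pods := []Pod{
		{Name: "pg-0", Labels: map[string]string{criticalOpLabel: "true"}},
		{Name: "pg-1"},
	}
	fmt.Println(decide(pods, false)) // a labeled pod exists, no PDB yet
	fmt.Println(decide(nil, true))   // no labeled pods, stale PDB present
}
```

Treating the two "nothing to do" cases as `ActionNone` keeps repeated sync passes idempotent, which matches the idempotent create/delete behavior the tests in this PR are described as covering.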

The explicit PDB creation in majorVersionUpgrade ensures immediate protection before the critical operation starts. The sync function serves as a safety net for edge cases like bootstrap (where Patroni applies labels) or operator restarts during critical operations.
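The create-before, delete-after pattern described for `majorVersionUpgrade` might look roughly like the sketch below. Everything here is an assumption made for illustration: `pdbStore` is an in-memory stand-in for the Kubernetes API, and `runCritical` is a hypothetical wrapper. The sketch removes the PDB even when the operation fails; whether the PR does the same on a failed upgrade is not stated in this conversation.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpgradeFailed is a hypothetical error used to simulate a failed upgrade.
var errUpgradeFailed = errors.New("upgrade failed")

// pdbStore is a hypothetical in-memory stand-in for the Kubernetes API.
type pdbStore struct {
	exists bool
	log    []string
}

// ensure creates the PDB if it is missing; calling it twice is harmless.
func (s *pdbStore) ensure() {
	if !s.exists {
		s.exists = true
		s.log = append(s.log, "create")
	}
}

// remove deletes the PDB if it is present; calling it twice is harmless.
func (s *pdbStore) remove() {
	if s.exists {
		s.exists = false
		s.log = append(s.log, "delete")
	}
}

// runCritical guarantees the PDB exists before op starts and is removed
// once op returns, regardless of the outcome (an assumption of this sketch).
func runCritical(s *pdbStore, op func() error) error {
	s.ensure()
	defer s.remove()
	return op()
}

func main() {
	s := &pdbStore{}
	err := runCritical(s, func() error {
		fmt.Println("PDB present during upgrade:", s.exists)
		return errUpgradeFailed
	})
	fmt.Println("error:", err)
	fmt.Println("PDB present afterwards:", s.exists)
	fmt.Println("calls:", s.log)
}
```

Because both `ensure` and `remove` are idempotent, a sync pass that races with this wrapper would not produce duplicate create or delete calls in this model.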

Fixes #3020

@zalando-robot

Cannot start a pipeline due to:

No accountable user for this pipeline: no Zalando employee associated to this GitHub username

Click on pipeline status check Details link below for more information.

The critical-op PodDisruptionBudget was previously created permanently,
but its selector (critical-operation=true) matched no pods during normal
operation. This caused false alerts in monitoring systems like
kube-prometheus-stack because the PDB expected healthy pods but none
matched.

Changes:
- Modified syncCriticalOpPodDisruptionBudget to check if any pods have
  the critical-operation label before creating/keeping the PDB
- PDB is now created on-demand when pods are labeled (e.g., during
  major version upgrades) and deleted when labels are removed
- Updated majorVersionUpgrade to explicitly create/delete the PDB
  around the critical operation for immediate protection
- Removed automatic critical-op PDB creation from initial cluster setup
- Added test to verify on-demand PDB creation and deletion behavior,
  including edge cases for idempotent create/delete operations

The explicit PDB creation in majorVersionUpgrade ensures immediate
protection before the critical operation starts. The sync function
serves as a safety net for edge cases like bootstrap (where Patroni
applies labels) or operator restarts during critical operations.

Fixes zalando#3020
@a-thomas-22 a-thomas-22 force-pushed the fix/critical-op-pdb-on-demand branch from 1caf79b to 513291c Compare January 2, 2026 21:05

@a-thomas-22 a-thomas-22 marked this pull request as ready for review January 2, 2026 21:08
@FxKu FxKu added the minor label Jan 8, 2026
@FxKu FxKu added this to the 1.15.2 milestone Jan 8, 2026
@FxKu FxKu moved this to Waiting for review in Postgres Operator Jan 8, 2026
FxKu (Member) commented Jan 9, 2026

Thanks for your contribution. We did not anticipate that such a PDB could cause these issues. We thought it would be smart to be able to opt in and out of it as needed 😃

Unit tests are currently failing. Can you fix them, please?

When the PDB creation fails with "already exists" error, the pdb
variable is nil since the initial Get failed. Using pdb.ObjectMeta
would cause a panic. Use the cluster method to get the PDB name instead.
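The nil-pointer hazard described in this commit message can be reproduced in miniature. The sketch below is a hedged illustration using only the standard library: `PDB`, `fakeAPI`, `pdbName`, and `ensurePDB` are hypothetical names, the name format returned by `pdbName` is an assumption, and the real code uses the Kubernetes client and the cluster's own naming method instead.

```go
package main

import (
	"errors"
	"fmt"
)

// PDB is a hypothetical stand-in for a PodDisruptionBudget object.
type PDB struct{ Name string }

var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// fakeAPI simulates the problematic sequence: Get misses, Create conflicts.
type fakeAPI struct{}

func (fakeAPI) Get(name string) (*PDB, error) { return nil, errNotFound }
func (fakeAPI) Create(p *PDB) error           { return errAlreadyExists }

// pdbName stands in for the cluster method that derives the PDB name.
// The format used here is an assumption for illustration only.
func pdbName(cluster string) string {
	return "postgres-" + cluster + "-critical-op-pdb"
}

// ensurePDB returns the name of the PDB that exists after the call.
func ensurePDB(api fakeAPI, cluster string) (string, error) {
	name := pdbName(cluster)
	pdb, err := api.Get(name)
	if err == nil {
		return pdb.Name, nil // pdb is non-nil only on this path
	}
	if !errors.Is(err, errNotFound) {
		return "", err
	}
	err = api.Create(&PDB{Name: name})
	if errors.Is(err, errAlreadyExists) {
		// pdb is still nil here, so pdb.Name would panic.
		// Use the derived name instead of the object from the failed Get.
		return name, nil
	}
	if err != nil {
		return "", err
	}
	return name, nil
}

func main() {
	name, err := ensurePDB(fakeAPI{}, "acid-test")
	fmt.Println(name, err)
}
```

Deriving the name once, before any API call, means every later branch can refer to it without depending on which call succeeded.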

a-thomas-22 (Author) commented

> Thanks for your contribution. We did not anticipate that such a PDB could cause these issues. We thought it would be smart to be able to opt in and out of it as needed 😃
>
> Unit tests are currently failing. Can you fix them, please?

I'm not familiar with the CI here, but I think the GitHub Actions unit tests and e2e tests are passing. The failures come from the internal Zalando CI (pipeline and script/build-postgres-operator). The build and tests also pass locally for me. I can't see the details of the failing runs.

vquie commented Feb 2, 2026

Is there anything that can be done to get this through?

Labels

Projects

Status: Waiting for review

Development

Successfully merging this pull request may close these issues.

Newly introduced critical-op PDB causes tons of alerts with kube-prometheus-stack

4 participants