Create critical-op PDB on-demand to avoid false monitoring alerts #3024
a-thomas-22 wants to merge 3 commits into zalando:master from
Cannot start a pipeline due to: Click on pipeline status check Details link below for more information.
The critical-op PodDisruptionBudget was previously created permanently, but its selector (critical-operation=true) matched no pods during normal operation. This caused false alerts in monitoring systems like kube-prometheus-stack because the PDB expected healthy pods but none matched.

Changes:
- Modified syncCriticalOpPodDisruptionBudget to check if any pods have the critical-operation label before creating/keeping the PDB
- PDB is now created on-demand when pods are labeled (e.g., during major version upgrades) and deleted when labels are removed
- Updated majorVersionUpgrade to explicitly create/delete the PDB around the critical operation for immediate protection
- Removed automatic critical-op PDB creation from initial cluster setup
- Added test to verify on-demand PDB creation and deletion behavior, including edge cases for idempotent create/delete operations

The explicit PDB creation in majorVersionUpgrade ensures immediate protection before the critical operation starts. The sync function serves as a safety net for edge cases like bootstrap (where Patroni applies labels) or operator restarts during critical operations.

Fixes zalando#3020
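The on-demand behavior described in this commit can be sketched as a small decision function. This is a minimal illustration and not the operator's actual code: the `Pod` type, the `Action` values, and `decidePDBAction` are hypothetical names, and pods are modeled as plain structs instead of Kubernetes API objects.

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod (hypothetical type);
// only the labels matter for this sketch.
type Pod struct {
	Name   string
	Labels map[string]string
}

// Action is what a sync step would do with the critical-op PDB (hypothetical).
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create"
	ActionDelete Action = "delete"
)

// Label key taken from the selector mentioned in the commit message.
const criticalOpLabel = "critical-operation"

// hasCriticalOpPods reports whether any pod carries critical-operation=true.
func hasCriticalOpPods(pods []Pod) bool {
	for _, p := range pods {
		if p.Labels[criticalOpLabel] == "true" {
			return true
		}
	}
	return false
}

// decidePDBAction creates the PDB only while labeled pods exist and deletes
// it once the labels are gone. Repeated calls in the same state return
// ActionNone, which keeps create and delete idempotent.
func decidePDBAction(pods []Pod, pdbExists bool) Action {
	labeled := hasCriticalOpPods(pods)
	switch {
	case labeled && !pdbExists:
		return ActionCreate
	case !labeled && pdbExists:
		return ActionDelete
	default:
		return ActionNone
	}
}

func main() {
	idle := []Pod{{Name: "pg-0"}}
	upgrading := []Pod{{Name: "pg-0", Labels: map[string]string{criticalOpLabel: "true"}}}

	fmt.Println(decidePDBAction(idle, true))       // leftover PDB with no matching pods
	fmt.Println(decidePDBAction(upgrading, false)) // labeled pods but no PDB yet
	fmt.Println(decidePDBAction(upgrading, true))  // already in the desired state
}
```

With this shape, a PDB that selects no pods never outlives the critical operation, which is the condition that triggered the false alerts.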
1caf79b to 513291c
Thanks for your contribution. We did not anticipate that such a PDB could cause these issues. We thought it was smart to opt in and out of it when we have to 😃 Unit tests are currently failing. Can you fix them, please?
When PDB creation fails with an "already exists" error, the pdb variable is nil because the initial Get failed. Using pdb.ObjectMeta would cause a panic. Use the cluster method to get the PDB name instead.
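The nil-pointer hazard fixed by this commit can be reproduced in a small, self-contained sketch. Everything here is an assumption for illustration: the `cluster` type, its `get` and `create` function fields, `pdbName`, and the sentinel errors stand in for the operator's real Kubernetes client calls and naming method.

```go
package main

import (
	"errors"
	"fmt"
)

// PDB is a simplified stand-in for a PodDisruptionBudget object (hypothetical).
type PDB struct{ Name string }

var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// cluster holds injectable get/create functions so the race can be simulated.
type cluster struct {
	name   string
	get    func(name string) (*PDB, error)
	create func(name string) error
}

// pdbName derives the PDB name from the cluster itself (hypothetical format),
// so it is available even when no PDB object has been fetched.
func (c *cluster) pdbName() string { return c.name + "-critical-op-pdb" }

// ensureCriticalOpPDB looks the PDB up and creates it if missing.
func ensureCriticalOpPDB(c *cluster) (string, error) {
	name := c.pdbName()
	pdb, err := c.get(name)
	if err == nil {
		return fmt.Sprintf("PDB %q found", pdb.Name), nil
	}
	if !errors.Is(err, errNotFound) {
		return "", err
	}
	if err := c.create(name); err != nil {
		if errors.Is(err, errAlreadyExists) {
			// pdb is nil on this path because the earlier get failed;
			// reading pdb.Name here would panic, so use the derived name.
			return fmt.Sprintf("PDB %q already exists", name), nil
		}
		return "", err
	}
	return fmt.Sprintf("PDB %q created", name), nil
}

func main() {
	// Simulate a race: get reports "not found", then create reports
	// "already exists" because something else created the PDB in between.
	c := &cluster{
		name:   "demo",
		get:    func(string) (*PDB, error) { return nil, errNotFound },
		create: func(string) error { return errAlreadyExists },
	}
	msg, err := ensureCriticalOpPDB(c)
	fmt.Println(msg, err)
}
```

Treating "already exists" as success also keeps the create path idempotent, which matches the edge cases the added test is said to cover.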
I'm not familiar with the CI here, but I think the GitHub Actions unit tests and e2e tests are passing. The failures are from the internal Zalando CI (pipeline and script/build-postgres-operator). Build and tests also pass locally for me. I can't see the details of the failing runs.
Is there anything that can be done to get this through?