From 0b178f63b6d0748b0d6835160ccb626e78534bbc Mon Sep 17 00:00:00 2001
From: Lawton Mizell
Date: Mon, 26 Jan 2026 16:48:09 -0500
Subject: [PATCH] OCM-19740 - local observability

---
 docs/development.md     |  20 ++
 docs/metrics-tracing.md | 642 ++++++++++++++++++++++++++++++++++++++++
 docs/metrics.md         |  66 +++++
 3 files changed, 728 insertions(+)
 create mode 100644 docs/metrics-tracing.md

diff --git a/docs/development.md b/docs/development.md
index 13296fd1..b05b27b7 100644
--- a/docs/development.md
+++ b/docs/development.md
@@ -17,6 +17,7 @@
   - [Remotely](#remotely)
   - [Trigger Reconcile](#trigger-reconcile)
   - [Monitoring ongoing upgrade](#monitoring-ongoing-upgrade)
+  - [Viewing Metrics Locally](#viewing-metrics-locally)
   - [Maintenance](#maintenance)
 
 This document should entail all you need to develop this operator locally.
@@ -341,6 +342,25 @@ oc get clusterversion -w
 oc logs cluster-version-operator- -n openshift-cluster-version -f
 ```
 
+## Viewing Metrics Locally
+
+When running the operator locally, Prometheus metrics are exposed at `http://localhost:8383/metrics`.
+
+```shell
+# In one terminal, run the operator
+make run-standard-routes
+
+# In another terminal, view metrics
+curl http://localhost:8383/metrics
+
+# Filter to operator-specific metrics
+curl -s http://localhost:8383/metrics | grep upgradeoperator
+```
+
+For details on what each metric means and when it fires, see:
+- [metrics.md](metrics.md) - Metric reference and alert mappings
+- [metrics-tracing.md](metrics-tracing.md) - Where metrics are triggered in code
+
 ## Maintenance
 
 We can leverage the script for maintenance:
diff --git a/docs/metrics-tracing.md b/docs/metrics-tracing.md
new file mode 100644
index 00000000..70d9dbc6
--- /dev/null
+++ b/docs/metrics-tracing.md
@@ -0,0 +1,642 @@
+# Managed Upgrade Operator - Metrics Tracing Guide
+
+This document provides a comprehensive mapping of all Prometheus metrics defined in `pkg/metrics/metrics.go`, showing where they are triggered throughout the codebase and explaining why each metric fires.
+
+## Table of Contents
+
+1. [Overview](#overview)
+2. [Validation Metrics](#validation-metrics)
+3. [Scaling Metrics](#scaling-metrics)
+4. [Upgrade Window Metrics](#upgrade-window-metrics)
+5. [Control Plane Timeout Metrics](#control-plane-timeout-metrics)
+6. [Health Check Metrics](#health-check-metrics)
+7. [Worker Timeout Metrics](#worker-timeout-metrics)
+8. [Node Drain Metrics](#node-drain-metrics)
+9. [Notification Metrics](#notification-metrics)
+10. [Timestamp Metrics](#timestamp-metrics)
+11. [Upgrade Result Metrics](#upgrade-result-metrics)
+12. [Reset Operations](#reset-operations)
+13. [E2E Test Coverage](#e2e-test-coverage)
+
+## Overview
+
+The Managed Upgrade Operator exposes **17 Prometheus metrics** organized into two categories:
+
+- **Ephemeral Metrics (16)**: Reset when upgrade completes or UpgradeConfig is deleted
+- **Persistent Metrics (1)**: Retained across upgrades to track historical results
+
+All metrics use the `upgradeoperator` subsystem prefix (except `upgrade_notification_failed`).
+
+---
+
+## Validation Metrics
+
+### `upgradeoperator_upgradeconfig_validation_failed`
+
+**Metric Type**: GaugeVec
+**Labels**: `upgradeconfig_name`
+**Values**: `1` = failed, `0` = succeeded
+
+#### Set to 1 (Failed)
+- **File**: `controllers/upgradeconfig/upgradeconfig_controller.go:209`
+- **Method**: `UpdateMetricValidationFailed(instance.Name)`
+- **Trigger**: `validator.IsValidUpgradeConfig()` returns invalid or error
+- **Why**: UpgradeConfig CR has validation issues:
+  - Invalid `upgradeAt` time format
+  - Missing required fields
+  - Invalid channel or version
+  - Scheduling conflicts
+
+#### Set to 0 (Succeeded)
+- **File**: `controllers/upgradeconfig/upgradeconfig_controller.go:213`
+- **Method**: `UpdateMetricValidationSucceeded(instance.Name)`
+- **Trigger**: Validation passes successfully
+- **Why**: UpgradeConfig is valid and ready to schedule
+
+**Alert**: `UpgradeConfigValidationFailedSRE` (paging)
+
+---
+
+## Scaling Metrics
+
+### `upgradeoperator_scaling_failed`
+
+**Metric Type**: GaugeVec
+**Labels**: `upgradeconfig_name`
+**Values**: `1` = failed, `0` = succeeded
+
+#### Set to 1 (Failed)
+- **File**: `pkg/upgraders/scalerstep.go:51`
+- **Method**: `UpdateMetricScalingFailed(c.upgradeConfig.Name)`
+- **Trigger**: `scaler.EnsureScaleUpNodes()` returns `ScaleTimeOutError`
+- **Why**: Pre-upgrade capacity reservation failed
+  - Extra worker nodes didn't become Ready in time
+  - Timeout defined by `config.Scale.TimeOut`
+  - Critical for ensuring customer capacity during upgrades
+
+**Context**: Only applies when `spec.capacityReservation: true`
+
+#### Set to 0 (Succeeded)
+- **File**: `pkg/upgraders/scalerstep.go:62`
+- **Method**: `UpdateMetricScalingSucceeded(c.upgradeConfig.Name)`
+- **Trigger**: `scaler.EnsureScaleUpNodes()` completes successfully
+- **Why**: Extra worker nodes scaled up and are Ready
+
+---
+
+## Upgrade Window Metrics
+
+### `upgradeoperator_upgrade_window_breached`
+
+**Metric Type**: GaugeVec
+**Labels**: `upgradeconfig_name`
+**Values**: `1` = breached, `0` = not breached
+
+#### Set to 1 (Breached)
+- **File**: `pkg/upgraders/osdupgrader.go:170`
+- **Method**: `UpdateMetricUpgradeWindowBreached(upgradeConfig.Name)`
+- **Trigger**: Upgrade didn't complete within maintenance window
+- **Why**:
+  - Upgrade exceeded `GetUpgradeWindowTimeOutDuration()`
+  - Maintenance window closed before upgrade finished
+  - SRE intervention likely required
+
+**Alert**: Triggers escalation for incomplete upgrades
+
+#### Set to 0 (Not Breached)
+- **File**: `pkg/upgraders/controlplanestep.go:19`
+- **Method**: `UpdateMetricUpgradeWindowNotBreached(c.upgradeConfig.Name)`
+- **Trigger**: `CommenceUpgrade()` called successfully
+- **Why**: Upgrade is proceeding within the allowed window
+
+---
+
+## Control Plane Timeout Metrics
+
+### `upgradeoperator_controlplane_timeout`
+
+**Metric Type**: GaugeVec
+**Labels**: `upgradeconfig_name`, `version`
+**Values**: `1` = timeout, `0` = no timeout
+
+#### Set to 1 (Timeout)
+- **File**: `pkg/upgraders/controlplanestep.go:80`
+- **Method**: `UpdateMetricUpgradeControlPlaneTimeout(c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version)`
+- **Trigger**: Control plane upgrade exceeds timeout
+- **Condition**: `time.Now().After(upgradeStartTime.Add(upgradeTimeout))`
+- **Timeout**: `config.Maintenance.GetControlPlaneDuration()`
+- **Why**: Control plane components stuck:
+  - Cluster Version Operator (CVO) issues
+  - Master node upgrades hanging
+  - API server availability problems
+
+**Alert**: `UpgradeControlPlaneUpgradeTimeoutSRE` (paging)
+
+#### Set to 0 (Success)
+- **File**: `pkg/upgraders/controlplanestep.go:59`
+- **Method**: `ResetMetricUpgradeControlPlaneTimeout(c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version)`
+- **Trigger**: `cvClient.HasUpgradeCompleted()` returns true for control plane
+- **Why**: Masters and CVO upgraded successfully within timeout
+
+---
+
+## Health Check Metrics
+
+### `upgradeoperator_healthcheck_failed`
+
+**Metric Type**: GaugeVec
+**Labels**: `upgradeconfig_name`, `state`, `version`, `reason`
+**Values**: `1` = failed, `0` = succeeded
+
+**Health Check Reasons** (from `pkg/metrics/metrics.go:58-70`):
+- `healthcheck_query_failed` - Cannot query Prometheus
+- `critical_alerts_firing` - Critical alerts active
+- `cluster_operators_degraded` - ClusterOperators not healthy
+- `cluster_operator_status_failed` - Cannot query ClusterOperator status
+- `default_worker_machinepool_not_found` - Missing worker MachinePool
+- `cluster_node_query_failed` - Cannot query nodes
+- `cluster_node_manually_cordoned` - Nodes manually cordoned
+- `cluster_node_taint_unschedulable` - Nodes tainted unschedulable
+- `cluster_invalid_pdb` - Invalid PodDisruptionBudgets
+- `cluster_invalid_pdb_configuration` - PDB config issues
+- `pdb_query_failed` - Cannot query PDBs
+- `dvo_client_creation_failed` - Cannot create DVO client
+- `dvo_metrics_query_failed` - Cannot query DVO metrics
+
+#### Set to 1 (Failed)
+- **File**: `pkg/upgraders/healthcheckstep.go` (multiple validators)
+- **Methods**: Various health check implementations
+- **Trigger**: Any pre-upgrade or post-upgrade health check fails
+- **Why**: Cluster state is unhealthy for upgrade
+  - Critical alerts prevent safe upgrade
+  - Degraded operators may worsen during upgrade
+  - Invalid PDBs could block node drains
+  - Manual interventions detected
+
+**Alert**: `UpgradeClusterCheckFailedSRE` (paging)
+
+#### Set to 0 (Succeeded)
+- **File**: `pkg/upgraders/healthcheckstep.go` (multiple validators)
+- **Trigger**: Health checks pass
+- **Why**: Cluster is healthy and ready for upgrade
+
+---
+
+## Worker Timeout Metrics
+
+### `upgradeoperator_worker_timeout`
+
+**Metric Type**: GaugeVec
+**Labels**: `upgradeconfig_name`, `version`
+**Values**: `1` = timeout, `0` = no timeout
+
+#### Set to 1 (Timeout)
+- **File**: `pkg/upgraders/workerstep.go:28`
+- **Method**: `UpdateMetricUpgradeWorkerTimeout(c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version)`
+- **Trigger**: Workers still upgrading AND no active maintenance window
+- **Condition**:
+  - `upgradingResult.IsUpgrading == true`
+  - `silenceActive == false`
+- **Why**: Worker upgrades taking too long outside maintenance
+  - Node drain issues
+  - Machine rollout problems
+  - MachineConfigPool stuck
+
+**Alert**: `UpgradeNodeUpgradeTimeoutSRE` (paging)
+
+#### Set to 0 (No Timeout)
+- **File**: `pkg/upgraders/workerstep.go:31`
+- **Method**: `ResetMetricUpgradeWorkerTimeout(c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version)`
+- **Trigger**: Workers still upgrading BUT maintenance window is active
+- **Why**: Upgrade is progressing within allowed timeframe
+
+#### Also Set to 0 (Completed)
+- **File**: `pkg/upgraders/workerstep.go:44`
+- **Method**: `ResetMetricUpgradeWorkerTimeout(c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version)`
+- **Trigger**: All workers upgraded successfully
+- **Why**: Worker upgrade phase completed
+
+---
+
+## Node Drain Metrics
+
+### `upgradeoperator_node_drain_timeout`
+
+**Metric Type**: GaugeVec
+**Labels**: `node_name`
+**Values**: `1` = timeout, `0` = no timeout
+
+#### Set to 1 (Timeout) - NodeKeeper Controller
+- **File**: `controllers/nodekeeper/nodekeeper_controller.go:161`
+- **Method**: `UpdateMetricNodeDrainFailed(node.Name)`
+- **Trigger**: Node drain timed out during upgrade
+- **Conditions**:
+  - `drainStrategy.HasFailed() == true`
+  - `node.DeletionTimestamp == nil`
+  - `Machinery.IsNodeUpgrading(node) == true`
+- **Why**: Node cannot evict all pods in time
+  - PodDisruptionBudget blocking eviction
+  - Pods without proper controllers
+  - Volume detachment issues
+  - Finalizers blocking pod deletion
+
+**Alert**: `UpgradeNodeDrainFailedSRE` (paging)
+
+#### Set to 1 (Timeout) - Scale Down
+- **File**: `pkg/upgraders/scalerstep.go:99`
+- **Method**: `UpdateMetricNodeDrainFailed(dtErr.GetNodeName())`
+- **Trigger**: Extra capacity node failed to drain during scale-down
+- **Condition**: `scaler.IsDrainTimeOutError(err) == true`
+- **Why**: Cannot remove extra nodes after upgrade
+
+#### Set to 0 (Success/Not Applicable) - NodeKeeper
+- **File**: `controllers/nodekeeper/nodekeeper_controller.go:87`
+- **Method**: `ResetMetricNodeDrainFailed(node.Name)`
+- **Trigger**: Node is not cordoned
+- **Why**: Node not undergoing drain operation
+
+- **File**: `controllers/nodekeeper/nodekeeper_controller.go:156`
+- **Method**: `ResetMetricNodeDrainFailed(node.Name)`
+- **Trigger**: Node has DeletionTimestamp set
+- **Why**: Node being deleted, drain metric no longer relevant
+
+- **File**: `controllers/nodekeeper/nodekeeper_controller.go:165`
+- **Method**: `ResetMetricNodeDrainFailed(node.Name)`
+- **Trigger**: Drain succeeded
+- **Why**: Node drained successfully
+
+#### Reset All Nodes
+- **File**: `pkg/upgraders/scalerstep.go:106`
+- **Method**: `ResetAllMetricNodeDrainFailed()`
+- **Trigger**: All extra scaled nodes removed successfully
+- **Why**: Scale-down phase completed, clear all drain metrics
+
+---
+
+## Notification Metrics
+
+### `upgrade_notification_failed`
+
+**Metric Type**: GaugeVec
+**Labels**: `upgradeconfig_name`, `event`
+**Values**: `1` = failed, `0` = succeeded
+
+#### Set to 1 (Failed)
+- **File**: `pkg/eventmanager/eventmanager.go:157`
+- **Method**: `UpdatemetricUpgradeNotificationFailed(uc.Name, string(state))`
+- **Trigger**: `notifier.NotifyState()` returns error
+- **Why**: Failed to send notification
+  - OCM API unavailable
+  - Network connectivity issues
+  - Service log API errors
+  - Authentication failures
+
+#### Set to 0 (Succeeded)
+- **File**: `pkg/eventmanager/eventmanager.go:160`
+- **Method**: `UpdatemetricUpgradeNotificationSucceeded(uc.Name, string(state))`
+- **Trigger**: Notification sent successfully
+- **Why**: Event notification delivered to OCM/ServiceLog
+
+### `upgradeoperator_upgrade_notification`
+
+**Metric Type**: GaugeVec
+**Labels**: `upgradeconfig_name`, `event`, `version`
+**Values**: `1` = sent
+
+**Event Types** (from `pkg/notifier`):
+- `scheduled` - Upgrade scheduled
+- `started` - Upgrade started
+- `control_plane_started` - Control plane upgrade started
+- `control_plane_completed` - Control plane upgrade completed
+- `workers_started` - Worker upgrade started
+- `workers_completed` - Worker upgrade completed
+- `completed` - Upgrade completed
+- `delayed` - Node drain delayed
+- `skipped` - Upgrade skipped (scaling failed)
+
+#### Set to 1 (Event Sent) - State Notifications
+- **File**: `pkg/eventmanager/eventmanager.go:161`
+- **Method**: `UpdateMetricNotificationEventSent(uc.Name, string(state), uc.Spec.Desired.Version)`
+- **Trigger**: After successful state notification
+- **Why**: Tracking which lifecycle events have been sent
+
+#### Set to 1 (Event Sent) - Result Notifications
+- **File**: `pkg/eventmanager/eventmanager.go:201`
+- **Method**: `UpdateMetricNotificationEventSent(uc.Name, string(state), uc.Spec.Desired.Version)`
+- **Trigger**: After sending upgrade result notification
+- **Why**: Tracking upgrade outcome notifications
+
+#### Set to 1 (Event Sent) - Delayed Notifications
+- **File**: `pkg/drain/nodeDrainStrategy.go:92`
+- **Method**: `UpdateMetricNotificationEventSent(ds.uc.Name, string(notifier.MuoStateDelayed), ds.uc.Spec.Desired.Version)`
+- **Trigger**: Node drain taking longer than expected
+- **Why**: Proactively notify about upgrade delays
+
+**Usage**: Query to check if specific notification already sent to avoid duplicates
+
+---
+
+## Timestamp Metrics
+
+These metrics track the upgrade lifecycle timeline. All values are Unix timestamps.
+
+### `upgradeoperator_upgradeconfig_sync_timestamp`
+
+**Metric Type**: GaugeVec
+**Labels**: `upgradeconfig_name`
+**Values**: Unix timestamp
+
+- **File**: `pkg/upgradeconfigmanager/upgradeconfigmanager.go:168`
+- **Method**: `UpdateMetricUpgradeConfigSyncTimestamp(UPGRADECONFIG_CR_NAME, time.Now())`
+- **Trigger**: UpgradeConfig successfully synced from OCM
+- **Why**: Track when upgrade policy was last synchronized
+- **Purpose**: Monitor sync frequency and detect stale policies
+
+### `upgradeoperator_upgrade_started_timestamp`
+
+**Metric Type**: GaugeVec
+**Labels**: `_id` (cluster ID), `upgradeconfig_name`, `version`
+**Values**: Unix timestamp
+
+- **File**: `pkg/upgraders/notifierstep.go:32`
+- **Method**: `UpdateMetricUpgradeStartedTimestamp(clusterid, c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version, time.Now())`
+- **Trigger**: Upgrade officially starts (scheduled → started transition)
+- **Why**: Mark beginning of upgrade process
+- **Purpose**: Calculate total upgrade duration
+
+### `upgradeoperator_upgrade_completed_timestamp`
+
+**Metric Type**: GaugeVec
+**Labels**: `_id` (cluster ID), `upgradeconfig_name`, `version`
+**Values**: Unix timestamp
+
+- **File**: `pkg/upgraders/notifierstep.go:45`
+- **Method**: `UpdateMetricUpgradeCompletedTimestamp(clusterid, c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version, time.Now())`
+- **Trigger**: Entire upgrade completes successfully
+- **Why**: Mark end of upgrade process
+- **Purpose**: Calculate total upgrade duration
+
+### `upgradeoperator_controlplane_upgrade_started_timestamp`
+
+**Metric Type**: GaugeVec
+**Labels**: `_id` (cluster ID), `upgradeconfig_name`, `version`
+**Values**: Unix timestamp
+
+- **File**: `pkg/upgraders/controlplanestep.go:35`
+- **Method**: `UpdateMetricControlplaneUpgradeStartedTimestamp(clusterid, c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version, time.Now())`
+- **Trigger**: Control plane upgrade begins via `CommenceUpgrade()`
+- **Why**: Mark when CVO starts upgrading masters
+- **Purpose**: Calculate control plane upgrade duration
+
+### `upgradeoperator_controlplane_upgrade_completed_timestamp`
+
+**Metric Type**: GaugeVec
+**Labels**: `_id` (cluster ID), `upgradeconfig_name`, `version`
+**Values**: Unix timestamp
+
+- **File**: `pkg/upgraders/controlplanestep.go:61`
+- **Method**: `UpdateMetricControlplaneUpgradeCompletedTimestamp(clusterid, c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version, time.Now())`
+- **Trigger**: Control plane upgrade completes
+- **Why**: Mark when all masters and CVO are upgraded
+- **Purpose**: Calculate control plane upgrade duration
+
+### `upgradeoperator_workernode_upgrade_started_timestamp`
+
+**Metric Type**: GaugeVec
+**Labels**: `_id` (cluster ID), `upgradeconfig_name`, `version`
+**Values**: Unix timestamp
+
+- **File**: `pkg/upgraders/controlplanestep.go:62`
+- **Method**: `UpdateMetricWorkernodeUpgradeStartedTimestamp(clusterid, c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version, time.Now())`
+- **Trigger**: Immediately after control plane completes (same method call)
+- **Why**: Mark transition from control plane to worker upgrade phase
+- **Purpose**: Calculate worker upgrade duration
+
+### `upgradeoperator_workernode_upgrade_completed_timestamp`
+
+**Metric Type**: GaugeVec
+**Labels**: `_id` (cluster ID), `upgradeconfig_name`, `version`
+**Values**: Unix timestamp
+
+- **File**: `pkg/upgraders/workerstep.go:42`
+- **Method**: `UpdateMetricWorkernodeUpgradeCompletedTimestamp(clusterid, c.upgradeConfig.Name, c.upgradeConfig.Spec.Desired.Version, time.Now())`
+- **Trigger**: All worker nodes upgraded
+- **Why**: Mark completion of worker node upgrade phase
+- **Purpose**: Calculate worker upgrade duration
+
+**Analysis**: Calculate phase durations with Prometheus queries:
+```promql
+# Total upgrade duration
+upgradeoperator_upgrade_completed_timestamp - upgradeoperator_upgrade_started_timestamp
+
+# Control plane duration
+upgradeoperator_controlplane_upgrade_completed_timestamp - upgradeoperator_controlplane_upgrade_started_timestamp
+
+# Worker duration
+upgradeoperator_workernode_upgrade_completed_timestamp - upgradeoperator_workernode_upgrade_started_timestamp
+```
+
+---
+
+## Upgrade Result Metrics
+
+### `upgradeoperator_upgrade_result`
+
+**Metric Type**: GaugeVec (PERSISTENT - not reset between upgrades)
+**Labels**: `upgradeconfig_name`, `preceding_version`, `stream`, `version`, `alerts`
+**Values**: `1` = success (no alerts), `0` = failure (alerts fired)
+
+- **File**: `controllers/upgradeconfig/upgradeconfig_controller.go:326`
+- **Method**: `UpdateMetricUpgradeResult(name, precedingVersion, version, minorUpgrade, upgradeAlerts)`
+- **Trigger**: After upgrade completes, when recording final outcome
+- **Why**: Permanent record of upgrade result
+- **Data Captured**:
+  - **preceding_version**: Version before upgrade (e.g., "4.14.0")
+  - **version**: Target version (e.g., "4.15.0")
+  - **stream**: Upgrade type
+    - `"y"` = y-stream (minor version upgrade, e.g., 4.14 → 4.15)
+    - `"z"` = z-stream (patch upgrade, e.g., 4.15.1 → 4.15.2)
+  - **alerts**: Comma-separated list of paging alerts that fired during upgrade
+  - **value**:
+    - `1` = clean upgrade (no paging alerts)
+    - `0` = problematic upgrade (paging alerts fired)
+
+**Paging Alerts Tracked** (from `pkg/metrics/metrics.go:74-81`):
+- `UpgradeConfigValidationFailedSRE`
+- `UpgradeClusterCheckFailedSRE`
+- `UpgradeControlPlaneUpgradeTimeoutSRE`
+- `UpgradeNodeUpgradeTimeoutSRE`
+- `UpgradeNodeDrainFailedSRE`
+
+**Purpose**: Historical analysis of upgrade success rates by version and alert patterns
+
+---
+
+## Reset Operations
+
+### `ResetEphemeralMetrics()`
+
+**Method**: Clears ALL ephemeral metrics
+
+- **File**: `controllers/upgradeconfig/upgradeconfig_controller.go:85`
+- **Trigger**: When UpgradeConfig CR is deleted
+- **Why**: Clean up temporary metrics after upgrade completion
+- **Metrics Reset** (16 total):
+  - `metricValidationFailed`
+  - `metricScalingFailed`
+  - `metricUpgradeWindowBreached`
+  - `metricUpgradeControlPlaneTimeout`
+  - `metricHealthcheckFailed`
+  - `metricUpgradeWorkerTimeout`
+  - `metricNodeDrainFailed`
+  - `metricUpgradeNotification`
+  - `metricUpgradeConfigSyncTimestamp`
+  - `metricUpgradeNotificationFailed`
+  - `upgradeStartedTimestamp`
+  - `upgradeCompletedTimestamp`
+  - `controlplaneUpgradeStartedTimestamp`
+  - `controlplaneUpgradeCompletedTimestamp`
+  - `workernodeUpgradeStartedTimestamp`
+  - `workernodeUpgradeCompletedTimestamp`
+
+**Not Reset**: `metricUpgradeResult` (persistent metric)
+
+### `ResetFailureMetrics()`
+
+**Method**: Clears failure-indicating metrics before retry
+
+- **File**: `pkg/upgraders/osdupgrader.go:173`
+- **Trigger**: When starting a new upgrade attempt
+- **Why**: Clear previous failure indicators to allow fresh attempt
+- **Metrics Reset** (9 total):
+  - `metricValidationFailed`
+  - `metricScalingFailed`
+  - `metricUpgradeControlPlaneTimeout`
+  - `metricHealthcheckFailed`
+  - `metricUpgradeWorkerTimeout`
+  - `metricNodeDrainFailed`
+  - `metricUpgradeNotification`
+  - `metricUpgradeNotificationFailed`
+  - `upgradeStartedTimestamp`
+
+**Context**: Called when upgrade window is breached, preparing for retry
+
+---
+
+## E2E Test Coverage
+
+### Covered Metrics (1/17)
+
+✅ **`upgradeoperator_upgradeconfig_validation_failed`**
+- **Test**: `test/e2e/managed_upgrade_operator_tests.go:138-156`
+- **Test Case**: "should raise prometheus metric if start time is invalid"
+- **Coverage**:
+  - Creates UpgradeConfig with invalid `upgradeAt` value
+  - Polls Prometheus to verify metric appears
+  - Validates metric value equals 1
+
+### NOT Covered by E2E Tests (16/17)
+
+The following metrics lack e2e test coverage:
+
+**Scaling Metrics**
+- ❌ `upgradeoperator_scaling_failed`
+
+**Upgrade Window Metrics**
+- ❌ `upgradeoperator_upgrade_window_breached`
+
+**Timeout Metrics**
+- ❌ `upgradeoperator_controlplane_timeout`
+- ❌ `upgradeoperator_worker_timeout`
+
+**Health Check Metrics**
+- ❌ `upgradeoperator_healthcheck_failed`
+
+**Node Drain Metrics**
+- ❌ `upgradeoperator_node_drain_timeout`
+
+**Notification Metrics**
+- ❌ `upgrade_notification_failed`
+- ❌ `upgradeoperator_upgrade_notification`
+
+**Timestamp Metrics**
+- ❌ `upgradeoperator_upgradeconfig_sync_timestamp`
+- ❌ `upgradeoperator_upgrade_started_timestamp`
+- ❌ `upgradeoperator_upgrade_completed_timestamp`
+- ❌ `upgradeoperator_controlplane_upgrade_started_timestamp`
+- ❌ `upgradeoperator_controlplane_upgrade_completed_timestamp`
+- ❌ `upgradeoperator_workernode_upgrade_started_timestamp`
+- ❌ `upgradeoperator_workernode_upgrade_completed_timestamp`
+
+**Result Metrics**
+- ❌ `upgradeoperator_upgrade_result`
+
+**Note**: Full e2e coverage is challenging because many metrics require actual cluster upgrade execution or failure injection, which is time-consuming and resource-intensive for automated testing.
+
+---
+
+## Quick Reference
+
+### Metric Name to File Mapping
+
+| Metric | Primary Trigger Location |
+|--------|-------------------------|
+| `upgradeconfig_validation_failed` | `controllers/upgradeconfig/upgradeconfig_controller.go:209` |
+| `scaling_failed` | `pkg/upgraders/scalerstep.go:51` |
+| `upgrade_window_breached` | `pkg/upgraders/osdupgrader.go:170` |
+| `controlplane_timeout` | `pkg/upgraders/controlplanestep.go:80` |
+| `healthcheck_failed` | `pkg/upgraders/healthcheckstep.go` (multiple) |
+| `worker_timeout` | `pkg/upgraders/workerstep.go:28` |
+| `node_drain_timeout` | `controllers/nodekeeper/nodekeeper_controller.go:161` |
+| `upgrade_notification_failed` | `pkg/eventmanager/eventmanager.go:157` |
+| `upgrade_notification` | `pkg/eventmanager/eventmanager.go:161` |
+| `upgradeconfig_sync_timestamp` | `pkg/upgradeconfigmanager/upgradeconfigmanager.go:168` |
+| `upgrade_started_timestamp` | `pkg/upgraders/notifierstep.go:32` |
+| `upgrade_completed_timestamp` | `pkg/upgraders/notifierstep.go:45` |
+| `controlplane_upgrade_started_timestamp` | `pkg/upgraders/controlplanestep.go:35` |
+| `controlplane_upgrade_completed_timestamp` | `pkg/upgraders/controlplanestep.go:61` |
+| `workernode_upgrade_started_timestamp` | `pkg/upgraders/controlplanestep.go:62` |
+| `workernode_upgrade_completed_timestamp` | `pkg/upgraders/workerstep.go:42` |
+| `upgrade_result` | `controllers/upgradeconfig/upgradeconfig_controller.go:326` |
+
+### Metrics by Upgrade Phase
+
+**Pre-Upgrade**
+- `upgradeconfig_validation_failed`
+- `upgradeconfig_sync_timestamp`
+- `scaling_failed`
+- `healthcheck_failed` (pre-upgrade checks)
+
+**Control Plane Upgrade**
+- `upgrade_started_timestamp`
+- `controlplane_upgrade_started_timestamp`
+- `upgrade_window_breached`
+- `controlplane_timeout`
+- `controlplane_upgrade_completed_timestamp`
+- `workernode_upgrade_started_timestamp`
+
+**Worker Upgrade**
+- `worker_timeout`
+- `node_drain_timeout`
+- `workernode_upgrade_completed_timestamp`
+- `healthcheck_failed` (post-upgrade checks)
+
+**Post-Upgrade**
+- `upgrade_completed_timestamp`
+- `upgrade_result`
+- Scaling down extra nodes
+
+**Throughout**
+- `upgrade_notification`
+- `upgrade_notification_failed`
+
+---
+
+## Related Documentation
+
+- [Metrics Reference](metrics.md) - List of all exposed metrics
+- [Alerts Reference](https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-managed-upgrade-operator.PrometheusRule.yaml) - Alert definitions
+- [Development Guide](development.md) - Setting up local development environment
+- [Testing Guide](testing.md) - Running unit and e2e tests
+
+---
diff --git a/docs/metrics.md b/docs/metrics.md
index e3e1ceec..2d7d974c 100644
--- a/docs/metrics.md
+++ b/docs/metrics.md
@@ -125,3 +125,69 @@ managed_upgrade_condition_workers_upgraded_timestamp
 managed_upgrade_condition_post_upgrade_healthcheck_timestamp
 ```
 
+
+## Adding New Metrics
+
+All changes are in `pkg/metrics/metrics.go` unless noted.
+
+### 1. Define the metric
+
+```go
+// Counter (use _total suffix)
+var metricExample = prometheus.NewCounterVec(prometheus.CounterOpts{
+	Subsystem: metricsTag,
+	Name: "example_total",
+	Help: "Description",
+}, []string{nameLabel})
+
+// Histogram (use _seconds suffix for durations)
+var metricExampleDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+	Subsystem: metricsTag,
+	Name: "example_duration_seconds",
+	Help: "Description",
+	Buckets: prometheus.DefBuckets,
+}, []string{nameLabel})
+
+// Gauge
+var metricExampleGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
+	Subsystem: metricsTag,
+	Name: "example_value",
+	Help: "Description",
+}, []string{nameLabel})
+```
+
+### 2. Register in init()
+
+```go
+metrics.Registry.MustRegister(metricExample)
+```
+
+For gauges that reset between upgrades, add to `ephemeralMetrics` slice instead.
+
+### 3. Add to Metrics interface
+
+```go
+UpdateMetricExample(upgradeConfigName string)
+```
+
+### 4. Implement on Counter struct
+
+```go
+func (c *Counter) UpdateMetricExample(upgradeConfigName string) {
+	metricExample.With(prometheus.Labels{nameLabel: upgradeConfigName}).Inc()
+}
+```
+
+### 5. Call from code
+
+```go
+metricsClient.UpdateMetricExample(instance.Name)
+```
+
+### 6. Regenerate mocks and verify
+
+```bash
+make generate
+make run-standard-routes
+curl -s http://localhost:8383/metrics | grep example
+```