254 changes: 254 additions & 0 deletions docs/alert-rule-classification.md
@@ -0,0 +1,254 @@
# Alert Rule Classification - Design and Usage

## Overview
The backend classifies Prometheus alerting rules into a "component" and an "impact layer". It:
- Computes an `openshift_io_alert_rule_id` per alerting rule.
- Determines component/layer based on matcher logic and rule labels.
- Allows operator-managed classification overrides via AlertRelabelConfigs (ARCs) for platform
rules; overrides of user-defined workload rules additionally require the `ENABLE_USER_WORKLOAD_ARCS` feature flag.
- Enriches the Alerts API response with `openshift_io_alert_rule_id`, `openshift_io_alert_component`, and `openshift_io_alert_layer`.

This document explains how it works, how to override, and how to test it.


## Terminology
- openshift_io_alert_rule_id: Identifier for an alerting rule. Computed from a canonicalized view of the rule definition and encoded as `rid_` followed by the unpadded base64url encoding of sha256(payload). Independent of the `PrometheusRule` name.
- component: Logical owner of the alert (e.g., `kube-apiserver`, `etcd`, a namespace, etc.).
- layer: Impact scope. Allowed values:
- `cluster`
- `namespace`

Notes:
- **Stability**:
- The id is **always derived from the rule spec**. If the rule definition changes (expr/for/business labels/name), the id may change.
- For **platform rules**, this API currently only supports label updates via `AlertRelabelConfig` (not editing expr/for), so the id is effectively stable unless the upstream operator changes the rule definition.
- For **user-defined rules**, the API stamps the computed id into the `PrometheusRule` rule labels. If you update the rule definition, the API returns the **new** id and migrates any existing classification override to the new id.
- Layer values are validated as `cluster|namespace` when set. To remove an override, set the field to `null` via the API; empty/invalid values are ignored at read time.

## Rule ID computation (openshift_io_alert_rule_id)
Location: `pkg/alert_rule/alert_rule.go`

The backend computes a specHash-like value from:
- `kind`/`name`: the rule kind and name, serialized as `alert:<name>` for alerting rules or `record:<name>` for recording rules
- `expr`: trimmed with consecutive whitespace collapsed
- `for`: trimmed (duration string as written in the rule)
- `labels`: only non-system labels
- excludes labels with `openshift_io_` prefix and the `alertname` label
- drops empty values
- keeps only valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`)
- sorted by key and joined as `key=value` lines

Annotations are intentionally ignored to reduce id churn on documentation-only changes.
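The canonicalization and hashing steps above can be sketched as follows. This is an illustrative simplification, not the actual code in `pkg/alert_rule/alert_rule.go`; the exact payload serialization (field order, separators) is an assumption here.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var labelNameRe = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// computeRuleID sketches the id derivation described above: canonicalize the
// rule spec, hash it with sha256, and encode as rid_ + unpadded base64url.
func computeRuleID(kind, name, expr, forDur string, labels map[string]string) string {
	// Collapse consecutive whitespace in the expression.
	canonExpr := strings.Join(strings.Fields(expr), " ")

	// Keep only non-system labels: drop openshift_io_* and alertname,
	// drop empty values, keep only valid Prometheus label names.
	var lines []string
	for k, v := range labels {
		if k == "alertname" || strings.HasPrefix(k, "openshift_io_") {
			continue
		}
		if v == "" || !labelNameRe.MatchString(k) {
			continue
		}
		lines = append(lines, k+"="+v)
	}
	sort.Strings(lines) // sorted by key, joined as key=value lines

	payload := kind + ":" + name + "\n" + canonExpr + "\n" +
		strings.TrimSpace(forDur) + "\n" + strings.Join(lines, "\n")
	sum := sha256.Sum256([]byte(payload))
	return "rid_" + base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	id := computeRuleID("alert", "EtcdDown", "up{job=\"etcd\"}  ==  0", "5m",
		map[string]string{"severity": "critical", "openshift_io_alert_source": "platform"})
	fmt.Println(id)
}
```

Because annotations and system labels are excluded, documentation-only edits leave the id unchanged, while any change to `expr`, `for`, or business labels produces a new id.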

## Classification Logic (How component/layer are determined)
Location: `pkg/alertcomponent/matcher.go`

1) The code adapts matchers from `cluster-health-analyzer`:
   - CVO-related alerts (update/upgrade) → component/layer based on known patterns
   - Compute / node-related alerts
   - Core control plane components (mapped to layer `cluster`)
   - Workload/namespace-level alerts (mapped to layer `namespace`)

2) Fallback:
- If the computed component is empty or "Others", we set:
- `component = other`
- `layer` derived from source:
- `openshift_io_alert_source=platform` → `cluster`
- `openshift_io_prometheus_rule_namespace=openshift-monitoring` → `cluster`
- `prometheus` label starting with `openshift-monitoring/` → `cluster`
- otherwise → `namespace`

3) Result:
- Each alerting rule is assigned a `(component, layer)` tuple following the above logic.

## Developer Overrides via Rule Labels (Recommended)
If you want explicit component/layer values and do not want to rely on the matcher, set
these labels on each rule in your `PrometheusRule`:
- `openshift_io_alert_rule_component`
- `openshift_io_alert_rule_layer`

Both are validated the same way as API overrides:
- `component`: 1-253 chars, alphanumeric + `._-`, must start/end alphanumeric
- `layer`: `cluster` or `namespace`

When these labels are present and valid, they override matcher-derived values.
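The validation rules above can be expressed compactly. This is a sketch of the documented constraints, not the backend's actual validator.

```go
package main

import (
	"fmt"
	"regexp"
)

// validComponent mirrors the documented component rules: 1-253 characters,
// alphanumerics plus '.', '_', '-', starting and ending with an alphanumeric.
var validComponent = regexp.MustCompile(`^[a-zA-Z0-9]([a-zA-Z0-9._-]{0,251}[a-zA-Z0-9])?$`)

func validateClassification(component, layer string) error {
	if !validComponent.MatchString(component) {
		return fmt.Errorf("invalid component %q", component)
	}
	if layer != "cluster" && layer != "namespace" {
		return fmt.Errorf("invalid layer %q (must be cluster or namespace)", layer)
	}
	return nil
}

func main() {
	fmt.Println(validateClassification("kube-apiserver", "cluster")) // <nil>
	fmt.Println(validateClassification("-bad", "cluster"))           // error: leading dash
}
```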

## Classification Override Storage

Location: `pkg/management/update_classification.go`, `pkg/management/get_alerts.go`

Classification overrides are stored differently depending on the rule type:

### Platform rules → AlertRelabelConfig (ARC)

For operator-managed platform rules (rules whose `PrometheusRule` is registered as a
platform resource), overrides are stored in an `AlertRelabelConfig` (ARC) CR in the
`openshift-monitoring` namespace.

- **ARC naming**: `arc-<sanitized-pr-name>-<short-hash-of-rule-id>`
(generated by `k8s.GetAlertRelabelConfigName`)
- **ARC namespace**: `openshift-monitoring`
- **Shared ARC**: classification labels are written into the same ARC that the platform
alert management path uses for other label changes (severity, Drop/Restore). This avoids
creating separate CRs per concern.
- **Labels on the ARC**:
- `monitoring.openshift.io/prometheus-rule-name`: name of the source `PrometheusRule`
- `monitoring.openshift.io/alert-name`: alert name
- **Annotation on the ARC**:
- `monitoring.openshift.io/alert-rule-id`: the `openshift_io_alert_rule_id`

The ARC contains `RelabelConfig` entries that:
1. Match the rule by its original labels (alert name + all non-namespace labels) and
stamp `openshift_io_alert_rule_id` via a `Replace` action.
2. Apply each classification label as a `Replace` action keyed on `openshift_io_alert_rule_id`.

When all overrides are removed, the ARC is deleted.
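The resulting ARC has roughly the following shape. This is a hypothetical example: the resource name, rule id, alert name, and label values are placeholders, and the `spec.configs` fields follow the `monitoring.openshift.io/v1` relabel-config schema as described above.

```yaml
# Illustrative only; names, hashes, and ids are placeholders.
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: arc-my-rules-abc123
  namespace: openshift-monitoring
  labels:
    monitoring.openshift.io/prometheus-rule-name: my-rules
    monitoring.openshift.io/alert-name: MyAlert
  annotations:
    monitoring.openshift.io/alert-rule-id: rid_example
spec:
  configs:
    # 1) Match the rule by its original labels and stamp the rule id.
    - sourceLabels: [alertname, severity]
      regex: "MyAlert;critical"
      targetLabel: openshift_io_alert_rule_id
      replacement: rid_example
      action: Replace
    # 2) Apply a classification label keyed on the rule id.
    - sourceLabels: [openshift_io_alert_rule_id]
      regex: rid_example
      targetLabel: openshift_io_alert_rule_component
      replacement: team-x
      action: Replace
```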

**AlertingRule CR distinction:** Some platform alerts are defined via `AlertingRule` CRs,
which the cluster-monitoring-operator reconciles into `PrometheusRule` resources. When
the owning `AlertingRule` CR is operator-managed (has operator owner references), the
backend cannot modify it directly (the operator would reconcile the change back). In
this case, label updates are applied through an ARC instead. When the `AlertingRule` CR
is not externally managed, label updates are written directly into the CR. Classification
overrides always use the ARC path regardless of the `AlertingRule` management status.

### User-defined workload rules → blocked by default, ARC when enabled

Classification updates for operator-managed user-defined workload rules are **not
allowed by default**. The API returns a `NotAllowedError` when the feature flag is
disabled.

### Feature flag: `ENABLE_USER_WORKLOAD_ARCS`

Setting the environment variable `ENABLE_USER_WORKLOAD_ARCS=true` enables full
alert management for operator-managed user-defined workload rules, including
classification overrides, label updates, and rule disable/enable (Drop/Restore).
When enabled, these rules use the same ARC-based path as platform rules, with
ARCs stored in the `openshift-user-workload-monitoring` namespace.

### Dynamic classification (`_from` labels)

Two special labels allow deriving component/layer dynamically from the alert itself
at query time:
- `openshift_io_alert_rule_component_from`: name of an alert label whose value
becomes the component (e.g., `"name"` → use the alert's `name` label).
- `openshift_io_alert_rule_layer_from`: same pattern for layer.

These `_from` labels are stored in the ARC alongside static classification labels.
At read time, `ApplyDynamicClassification` resolves them against the alert's labels.

### Read path

The read path is unified regardless of storage mechanism:
1. The relabeled rules cache (`k8s.RelabeledRules().Get`) returns each rule with all
ARC relabel configs already applied. This means classification labels (whether set
via ARC or directly on the `PrometheusRule`) are available as rule labels.
2. `ApplyDynamicClassification` checks for `_from` labels on the relabeled rule and
resolves them against the alert's own labels to produce the final component/layer.

Notes:
- `_from` values must be valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`).
- If a `_from` label is present but the alert does not carry that label or the derived
value is invalid, the backend falls back to static values (if present) or defaults.
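The `_from` resolution with its static fallback can be sketched as follows. This is a simplified stand-in for `ApplyDynamicClassification`; the real implementation may differ in detail.

```go
package main

import (
	"fmt"
	"regexp"
)

var labelNameRe = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// applyDynamicClassification sketches the read path above: start from static
// override labels on the relabeled rule, then let valid _from labels redirect
// to a label carried by the alert itself.
func applyDynamicClassification(ruleLabels, alertLabels map[string]string) (component, layer string) {
	component = ruleLabels["openshift_io_alert_rule_component"]
	layer = ruleLabels["openshift_io_alert_rule_layer"]

	// A _from value must be a valid Prometheus label name; otherwise it is ignored.
	if from := ruleLabels["openshift_io_alert_rule_component_from"]; labelNameRe.MatchString(from) {
		if v, ok := alertLabels[from]; ok && v != "" {
			component = v // fall back to the static value when the alert lacks the label
		}
	}
	if from := ruleLabels["openshift_io_alert_rule_layer_from"]; labelNameRe.MatchString(from) {
		if v, ok := alertLabels[from]; ok && (v == "cluster" || v == "namespace") {
			layer = v // derived layer must still be a valid layer value
		}
	}
	return component, layer
}

func main() {
	c, l := applyDynamicClassification(
		map[string]string{"openshift_io_alert_rule_component_from": "name", "openshift_io_alert_rule_layer": "namespace"},
		map[string]string{"name": "my-operator"},
	)
	fmt.Println(c, l) // my-operator namespace
}
```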


## Alerts API Enrichment
Location: `pkg/management/get_alerts.go`, `pkg/k8s/prometheus_alerts.go`

- Endpoint: `GET /api/v1/alerting/alerts` (prom-compatible schema)
- The backend fetches active alerts and enriches each alert with:
- `openshift_io_alert_rule_id`
- `openshift_io_alert_component`
- `openshift_io_alert_layer`
- `prometheusRuleName`: name of the PrometheusRule resource the alert originates from
- `prometheusRuleNamespace`: namespace of that PrometheusRule resource
- `alertingRuleName`: name of the AlertingRule CR that generated the PrometheusRule (empty when the PrometheusRule is not owned by an AlertingRule CR)
- Prometheus compatibility:
- Base response matches Prometheus `/api/v1/alerts`.
- Additional fields are additive and safe for clients like Perses.

## Prometheus/Thanos Sources
Location: `pkg/k8s/prometheus_alerts.go`

- Order of candidates:
1) Thanos Route `thanos-querier` at `/api` + `/v1/alerts` (oauth-proxied)
2) In-cluster Thanos service `https://thanos-querier.openshift-monitoring.svc:9091/api/v1/alerts`
3) In-cluster Prometheus `https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts`
4) In-cluster Prometheus (plain HTTP) `http://prometheus-k8s.openshift-monitoring.svc:9090/api/v1/alerts` (fallback)
5) Prometheus Route `prometheus-k8s` at `/api/v1/alerts`

- TLS and Auth:
- Bearer token: service account token from in-cluster config.
- CA trust: system pool + `SSL_CERT_FILE` + `/var/run/configmaps/service-ca/service-ca.crt`.

RBAC:
- Read routes in `openshift-monitoring`.
- Access `prometheuses/api` as needed for oauth-proxied endpoints.

## Updating Rules Classification
APIs:
- Single update:
- Method: `PATCH /api/v1/alerting/rules/{ruleId}`
- Request body:
```json
{
"classification": {
"openshift_io_alert_rule_component": "team-x",
"openshift_io_alert_rule_layer": "namespace",
"openshift_io_alert_rule_component_from": "name",
"openshift_io_alert_rule_layer_from": "layer"
}
}
```
- `openshift_io_alert_rule_layer`: `cluster` or `namespace`
- To remove a classification override, set the field to `null` (e.g. `"openshift_io_alert_rule_layer": null`).
- Response:
- 200 OK with a status payload (same format as other rule PATCH responses), where `status_code` is 204 on success.
- Standard error body on failure (400 validation, 404 not found, etc.)
- Bulk update:
- Method: `PATCH /api/v1/alerting/rules`
- Request body:
```json
{
"ruleIds": ["<id-a>", "<id-b>"],
"classification": {
"openshift_io_alert_rule_component": "etcd",
"openshift_io_alert_rule_layer": "cluster"
}
}
```
- Response:
- 200 OK with per-rule results (same format as other bulk rule PATCH responses). Clients should handle partial failures.

Direct K8s (supported for power users/GitOps):
- For platform rules: create or update the `AlertRelabelConfig` CR in `openshift-monitoring`
with the appropriate relabel configs (respect `resourceVersion` for optimistic concurrency).
- For user-defined rules (requires `ENABLE_USER_WORKLOAD_ARCS=true`): create or update the
`AlertRelabelConfig` CR in `openshift-user-workload-monitoring`.
- UI should check update permissions with SelfSubjectAccessReview before showing an editor.

Notes:
- These endpoints are intended for updating **classification only** (component/layer overrides),
with permissions enforced based on the rule's ownership (platform, user workload, operator-managed,
GitOps-managed).
- To update other rule fields (expr/labels/annotations/etc.), use the same `PATCH /api/v1/alerting/rules/{ruleId}`
  endpoint with the corresponding request-body fields. Clients that need to update both classification
  and other fields should issue two requests; the combined operation is not atomic.

## Security Notes
- Classification overrides are stored in AlertRelabelConfig CRs (`openshift-monitoring`
for platform rules, `openshift-user-workload-monitoring` for user-defined rules when
enabled), subject to standard Kubernetes RBAC.
- No secrets or sensitive data are persisted in classification metadata.

## Testing and Ops
Unit tests:
- `pkg/management/update_classification_test.go`
- ARC-based classification for platform rules, blocked-by-default for user-defined
rules, ARC in user-workload namespace when flag enabled, dynamic `_from` label resolution.
- `pkg/management/get_alerts_test.go`
- Alert enrichment with classification labels, `_from` label behavior, fallback behavior.

## Future Work
- Optional composite update API if we need to update rule fields and classification atomically.
- De-duplication/merge logic when aggregating alerts across sources.
66 changes: 66 additions & 0 deletions internal/managementrouter/alert_rule_classification_patch.go
@@ -0,0 +1,66 @@
package managementrouter

import "encoding/json"

// AlertRuleClassificationPatch represents a partial update ("patch") payload for
// alert rule classification labels.
//
// This type supports a three-state contract per field:
// - omitted: leave unchanged
// - null: clear the override
// - string: set the override
//
// Note: Go's encoding/json cannot represent "explicit null" vs "omitted" using **string
// (both decode to nil), so we custom-unmarshal and track key presence with *Set flags.
type AlertRuleClassificationPatch struct {
Component *string `json:"openshift_io_alert_rule_component,omitempty"`
ComponentSet bool `json:"-"`
Layer *string `json:"openshift_io_alert_rule_layer,omitempty"`
LayerSet bool `json:"-"`
ComponentFrom *string `json:"openshift_io_alert_rule_component_from,omitempty"`
ComponentFromSet bool `json:"-"`
LayerFrom *string `json:"openshift_io_alert_rule_layer_from,omitempty"`
LayerFromSet bool `json:"-"`
}

func (p *AlertRuleClassificationPatch) UnmarshalJSON(b []byte) error {
var m map[string]json.RawMessage
if err := json.Unmarshal(b, &m); err != nil {
return err
}

decodeNullableString := func(key string) (set bool, v *string, err error) {
raw, ok := m[key]
if !ok {
return false, nil, nil
}
set = true
if len(raw) == 0 || string(raw) == "null" {
return true, nil, nil
}
var s string
if err := json.Unmarshal(raw, &s); err != nil {
return true, nil, err
}
return true, &s, nil
}

var err error
p.ComponentSet, p.Component, err = decodeNullableString("openshift_io_alert_rule_component")
if err != nil {
return err
}
p.LayerSet, p.Layer, err = decodeNullableString("openshift_io_alert_rule_layer")
if err != nil {
return err
}
p.ComponentFromSet, p.ComponentFrom, err = decodeNullableString("openshift_io_alert_rule_component_from")
if err != nil {
return err
}
p.LayerFromSet, p.LayerFrom, err = decodeNullableString("openshift_io_alert_rule_layer_from")
if err != nil {
return err
}
return nil
}
40 changes: 40 additions & 0 deletions internal/managementrouter/alert_rule_classification_patch_test.go
@@ -0,0 +1,40 @@
package managementrouter_test

import (
"encoding/json"

. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"

"github.com/openshift/monitoring-plugin/internal/managementrouter"
)

var _ = Describe("AlertRuleClassificationPatch", func() {
Context("when field is omitted", func() {
It("does not mark it as set", func() {
var p managementrouter.AlertRuleClassificationPatch
Expect(json.Unmarshal([]byte(`{}`), &p)).To(Succeed())
Expect(p.ComponentSet).To(BeFalse())
Expect(p.Component).To(BeNil())
})
})

Context("when field is explicitly null", func() {
It("marks it as set and clears the value", func() {
var p managementrouter.AlertRuleClassificationPatch
Expect(json.Unmarshal([]byte(`{"openshift_io_alert_rule_component":null}`), &p)).To(Succeed())
Expect(p.ComponentSet).To(BeTrue())
Expect(p.Component).To(BeNil())
})
})

Context("when field is a string", func() {
It("marks it as set and provides the value", func() {
var p managementrouter.AlertRuleClassificationPatch
Expect(json.Unmarshal([]byte(`{"openshift_io_alert_rule_component":"team-x"}`), &p)).To(Succeed())
Expect(p.ComponentSet).To(BeTrue())
Expect(p.Component).NotTo(BeNil())
Expect(*p.Component).To(Equal("team-x"))
})
})
})
8 changes: 8 additions & 0 deletions internal/managementrouter/health_get_test.go
@@ -142,3 +142,11 @@ func (s *healthStubManagementClient) GetAlertingHealth(ctx context.Context) (k8s
}
return k8s.AlertingHealth{}, nil
}

func (s *healthStubManagementClient) UpdateAlertRuleClassification(ctx context.Context, req management.UpdateRuleClassificationRequest) error {
return nil
}

func (s *healthStubManagementClient) BulkUpdateAlertRuleClassification(ctx context.Context, items []management.UpdateRuleClassificationRequest) []error {
return nil
}
8 changes: 8 additions & 0 deletions internal/managementrouter/rules_get_test.go
@@ -174,3 +174,11 @@ func (s *stubManagementClient) GetAlertingHealth(ctx context.Context) (k8s.Alert
}
return k8s.AlertingHealth{}, nil
}

func (s *stubManagementClient) UpdateAlertRuleClassification(ctx context.Context, req management.UpdateRuleClassificationRequest) error {
return nil
}

func (s *stubManagementClient) BulkUpdateAlertRuleClassification(ctx context.Context, items []management.UpdateRuleClassificationRequest) []error {
return nil
}