254 changes: 254 additions & 0 deletions docs/alert-rule-classification.md
@@ -0,0 +1,254 @@
# Alert Rule Classification - Design and Usage

## Overview
The backend classifies Prometheus alerting rules into a "component" and an "impact layer". It:
- Computes an `openshift_io_alert_rule_id` per alerting rule.
- Determines component/layer based on matcher logic and rule labels.
- Allows operator-managed classification overrides via AlertRelabelConfigs (ARCs) for platform
rules; overrides of user-defined workload rules additionally require the `ENABLE_USER_WORKLOAD_ARCS` feature flag.
- Enriches the Alerts API response with `openshift_io_alert_rule_id`, `openshift_io_alert_component`, and `openshift_io_alert_layer`.

This document explains how it works, how to override, and how to test it.


## Terminology
- openshift_io_alert_rule_id: Identifier for an alerting rule. Computed from a canonicalized view of the rule definition and encoded as `rid_` followed by the unpadded base64url encoding of sha256(payload). Independent of the `PrometheusRule` name.
- component: Logical owner of the alert (e.g., `kube-apiserver`, `etcd`, a namespace, etc.).
- layer: Impact scope. Allowed values:
- `cluster`
- `namespace`

Notes:
- **Stability**:
- The id is **always derived from the rule spec**. If the rule definition changes (expr/for/business labels/name), the id may change.
- For **platform rules**, this API currently only supports label updates via `AlertRelabelConfig` (not editing expr/for), so the id is effectively stable unless the upstream operator changes the rule definition.
- For **user-defined rules**, the API stamps the computed id into the `PrometheusRule` rule labels. If you update the rule definition, the API returns the **new** id and migrates any existing classification override to the new id.
- Layer values are validated as `cluster|namespace` when set. To remove an override, set the field to `null` via the API; empty/invalid values are ignored at read time.

## Rule ID computation (openshift_io_alert_rule_id)
Location: `pkg/alert_rule/alert_rule.go`

The backend computes a specHash-like value from:
- `kind`/`name`: the rule kind and name, serialized as `alert:<name>` for alerting rules or `record:<name>` for recording rules
- `expr`: trimmed with consecutive whitespace collapsed
- `for`: trimmed (duration string as written in the rule)
- `labels`: only non-system labels
- excludes labels with `openshift_io_` prefix and the `alertname` label
- drops empty values
- keeps only valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`)
- sorted by key and joined as `key=value` lines

Annotations are intentionally ignored to reduce id churn on documentation-only changes.
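The canonicalization and hashing steps above can be sketched as follows. This is an illustrative simplification, not the actual code in `pkg/alert_rule/alert_rule.go`; the exact payload serialization (field order, separators) is an assumption here.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var labelNameRe = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// computeRuleID sketches the id derivation described above: canonicalize the
// rule spec, hash it with sha256, and encode as rid_ + unpadded base64url.
func computeRuleID(kind, name, expr, forDur string, labels map[string]string) string {
	// Collapse consecutive whitespace in the expression.
	canonExpr := strings.Join(strings.Fields(expr), " ")

	// Keep only non-system labels: drop openshift_io_* and alertname,
	// drop empty values, keep only valid Prometheus label names.
	var lines []string
	for k, v := range labels {
		if k == "alertname" || strings.HasPrefix(k, "openshift_io_") {
			continue
		}
		if v == "" || !labelNameRe.MatchString(k) {
			continue
		}
		lines = append(lines, k+"="+v)
	}
	sort.Strings(lines) // sorted by key, joined as key=value lines

	payload := kind + ":" + name + "\n" + canonExpr + "\n" +
		strings.TrimSpace(forDur) + "\n" + strings.Join(lines, "\n")
	sum := sha256.Sum256([]byte(payload))
	return "rid_" + base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	id := computeRuleID("alert", "EtcdDown", "up{job=\"etcd\"}  ==  0", "5m",
		map[string]string{"severity": "critical", "openshift_io_alert_source": "platform"})
	fmt.Println(id)
}
```

Because annotations and system labels are excluded, documentation-only edits leave the id unchanged, while any change to `expr`, `for`, or business labels produces a new id.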

## Classification Logic (How component/layer are determined)
Location: `pkg/alertcomponent/matcher.go`

1) The code adapts matchers from `cluster-health-analyzer`:
   - CVO-related alerts (update/upgrade) → component/layer based on known patterns
   - Compute / node-related alerts
   - Core control plane components (mapped to layer `cluster`)
   - Workload/namespace-level alerts (mapped to layer `namespace`)

2) Fallback:
- If the computed component is empty or "Others", we set:
- `component = other`
- `layer` derived from source:
- `openshift_io_alert_source=platform` → `cluster`
- `openshift_io_prometheus_rule_namespace=openshift-monitoring` → `cluster`
- `prometheus` label starting with `openshift-monitoring/` → `cluster`
- otherwise → `namespace`

3) Result:
- Each alerting rule is assigned a `(component, layer)` tuple following the above logic.

## Developer Overrides via Rule Labels (Recommended)
If you want explicit component/layer values and do not want to rely on the matcher, set
these labels on each rule in your `PrometheusRule`:
- `openshift_io_alert_rule_component`
- `openshift_io_alert_rule_layer`

Both are validated the same way as API overrides:
- `component`: 1-253 chars, alphanumeric + `._-`, must start/end alphanumeric
- `layer`: `cluster` or `namespace`

When these labels are present and valid, they override matcher-derived values.
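The validation rules above can be expressed compactly. This is a sketch of the documented constraints, not the backend's actual validator.

```go
package main

import (
	"fmt"
	"regexp"
)

// validComponent mirrors the documented component rules: 1-253 characters,
// alphanumerics plus '.', '_', '-', starting and ending with an alphanumeric.
var validComponent = regexp.MustCompile(`^[a-zA-Z0-9]([a-zA-Z0-9._-]{0,251}[a-zA-Z0-9])?$`)

func validateClassification(component, layer string) error {
	if !validComponent.MatchString(component) {
		return fmt.Errorf("invalid component %q", component)
	}
	if layer != "cluster" && layer != "namespace" {
		return fmt.Errorf("invalid layer %q (must be cluster or namespace)", layer)
	}
	return nil
}

func main() {
	fmt.Println(validateClassification("kube-apiserver", "cluster")) // <nil>
	fmt.Println(validateClassification("-bad", "cluster"))           // error: leading dash
}
```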

## Classification Override Storage

Location: `pkg/management/update_classification.go`, `pkg/management/get_alerts.go`

Classification overrides are stored differently depending on the rule type:

### Platform rules → AlertRelabelConfig (ARC)

For operator-managed platform rules (rules whose `PrometheusRule` is registered as a
platform resource), overrides are stored in an `AlertRelabelConfig` (ARC) CR in the
`openshift-monitoring` namespace.

- **ARC naming**: `arc-<sanitized-pr-name>-<short-hash-of-rule-id>`
(generated by `k8s.GetAlertRelabelConfigName`)
- **ARC namespace**: `openshift-monitoring`
- **Shared ARC**: classification labels are written into the same ARC that the platform
alert management path uses for other label changes (severity, Drop/Restore). This avoids
creating separate CRs per concern.
- **Labels on the ARC**:
- `monitoring.openshift.io/prometheus-rule-name`: name of the source `PrometheusRule`
- `monitoring.openshift.io/alert-name`: alert name
- **Annotation on the ARC**:
- `monitoring.openshift.io/alert-rule-id`: the `openshift_io_alert_rule_id`

The ARC contains `RelabelConfig` entries that:
1. Match the rule by its original labels (alert name + all non-namespace labels) and
stamp `openshift_io_alert_rule_id` via a `Replace` action.
2. Apply each classification label as a `Replace` action keyed on `openshift_io_alert_rule_id`.

When all overrides are removed, the ARC is deleted.
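The resulting ARC has roughly the following shape. This is a hypothetical example: the resource name, rule id, alert name, and label values are placeholders, and the `spec.configs` fields follow the `monitoring.openshift.io/v1` relabel-config schema as described above.

```yaml
# Illustrative only; names, hashes, and ids are placeholders.
apiVersion: monitoring.openshift.io/v1
kind: AlertRelabelConfig
metadata:
  name: arc-my-rules-abc123
  namespace: openshift-monitoring
  labels:
    monitoring.openshift.io/prometheus-rule-name: my-rules
    monitoring.openshift.io/alert-name: MyAlert
  annotations:
    monitoring.openshift.io/alert-rule-id: rid_example
spec:
  configs:
    # 1) Match the rule by its original labels and stamp the rule id.
    - sourceLabels: [alertname, severity]
      regex: "MyAlert;critical"
      targetLabel: openshift_io_alert_rule_id
      replacement: rid_example
      action: Replace
    # 2) Apply a classification label keyed on the rule id.
    - sourceLabels: [openshift_io_alert_rule_id]
      regex: rid_example
      targetLabel: openshift_io_alert_rule_component
      replacement: team-x
      action: Replace
```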

**AlertingRule CR distinction:** Some platform alerts are defined via `AlertingRule` CRs,
which the cluster-monitoring-operator reconciles into `PrometheusRule` resources. When
the owning `AlertingRule` CR is operator-managed (has operator owner references), the
backend cannot modify it directly (the operator would reconcile the change back). In
this case, label updates are applied through an ARC instead. When the `AlertingRule` CR
is not externally managed, label updates are written directly into the CR. Classification
overrides always use the ARC path regardless of the `AlertingRule` management status.

### User-defined workload rules → blocked by default, ARC when enabled

Classification updates for operator-managed user-defined workload rules are **not
allowed by default**. The API returns a `NotAllowedError` when the feature flag is
disabled.

### Feature flag: `ENABLE_USER_WORKLOAD_ARCS`

Setting the environment variable `ENABLE_USER_WORKLOAD_ARCS=true` enables full
alert management for operator-managed user-defined workload rules, including
classification overrides, label updates, and rule disable/enable (Drop/Restore).
When enabled, these rules use the same ARC-based path as platform rules, with
ARCs stored in the `openshift-user-workload-monitoring` namespace.

### Dynamic classification (`_from` labels)

Two special labels allow deriving component/layer dynamically from the alert itself
at query time:
- `openshift_io_alert_rule_component_from`: name of an alert label whose value
becomes the component (e.g., `"name"` → use the alert's `name` label).
- `openshift_io_alert_rule_layer_from`: same pattern for layer.

These `_from` labels are stored in the ARC alongside static classification labels.
At read time, `ApplyDynamicClassification` resolves them against the alert's labels.

### Read path

The read path is unified regardless of storage mechanism:
1. The relabeled rules cache (`k8s.RelabeledRules().Get`) returns each rule with all
ARC relabel configs already applied. This means classification labels (whether set
via ARC or directly on the `PrometheusRule`) are available as rule labels.
2. `ApplyDynamicClassification` checks for `_from` labels on the relabeled rule and
resolves them against the alert's own labels to produce the final component/layer.

Notes:
- `_from` values must be valid Prometheus label names (`[a-zA-Z_][a-zA-Z0-9_]*`).
- If a `_from` label is present but the alert does not carry that label or the derived
value is invalid, the backend falls back to static values (if present) or defaults.
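The `_from` resolution with its static fallback can be sketched as follows. This is a simplified stand-in for `ApplyDynamicClassification`; the real implementation may differ in detail.

```go
package main

import (
	"fmt"
	"regexp"
)

var labelNameRe = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// applyDynamicClassification sketches the read path above: start from static
// override labels on the relabeled rule, then let valid _from labels redirect
// to a label carried by the alert itself.
func applyDynamicClassification(ruleLabels, alertLabels map[string]string) (component, layer string) {
	component = ruleLabels["openshift_io_alert_rule_component"]
	layer = ruleLabels["openshift_io_alert_rule_layer"]

	// A _from value must be a valid Prometheus label name; otherwise it is ignored.
	if from := ruleLabels["openshift_io_alert_rule_component_from"]; labelNameRe.MatchString(from) {
		if v, ok := alertLabels[from]; ok && v != "" {
			component = v // fall back to the static value when the alert lacks the label
		}
	}
	if from := ruleLabels["openshift_io_alert_rule_layer_from"]; labelNameRe.MatchString(from) {
		if v, ok := alertLabels[from]; ok && (v == "cluster" || v == "namespace") {
			layer = v // derived layer must still be a valid layer value
		}
	}
	return component, layer
}

func main() {
	c, l := applyDynamicClassification(
		map[string]string{"openshift_io_alert_rule_component_from": "name", "openshift_io_alert_rule_layer": "namespace"},
		map[string]string{"name": "my-operator"},
	)
	fmt.Println(c, l) // my-operator namespace
}
```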


## Alerts API Enrichment
Location: `pkg/management/get_alerts.go`, `pkg/k8s/prometheus_alerts.go`

- Endpoint: `GET /api/v1/alerting/alerts` (prom-compatible schema)
- The backend fetches active alerts and enriches each alert with:
- `openshift_io_alert_rule_id`
- `openshift_io_alert_component`
- `openshift_io_alert_layer`
- `prometheusRuleName`: name of the PrometheusRule resource the alert originates from
- `prometheusRuleNamespace`: namespace of that PrometheusRule resource
- `alertingRuleName`: name of the AlertingRule CR that generated the PrometheusRule (empty when the PrometheusRule is not owned by an AlertingRule CR)
- Prometheus compatibility:
- Base response matches Prometheus `/api/v1/alerts`.
- Additional fields are additive and safe for clients like Perses.

## Prometheus/Thanos Sources
Location: `pkg/k8s/prometheus_alerts.go`

- Order of candidates:
1) Thanos Route `thanos-querier` at `/api` + `/v1/alerts` (oauth-proxied)
2) In-cluster Thanos service `https://thanos-querier.openshift-monitoring.svc:9091/api/v1/alerts`
3) In-cluster Prometheus `https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts`
4) In-cluster Prometheus (plain HTTP) `http://prometheus-k8s.openshift-monitoring.svc:9090/api/v1/alerts` (fallback)
5) Prometheus Route `prometheus-k8s` at `/api/v1/alerts`

- TLS and Auth:
- Bearer token: service account token from in-cluster config.
- CA trust: system pool + `SSL_CERT_FILE` + `/var/run/configmaps/service-ca/service-ca.crt`.

RBAC:
- Read routes in `openshift-monitoring`.
- Access `prometheuses/api` as needed for oauth-proxied endpoints.

## Updating Rules Classification
APIs:
- Single update:
- Method: `PATCH /api/v1/alerting/rules/{ruleId}`
- Request body:
```json
{
"classification": {
"openshift_io_alert_rule_component": "team-x",
"openshift_io_alert_rule_layer": "namespace",
"openshift_io_alert_rule_component_from": "name",
"openshift_io_alert_rule_layer_from": "layer"
}
}
```
- `openshift_io_alert_rule_layer`: `cluster` or `namespace`
- To remove a classification override, set the field to `null` (e.g. `"openshift_io_alert_rule_layer": null`).
- Response:
- 200 OK with a status payload (same format as other rule PATCH responses), where `status_code` is 204 on success.
- Standard error body on failure (400 validation, 404 not found, etc.)
- Bulk update:
- Method: `PATCH /api/v1/alerting/rules`
- Request body:
```json
{
"ruleIds": ["<id-a>", "<id-b>"],
"classification": {
"openshift_io_alert_rule_component": "etcd",
"openshift_io_alert_rule_layer": "cluster"
}
}
```
- Response:
- 200 OK with per-rule results (same format as other bulk rule PATCH responses). Clients should handle partial failures.

Direct K8s (supported for power users/GitOps):
- For platform rules: create or update the `AlertRelabelConfig` CR in `openshift-monitoring`
with the appropriate relabel configs (respect `resourceVersion` for optimistic concurrency).
- For user-defined rules (requires `ENABLE_USER_WORKLOAD_ARCS=true`): create or update the
`AlertRelabelConfig` CR in `openshift-user-workload-monitoring`.
- UI should check update permissions with SelfSubjectAccessReview before showing an editor.

Notes:
- These endpoints are intended for updating **classification only** (component/layer overrides),
with permissions enforced based on the rule's ownership (platform, user workload, operator-managed,
GitOps-managed).
- To update other rule fields (expr/labels/annotations/etc.), use the same `PATCH /api/v1/alerting/rules/{ruleId}`
  endpoint with the corresponding request-body fields. Clients that need to update both classification
  and other fields should issue two requests; the combined operation is not atomic.

## Security Notes
- Classification overrides are stored in AlertRelabelConfig CRs (`openshift-monitoring`
for platform rules, `openshift-user-workload-monitoring` for user-defined rules when
enabled), subject to standard Kubernetes RBAC.
- No secrets or sensitive data are persisted in classification metadata.

## Testing and Ops
Unit tests:
- `pkg/management/update_classification_test.go`
- ARC-based classification for platform rules, blocked-by-default for user-defined
rules, ARC in user-workload namespace when flag enabled, dynamic `_from` label resolution.
- `pkg/management/get_alerts_test.go`
- Alert enrichment with classification labels, `_from` label behavior, fallback behavior.

## Future Work
- Optional composite update API if we need to update rule fields and classification atomically.
- De-duplication/merge logic when aggregating alerts across sources.
66 changes: 66 additions & 0 deletions internal/managementrouter/alert_rule_classification_patch.go
@@ -0,0 +1,66 @@
package managementrouter

import "encoding/json"

// AlertRuleClassificationPatch represents a partial update ("patch") payload for
// alert rule classification labels.
//
// This type supports a three-state contract per field:
// - omitted: leave unchanged
// - null: clear the override
// - string: set the override
//
// Note: Go's encoding/json cannot represent "explicit null" vs "omitted" using **string
// (both decode to nil), so we custom-unmarshal and track key presence with *Set flags.
type AlertRuleClassificationPatch struct {
Component *string `json:"openshift_io_alert_rule_component,omitempty"`
ComponentSet bool `json:"-"`
Layer *string `json:"openshift_io_alert_rule_layer,omitempty"`
LayerSet bool `json:"-"`
ComponentFrom *string `json:"openshift_io_alert_rule_component_from,omitempty"`
ComponentFromSet bool `json:"-"`
LayerFrom *string `json:"openshift_io_alert_rule_layer_from,omitempty"`
LayerFromSet bool `json:"-"`
}

func (p *AlertRuleClassificationPatch) UnmarshalJSON(b []byte) error {
var m map[string]json.RawMessage
if err := json.Unmarshal(b, &m); err != nil {
return err
}

decodeNullableString := func(key string) (set bool, v *string, err error) {
raw, ok := m[key]
if !ok {
return false, nil, nil
}
set = true
if len(raw) == 0 || string(raw) == "null" {
return true, nil, nil
}
var s string
if err := json.Unmarshal(raw, &s); err != nil {
return true, nil, err
}
return true, &s, nil
}

var err error
p.ComponentSet, p.Component, err = decodeNullableString("openshift_io_alert_rule_component")
if err != nil {
return err
}
p.LayerSet, p.Layer, err = decodeNullableString("openshift_io_alert_rule_layer")
if err != nil {
return err
}
p.ComponentFromSet, p.ComponentFrom, err = decodeNullableString("openshift_io_alert_rule_component_from")
if err != nil {
return err
}
p.LayerFromSet, p.LayerFrom, err = decodeNullableString("openshift_io_alert_rule_layer_from")
if err != nil {
return err
}
return nil
}
40 changes: 40 additions & 0 deletions internal/managementrouter/alert_rule_classification_patch_test.go
@@ -0,0 +1,40 @@
package managementrouter_test

import (
"encoding/json"

. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"

"github.com/openshift/monitoring-plugin/internal/managementrouter"
)

var _ = Describe("AlertRuleClassificationPatch", func() {
Context("when field is omitted", func() {
It("does not mark it as set", func() {
var p managementrouter.AlertRuleClassificationPatch
Expect(json.Unmarshal([]byte(`{}`), &p)).To(Succeed())
Expect(p.ComponentSet).To(BeFalse())
Expect(p.Component).To(BeNil())
})
})

Context("when field is explicitly null", func() {
It("marks it as set and clears the value", func() {
var p managementrouter.AlertRuleClassificationPatch
Expect(json.Unmarshal([]byte(`{"openshift_io_alert_rule_component":null}`), &p)).To(Succeed())
Expect(p.ComponentSet).To(BeTrue())
Expect(p.Component).To(BeNil())
})
})

Context("when field is a string", func() {
It("marks it as set and provides the value", func() {
var p managementrouter.AlertRuleClassificationPatch
Expect(json.Unmarshal([]byte(`{"openshift_io_alert_rule_component":"team-x"}`), &p)).To(Succeed())
Expect(p.ComponentSet).To(BeTrue())
Expect(p.Component).NotTo(BeNil())
Expect(*p.Component).To(Equal("team-x"))
})
})
})
8 changes: 8 additions & 0 deletions internal/managementrouter/health_get_test.go
@@ -142,3 +142,11 @@ func (s *healthStubManagementClient) GetAlertingHealth(ctx context.Context) (k8s
}
return k8s.AlertingHealth{}, nil
}

func (s *healthStubManagementClient) UpdateAlertRuleClassification(ctx context.Context, req management.UpdateRuleClassificationRequest) error {
return nil
}

func (s *healthStubManagementClient) BulkUpdateAlertRuleClassification(ctx context.Context, items []management.UpdateRuleClassificationRequest) []error {
return nil
}
8 changes: 8 additions & 0 deletions internal/managementrouter/rules_get_test.go
@@ -174,3 +174,11 @@ func (s *stubManagementClient) GetAlertingHealth(ctx context.Context) (k8s.Alert
}
return k8s.AlertingHealth{}, nil
}

func (s *stubManagementClient) UpdateAlertRuleClassification(ctx context.Context, req management.UpdateRuleClassificationRequest) error {
return nil
}

func (s *stubManagementClient) BulkUpdateAlertRuleClassification(ctx context.Context, items []management.UpdateRuleClassificationRequest) []error {
return nil
}