2 changes: 1 addition & 1 deletion perses/charts/Chart.yaml
@@ -3,7 +3,7 @@ name: perses
description: A Helm chart for Perses
icon: https://avatars.githubusercontent.com/u/77209215?s=200&v=4
type: application
-version: 0.17.4
+version: 0.17.5
maintainers:
- name: richardtief
- name: ibakshay
5 changes: 1 addition & 4 deletions perses/charts/alerts/perses.yaml
@@ -12,7 +12,7 @@ groups:
for: 10m
labels:
severity: warning
-playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/playbook.md
+playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/PersesServiceDown.md
{{- include "perses.alertLabels" . | nindent 10 }}

- alert: PersesHighHttpErrorRate
@@ -28,7 +28,6 @@ groups:
for: 10m
labels:
severity: warning
-playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/playbook.md
{{- include "perses.alertLabels" . | nindent 10 }}

- alert: PersesPluginSchemaLoadFailures
@@ -40,7 +39,6 @@ groups:
for: 15m
labels:
severity: warning
-playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/playbook.md
{{- include "perses.alertLabels" . | nindent 10 }}

- alert: PersesHighFileDescriptorUsage
@@ -52,5 +50,4 @@ groups:
for: 15m
labels:
severity: info
-playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/playbook.md
{{- include "perses.alertLabels" . | nindent 10 }}
103 changes: 103 additions & 0 deletions perses/playbooks/PersesServiceDown.md
@@ -0,0 +1,103 @@
# Perses Service Down

## Problem
The Perses service is currently **offline**.

It is either completely stopped, crashing repeatedly, or running but refusing to respond to requests.

## Impact

* **Service Outage:** Users cannot access or view any Perses dashboards.

## Diagnosis
Follow these steps to determine whether the service has crashed, is hung, is running on a bad node, or has been turned off.

### 1. Check Pod Status
Identify the specific Perses instance referenced in the alert, then list its pods in the namespace given by the alert label `namespace`:

```bash
kubectl get pods -n <namespace> -l app.kubernetes.io/name=perses
```

**Analyze the Output:**

* **`CrashLoopBackOff`:** The application is starting but failing immediately. **Go to Step 3**.
* **`Running`:** The application appears healthy to Kubernetes. **Go to Step 2**.
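* **`Pending`:** The pod has not been scheduled, typically because the cluster is out of resources or the nodes are tainted. Check the scheduling events with `kubectl describe pod <pod-name> -n <namespace>`, then **go to Step 2**.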
* **No resources found:** The list is empty. **Go to Step 4**.
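
The routing above can be sketched as a small shell helper. This is a minimal sketch rather than part of the chart: `next_step` is a hypothetical function name, and it assumes the status string comes from the STATUS column of `kubectl get pods`. The `Error` case follows the wording of Step 3.

```bash
#!/usr/bin/env bash
# Map a pod STATUS value to the next playbook step.
# next_step is a hypothetical helper, not a kubectl feature.
next_step() {
  case "$1" in
    CrashLoopBackOff|Error) echo "Step 3: inspect application logs" ;;
    Running|Pending)        echo "Step 2: verify node health" ;;
    "")                     echo "Step 4: check whether the StatefulSet is scaled down" ;;
    *)                      echo "Unrecognized status '$1': inspect the pod manually" ;;
  esac
}

# Usage (placeholders as in the command above):
#   status=$(kubectl get pods -n <namespace> -l app.kubernetes.io/name=perses \
#              --no-headers | awk 'NR==1 {print $3}')
#   next_step "$status"
```

An empty status corresponds to the "No resources found" case, because `kubectl` then prints nothing to standard output.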

### 2. Verify Node Health

Sometimes a pod appears `Running`, but the underlying Node is disconnected (`NotReady`), causing network traffic to fail.

1. Find the Node where the pod is running:
   ```bash
   kubectl get pods -n <namespace> -l app.kubernetes.io/name=perses -o wide
   ```
2. Check the status of that Node:
   ```bash
   kubectl get node <node-name-from-previous-step>
   ```
   * **If Status is `NotReady`:** The node is down. **Go to Resolution C**.
   * **If Status is `Ready`:** The node is fine, but the process is hung. **Go to Resolution D**.
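
For scripted checks, the same decision can be read from the node's `Ready` condition instead of the table output. The sketch below is an assumption-laden illustration: `resolution_for_node` is a hypothetical helper, and the `jsonpath` queries in the usage comment are one possible way to fetch the values, not something this playbook prescribes.

```bash
#!/usr/bin/env bash
# Translate the node's Ready condition ("True", "False", "Unknown")
# into the matching resolution. Hypothetical helper for illustration.
resolution_for_node() {
  case "$1" in
    True)          echo "Resolution D: node is Ready, treat the process as hung" ;;
    False|Unknown) echo "Resolution C: node is NotReady" ;;
    *)             echo "Unexpected Ready condition '$1': inspect the node manually" ;;
  esac
}

# Usage (placeholders as in the steps above):
#   node=$(kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}')
#   ready=$(kubectl get node "$node" \
#             -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
#   resolution_for_node "$ready"
```

A `Ready` condition of `False` or `Unknown` is what `kubectl get node` displays as `NotReady`.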

### 3. Inspect Application Logs

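If the pod status is `CrashLoopBackOff` or `Error`, check the logs to find the root cause. If the container has already restarted, you may need to add `--previous` to see the output of the crashed instance. If the logs show a configuration syntax error or a panic, **go to Resolution B**.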

```bash
kubectl logs statefulset/<statefulset-name> -n <namespace> --all-containers
```
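
To narrow long log output to likely causes, a filter along these lines may help. It is only a sketch: `find_crash_hints` is a hypothetical helper, and the search patterns are assumptions that have not been checked against Perses' actual log format.

```bash
#!/usr/bin/env bash
# Print log lines on stdin that look like a panic or a configuration error.
# Returns non-zero when nothing matches. The patterns are illustrative guesses.
find_crash_hints() {
  grep -Ei 'panic|fatal|syntax error|cannot unmarshal'
}

# Usage (placeholders as in the command above):
#   kubectl logs statefulset/<statefulset-name> -n <namespace> \
#     --all-containers --tail=200 | find_crash_hints
```

No match does not prove the logs are clean, so read the full output if the filter returns nothing.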

### 4. Check if Service is Scaled Down

If **Step 1** returned "No resources found," check whether the StatefulSet was scaled to 0, either for maintenance or by accident.

```bash
kubectl get statefulset -n <namespace> -l app.kubernetes.io/name=perses
```

* **Result `READY 0/0`:** The service is stopped. **Go to Resolution A**.
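
The `READY` column has the form `<ready>/<desired>`, so the check can be automated. The helper below is a hypothetical sketch: only the `0/0` case is defined by this playbook, and the `not-ready` and `ready` labels are added for illustration.

```bash
#!/usr/bin/env bash
# Classify a StatefulSet READY value such as "0/0" or "1/1".
# statefulset_state is a hypothetical helper; only "scaled-down" maps to
# a step in this playbook (Resolution A).
statefulset_state() {
  case "$1" in
    [0-9]*/[0-9]*) ;;
    *) echo "unknown"; return 1 ;;
  esac
  ready="${1%%/*}"      # part before the slash
  desired="${1##*/}"    # part after the slash
  if [ "$desired" -eq 0 ]; then
    echo "scaled-down"
  elif [ "$ready" -lt "$desired" ]; then
    echo "not-ready"
  else
    echo "ready"
  fi
}

# Usage (placeholders as in the command above):
#   statefulset_state "$(kubectl get statefulset -n <namespace> \
#     -l app.kubernetes.io/name=perses --no-headers | awk 'NR==1 {print $2}')"
```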

## Resolution Steps

### Scenario A: Service Scaled to 0 (Stopped)

**Diagnosis:** StatefulSet shows `0/0` replicas.

1. **Check Context:** Verify whether this is planned maintenance.
   * *If Planned:* **Silence the alert** in Alertmanager.
   * *If Accidental:* Start the service:
     ```bash
     kubectl scale statefulset <statefulset-name> --replicas=1 -n <namespace>
     ```

### Scenario B: Configuration Error (CrashLoopBackOff)

**Diagnosis:** Logs show syntax errors or a panic.

1. Roll back the Helm release if a recent change caused the crash. Revision `0` rolls back to the previous release:
```bash
helm rollback <release-name> 0 -n <namespace>
```
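
Before rolling back, it can be useful to review the release history. The wrapper below is a sketch under assumptions: `rollback_previous` and the `HELM` override variable are hypothetical names, and whether `--wait` suits your environment is a judgment call.

```bash
#!/usr/bin/env bash
# Show recent revisions, then roll back to the previous one and wait for it.
# HELM can be overridden (for example HELM=echo) to preview the commands.
HELM="${HELM:-helm}"

rollback_previous() {
  # $1 = release name, $2 = namespace
  "$HELM" history "$1" -n "$2" --max 5 &&
  "$HELM" rollback "$1" 0 -n "$2" --wait
}

# Usage:
#   HELM=echo rollback_previous <release-name> <namespace>   # dry preview
#   rollback_previous <release-name> <namespace>
```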

### Scenario C: Node Failure

**Diagnosis:** The Node hosting the pod is `NotReady`.

1. Force delete the pod. Since the node is unresponsive, a standard delete might hang.
```bash
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
```
*Result: The StatefulSet controller recreates the pod, and the scheduler places it on a healthy node.*

### Scenario D: Hung Process (Unresponsive)

**Diagnosis:** Pod is `Running`, Node is `Ready`, but `up == 0`.

1. Force a restart to clear the application deadlock:
```bash
kubectl rollout restart statefulset <statefulset-name> -n <namespace>
```
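
To confirm that the restart actually completes, the restart can be followed by a rollout status check. This is a minimal sketch: `restart_and_wait`, the `KUBECTL` override variable, and the five-minute timeout are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Restart the StatefulSet, then block until the rollout finishes or times out.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

restart_and_wait() {
  # $1 = StatefulSet name, $2 = namespace
  "$KUBECTL" rollout restart statefulset "$1" -n "$2" &&
  "$KUBECTL" rollout status statefulset "$1" -n "$2" --timeout=5m
}

# Usage:
#   restart_and_wait <statefulset-name> <namespace>
```

If the status check times out, the pod is probably failing for another reason, so return to Step 1.
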
3 changes: 0 additions & 3 deletions perses/playbooks/playbook.md

This file was deleted.

4 changes: 2 additions & 2 deletions perses/plugindefinition.yaml
@@ -3,15 +3,15 @@ kind: PluginDefinition
metadata:
name: perses
spec:
-version: 0.10.3
+version: 0.10.4
displayName: Perses
description: "Perses is a dashboard tooling to visualize metrics and traces produced by observability tools such as Prometheus/Thanos/Jaeger"
docMarkDownUrl: https://raw.githubusercontent.com/cloudoperators/greenhouse-extensions/main/perses/README.md
icon: https://raw.githubusercontent.com/cloudoperators/greenhouse-extensions/main/perses/logo.png
helmChart:
name: perses
repository: oci://ghcr.io/cloudoperators/greenhouse-extensions/charts
-version: 0.17.4
+version: 0.17.5
options:
- description: "The image version of the Perses app. If not provided, the latest version will be used"
name: perses.image.version