Merged
2 changes: 1 addition & 1 deletion perses/charts/Chart.yaml
@@ -3,7 +3,7 @@ name: perses
description: A Helm chart for Perses
icon: https://avatars.githubusercontent.com/u/77209215?s=200&v=4
type: application
-version: 0.17.4
+version: 0.17.5
maintainers:
- name: richardtief
- name: ibakshay
5 changes: 1 addition & 4 deletions perses/charts/alerts/perses.yaml
@@ -12,7 +12,7 @@ groups:
for: 10m
labels:
severity: warning
-playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/playbook.md
+playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/PersesServiceDown.md
{{- include "perses.alertLabels" . | nindent 10 }}

- alert: PersesHighHttpErrorRate
@@ -28,7 +28,6 @@ groups:
for: 10m
labels:
severity: warning
-playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/playbook.md
{{- include "perses.alertLabels" . | nindent 10 }}

- alert: PersesPluginSchemaLoadFailures
@@ -40,7 +39,6 @@ groups:
for: 15m
labels:
severity: warning
-playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/playbook.md
{{- include "perses.alertLabels" . | nindent 10 }}

- alert: PersesHighFileDescriptorUsage
@@ -52,5 +50,4 @@ groups:
for: 15m
labels:
severity: info
-playbook: https://github.com/cloudoperators/greenhouse-extensions/tree/main/perses/playbooks/playbook.md
{{- include "perses.alertLabels" . | nindent 10 }}
103 changes: 103 additions & 0 deletions perses/playbooks/PersesServiceDown.md
@@ -0,0 +1,103 @@
# Perses Service Down

## Problem
The Perses service is currently **offline**.

It is either completely stopped, crashing repeatedly, or running but refusing to respond to requests.

## Impact

* **Service Outage:** Users cannot access or view any Perses dashboards.

## Diagnosis
Follow these steps to determine whether the service has crashed, is hung, is running on a bad node, or has been scaled down.

### 1. Check Pod Status
Identify the specific Perses instance referenced in the alert.

* **Check the specific namespace** (from the alert label `namespace`):

```bash
kubectl get pods -n <namespace> -l app.kubernetes.io/name=perses
```

**Analyze the Output:**

* **`CrashLoopBackOff`:** The application is starting but failing immediately. **Go to Step 3**.
* **`Running`:** The application appears healthy to Kubernetes. **Go to Step 2**.
* **`Pending`:** The pod cannot be scheduled, typically because the cluster is out of resources or the nodes are tainted. **Go to Step 2**.
* **No resources found:** The list is empty. **Go to Step 4**.
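
The status-to-step mapping above can be sketched as a small shell helper, for example as part of a triage script. This is a minimal sketch under stated assumptions: the function name `next_step_for_status` is hypothetical and not part of the chart or of `kubectl`, and it covers only the statuses named in this playbook (`Error` is routed to Step 3, as Step 3 itself describes).

```bash
# Hypothetical helper (not part of the Perses chart): maps the STATUS column
# of `kubectl get pods` to the playbook step listed above.
next_step_for_status() {
  case "$1" in
    CrashLoopBackOff|Error) echo "Step 3: Inspect Application Logs" ;;
    Running|Pending)        echo "Step 2: Verify Node Health" ;;
    "")                     echo "Step 4: Check if Service is Scaled Down" ;;
    *)                      echo "Unknown status: $1" ;;
  esac
}
```

One possible way to feed it is to pass the third column of the first pod row, assuming the default table output: `next_step_for_status "$(kubectl get pods -n <namespace> -l app.kubernetes.io/name=perses --no-headers 2>/dev/null | awk 'NR==1 {print $3}')"`. An empty result (no pods) maps to Step 4.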

### 2. Verify Node Health

Sometimes a pod appears `Running`, but the underlying Node is disconnected (`NotReady`), causing network traffic to fail.

1. Find the Node where the pod is running:
```bash
kubectl get pods -n <namespace> -l app.kubernetes.io/name=perses -o wide
```
2. Check the status of that Node:
```bash
kubectl get node <node-name-from-previous-step>
```
* **If Status is `NotReady`:** The node is down. **Go to Resolution C**.
* **If Status is `Ready`:** The node is fine, so the process is likely hung. **Go to Resolution D**.
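
The node check above can also be scripted. The sketch below assumes the default table output of `kubectl get node`, where the second column of the first data row is the node status; the function name `resolution_for_node` is hypothetical. Combined statuses such as `Ready,SchedulingDisabled` are matched by prefix.

```bash
# Hypothetical helper: reads `kubectl get node <node-name>` output on stdin
# and prints the resolution this playbook suggests for that node status.
resolution_for_node() {
  local status
  # Row 1 is the header; row 2, column 2 is the node STATUS.
  status=$(awk 'NR==2 {print $2}')
  case "$status" in
    NotReady*) echo "Resolution C" ;;
    Ready*)    echo "Resolution D" ;;
    *)         echo "Unknown node status: ${status:-none}" ;;
  esac
}
```

Usage would look like `kubectl get node <node-name-from-previous-step> | resolution_for_node`.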

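### 3. Inspect Application Logs

If the pod status is `CrashLoopBackOff` or `Error`, check the logs to find the root cause.

```bash
kubectl logs statefulset/<statefulset-name> -n <namespace> --all-containers
```

* **Logs show syntax errors or a panic:** The configuration is likely broken. **Go to Resolution B**.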

### 4. Check if Service is Scaled Down

If **Step 1** returned "No resources found," check whether the StatefulSet was scaled to 0, either for planned maintenance or by accident.

```bash
kubectl get statefulset -n <namespace> -l app.kubernetes.io/name=perses
```

* **Result `READY 0/0`:** The service is stopped. **Go to Resolution A**.
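
The `READY 0/0` check above can be automated with a short filter. This is a sketch that assumes the default `kubectl get statefulset` table layout (`NAME READY AGE`); the function name `scaled_to_zero` is hypothetical.

```bash
# Hypothetical helper: reads `kubectl get statefulset` output on stdin and
# exits 0 only if at least one StatefulSet is listed and all report READY 0/0.
scaled_to_zero() {
  awk '
    NR > 1 { seen = 1; if ($2 != "0/0") bad = 1 }   # skip the header row
    END    { if (seen && !bad) exit 0; exit 1 }
  '
}
```

For example, `kubectl get statefulset -n <namespace> -l app.kubernetes.io/name=perses | scaled_to_zero && echo "Go to Resolution A"`.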

## Resolution Steps

### Scenario A: Service Scaled to 0 (Stopped)

**Diagnosis:** StatefulSet shows `0/0` replicas.

1. **Check Context:** Verify whether this is planned maintenance.
* *If Planned:* **Silence the alert** in Alertmanager.
* *If Accidental:* Start the service:
```bash
kubectl scale statefulset <statefulset-name> --replicas=1 -n <namespace>
```

### Scenario B: Configuration Error (CrashLoopBackOff)

**Diagnosis:** Logs show syntax errors or a panic.

1. Roll back the Helm release if a recent change caused the crash:
```bash
# Revision 0 rolls back to the previous release.
helm rollback <release-name> 0 -n <namespace>
```

### Scenario C: Node Failure

**Diagnosis:** The Node hosting the pod is `NotReady`.

1. Force-delete the pod. Because the node is unresponsive, a standard delete may hang.
```bash
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
```
*Result: The StatefulSet controller recreates the pod, and the scheduler places it on a healthy node.*

### Scenario D: Hung Process (Unresponsive)

**Diagnosis:** Pod is `Running`, Node is `Ready`, but `up == 0`.

1. Force a restart to clear the application deadlock:
```bash
kubectl rollout restart statefulset <statefulset-name> -n <namespace>
```
3 changes: 0 additions & 3 deletions perses/playbooks/playbook.md

This file was deleted.