Commit eeb78a9

fix(thanos) compact halted playbook (#1067)

* Expand Thanos compactor troubleshooting playbook
* Update steps to find compactor secret + output of thanos tools command

1 parent 5b33a66 commit eeb78a9

File tree

4 files changed: +168 -4 lines changed


thanos/charts/Chart.yaml

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@ maintainers:
 name: thanos
 sources:
 - https://github.com/cloudoperators/greenhouse-extensions
-version: 0.5.32
+version: 0.5.33
 keywords:
 - thanos
 - storage

thanos/charts/alerts/compactor.yaml

Lines changed: 1 addition & 1 deletion

@@ -19,7 +19,7 @@ groups:
 for: 5m
 labels:
 severity: info
-playbook: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthalted
+playbook: https://github.com/cloudoperators/greenhouse-extensions/thanos/playbooks/ThanosCompactHalted.md
 {{ tpl .Values.thanos.serviceMonitor.alertLabels . | nindent 6 }}
 - alert: ThanosCompactHighCompactionFailures
 annotations:
Lines changed: 164 additions & 0 deletions

@@ -0,0 +1,164 @@

# Thanos Compactor Troubleshooting Playbook

## Table of Contents

1. [Thanos Compactor Halted Due to Overlapping Blocks](#thanos-compactor-halted-due-to-overlapping-blocks)
2. [Thanos component has disappeared](#thanos-component-has-disappeared)
3. [Chunk critical error](#chunk-critical-error)

---
## Thanos Compactor Halted Due to Overlapping Blocks

### Problem

The Thanos compactor has halted with an error similar to:

```
critical error detected; halting" err="compaction: ... pre compaction overlap check: overlaps found while gathering blocks. ...
```

Example:

```
ts=2025-07-25T11:34:26.357181007Z caller=compact.go:559 level=error msg="critical error detected; halting" err="compaction: group 0@3105571489545500179: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1742680710957, maxt: 1742680800000, range: 1m29s, blocks: 2]: <ulid: 01JQ01BAXKEBJSXT0QJ3PAQ63V, mint: 1742673600011, maxt: 1742680800000, range: 1h59m59s>, <ulid: 01JQ3P5BSQK656HMXMW7DREQ3D, mint: 1742680710957, maxt: 1742680800000, range: 1m29s>\n[mint: 1742680800517, maxt: 1742688000000, range: 1h59m59s, blocks: 2]: <ulid: 01JQ08725PXK4D6XD0SB1WYT2E, mint: 1742680800020, maxt: 1742688000000, range: 1h59m59s>, <ulid: 01JQ3P5EH930NGCFHTT9Z87TEY, mint: 1742680800517, maxt: 1742688000000, range: 1h59m59s>"
```

This is caused by overlapping blocks in your object storage.

---
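The overlapping block ULIDs can be pulled straight out of the halt message before you start inspecting the bucket. A minimal sketch, assuming the error text has been captured into a shell variable; the extraction pipeline is our own convenience, not official Thanos tooling:

```shell
# Extract the block ULIDs named in a compactor halt message.
# $log holds a shortened version of the example error above.
log='err="compaction: group 0@3105571489545500179: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1742680710957, maxt: 1742680800000, range: 1m29s, blocks: 2]: <ulid: 01JQ01BAXKEBJSXT0QJ3PAQ63V, mint: 1742673600011, maxt: 1742680800000, range: 1h59m59s>, <ulid: 01JQ3P5BSQK656HMXMW7DREQ3D, mint: 1742680710957, maxt: 1742680800000, range: 1m29s>"'

# ULIDs are 26 characters of Crockford base32 (no I, L, O or U).
echo "$log" | grep -oE 'ulid: [0-9A-HJKMNP-TV-Z]{26}' | awk '{print $2}' | sort -u
# prints the two overlapping ULIDs, one per line
```

The ULIDs printed this way are exactly the candidates you then look up with `thanos tools bucket inspect` in step 1.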
### Resolution Steps

#### 1. Identify Overlapping Blocks

- Enter the compactor container or a pod with Thanos CLI access.
- Run:

  ```bash
  thanos tools bucket inspect --objstore.config-file=<your-objstore.yaml> | grep <block_id>
  ```

- You will see output similar to the following, showing details for each block ULID:

| ULID | MIN TIME | MAX TIME | DURATION | AGE | NUM SAMPLES | NUM SERIES | NUM CHUNKS | LEVEL | COMPACTED | LABELS | DELETION | SOURCE |
|------|----------|----------|----------|-----|-------------|------------|------------|-------|-----------|--------|----------|--------|
| 01JWTG9RS2PVHHFZKFCCBZH0RV | 2025-06-03T06:00:00Z | 2025-06-03T08:00:00Z | 1h59m59.95s | 38h0m0.05s | 193,216 | 44,122,225 | 379,866 | 1 | false | cluster=cluster-us,cluster_type=observability,organization=ccloud,prometheus=kube-monitoring/kube-monitoring,prometheus_replica=prometheus-kube-monitoring-0,region=us | 0s | sidecar |
| 01JWZ95H85J8SZEN09Z3W8MQPP | 2025-06-03T08:00:00Z | 2025-06-05T00:00:00Z | 40h0m0s | 0s | 411,006 | 903,753,062 | 7,661,746 | 3 | false | cluster=cluster-us,cluster_type=observability,prometheus=kube-monitoring/kube-monitoring,prometheus_replica=prometheus-kube-monitoring-0,region=us | 0s | compactor |

This table helps you identify block time ranges, sizes, and sources when troubleshooting overlaps.

- Look for blocks with overlapping `mint` and `maxt` time ranges, and note the ULIDs of the offending blocks.
- Use `| grep <block_id>` to compare block sizes and timestamps against the error log.

---
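When the bucket holds many blocks, it can help to reduce the inspect output to just the ULID and time-range columns before hunting for overlaps. A minimal sketch; `inspect.txt` stands in for saved `thanos tools bucket inspect` output, and the `awk` filter is our own convenience, not part of the Thanos CLI:

```shell
# inspect.txt mimics two data rows of the inspect table shown above.
cat > inspect.txt <<'EOF'
| 01JWTG9RS2PVHHFZKFCCBZH0RV | 2025-06-03T06:00:00Z | 2025-06-03T08:00:00Z | 1h59m59.95s | sidecar |
| 01JWZ95H85J8SZEN09Z3W8MQPP | 2025-06-03T08:00:00Z | 2025-06-05T00:00:00Z | 40h0m0s | compactor |
EOF

# Keep only ULID, MIN TIME and MAX TIME so overlapping ranges stand out.
awk -F'|' '/^\|/ { gsub(/ /, ""); print $2, $3, $4 }' inspect.txt
```

Sorting the result by MIN TIME makes any block whose range starts before the previous one ends easy to spot by eye.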
#### 2. Decide Which Blocks to Delete

- Prefer deleting the block with the **shorter time range** or the one that appears incomplete (smaller size).
- Example from the logs above:
  - `01JQ3P5BSQK656HMXMW7DREQ3D`
  - `01JQ3P5EH930NGCFHTT9Z87TEY`

---
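Whether two blocks overlap, and which one covers the shorter range, can be checked directly from the `mint`/`maxt` millisecond timestamps in the error message. A minimal sketch using the two values from the example log above; this is plain POSIX shell arithmetic, nothing Thanos-specific:

```shell
# mint/maxt are Unix-millisecond timestamps copied from the example error log.
mint_a=1742673600011; maxt_a=1742680800000   # ulid 01JQ01BAXKEBJSXT0QJ3PAQ63V
mint_b=1742680710957; maxt_b=1742680800000   # ulid 01JQ3P5BSQK656HMXMW7DREQ3D

# Range in seconds; the much shorter block is usually the incomplete one
# and therefore the safer deletion candidate.
range_a=$(( (maxt_a - mint_a) / 1000 ))
range_b=$(( (maxt_b - mint_b) / 1000 ))
echo "block A spans ${range_a}s, block B spans ${range_b}s"

# Two half-open intervals [mint, maxt) overlap iff each starts before the other ends.
if [ "$mint_a" -lt "$maxt_b" ] && [ "$mint_b" -lt "$maxt_a" ]; then
  echo "blocks overlap"
fi
```

Here block B spans 89 seconds against block A's roughly two hours, matching the `1m29s` and `1h59m59s` ranges reported in the log, which is why the 89-second block is the deletion candidate in the example.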
#### 3. Remove Overlapping Blocks

- Verify which object store bucket is used by your Thanos instance.
- Navigate to your object store UI.
- Search for and delete the block IDs causing the overlap.
- **Warning:** This permanently deletes the specified blocks. Always back up your data before proceeding.

---

#### 4. Restart the Compactor

- Restart the Thanos compactor deployment/pod to resume normal operation.

---
### Additional Notes

- Overlapping blocks are often caused by a misconfigured Prometheus, clock skew, or manual uploads.
- If you are unsure which block to delete, consult your team or inspect the block metadata.
- For more details, see the [Thanos Compactor documentation](https://thanos.io/tip/components/compact.md).

---
## Thanos component has disappeared

### Problem

The Thanos compactor job (thanos-...-compactor), responsible for shrinking the store data, is no longer running.
The error itself is harmless and has no production impact, but it should be fixed to avoid unnecessary growth of the Swift store.

### Solution

1. Check the logs of the Thanos compactor in question:

   ```bash
   kubectl logs --follow $podName
   ```

   Usually you will see a critical error like this:

   ```
   level=error ts=2023-04-01T21:44:13.805210208Z caller=compact.go:488 msg="critical error detected; halting" err="compaction: group 0@16113311641286135401: compact blocks [/data/compact/0@16113311641286135401/01GWY7K16T83Y35W654CWB8W15 /data/compact/0@16113311641286135401/01GWYEER9Q3FSY4ST5M2J32SJ0 /data/compact/0@16113311641286135401/01GWYNAFHNAZY9SV5JTB3W1C07 /data/compact/0@16113311641286135401/01GWYW66T6WDGQD3G7HDJ6GGCT]: 2 errors: populate block: context canceled; context canceled"
   ```

   ```
   level=info ts=2023-04-27T05:43:18.591882979Z caller=http.go:103 service=http/server component=compact msg="internal server is shutdown gracefully" err="could not sync metas: filter metas: filter blocks marked for deletion: get file: 01GYH4WNQKAXY1Y4RN1K5J1R8Q/deletion-mark.json: open object: Timeout when reading or writing data"
   level=info ts=2023-04-27T05:43:18.591951424Z caller=intrumentation.go:81 msg="changing probe status" status=not-healthy reason="could not sync metas: filter metas: filter blocks marked for deletion: get file: 01GYH4WNQKAXY1Y4RN1K5J1R8Q/deletion-mark.json: open object: Timeout when reading or writing data"
   level=error ts=2023-04-27T05:43:19.742219509Z caller=compact.go:488 msg="critical error detected; halting" err="compaction: group 0@2754565673212689501: compact blocks [/data/compact/0@2754565673212689501/01GYZEZNMR42383F8BXZHE91A0 /data/compact/0@2754565673212689501/01GYZNVCWRNMVJZYY7MZD4SJQM /data/compact/0@2754565673212689501/01GYZWQ44S29Y97MV0CKBEVKRA /data/compact/0@2754565673212689501/01GZ03JVCRRQ9E6HAQ9333P9SY]: 2 errors: populate block: context canceled; context canceled"
   ```

2. Kick (delete) the pod so it is recreated.

3. Check the logs again as in step 1.
## Chunk critical error

### Problem

A faulty block prevents Thanos from continuing to compact. The block needs to be identified and removed from your object storage (Swift, S3, Ceph, ...).

### Solution

#### 1. Check the logs of the Thanos compactor in question

```bash
kubectl logs --follow $podName
```

Usually you will see a critical error like this:

```
level=error ts=2023-11-09T12:23:35.120966651Z caller=compact.go:487 msg="critical error detected; halting" err="compaction: group 0@7529044506654606473: compact blocks [/data/compact/0@7529044506654606473/01HE8PG6C507J89N60T3SKQN3S /data/compact/0@7529044506654606473/01HE8XBXM3AP4WDCS25DKJ6W4J /data/compact/0@7529044506654606473/01HE947MW4JBDY3648JSVYKPAW /data/compact/0@7529044506654606473/01HE9B3C499RNJ4SM5V8EKM0ZG]: populate block: chunk iter: cannot populate chunk 8 from block 01HE9B3C499RNJ4SM5V8EKM0ZG: segment index 0 out of range"
```

If you can't see it, kick the pod and watch the logs immediately to catch the initial error; it may be buried under newer messages by the time you look.
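The faulty block can be read straight off the `cannot populate chunk ... from block <ULID>` part of the message. A minimal sketch, assuming the error text has been captured into a shell variable; the pipeline is our own convenience, not official tooling:

```shell
# $log holds a shortened version of the example error above.
log='err="compaction: group 0@7529044506654606473: populate block: chunk iter: cannot populate chunk 8 from block 01HE9B3C499RNJ4SM5V8EKM0ZG: segment index 0 out of range"'

# Print the ULID of the block that fails to populate.
echo "$log" | grep -oE 'from block [0-9A-HJKMNP-TV-Z]{26}' | awk '{print $3}'
# prints 01HE9B3C499RNJ4SM5V8EKM0ZG
```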
#### 2. Find the Secret Used by Your Thanos Compactor

- Check the `Deployment` or `StatefulSet` manifest of the Thanos compactor to find the name of the Secret containing the object storage configuration (often referenced via `--objstore.config` or `--objstore.config-file`).
- Example command to find the Secret reference:

  ```bash
  kubectl get deployment -n <namespace> -o yaml | grep objstore
  ```

- Once you have the Secret name, retrieve its contents:

  ```bash
  kubectl get secret <secret-name> -n <namespace> -o yaml
  ```

- Review the Secret to identify the object storage endpoint, bucket, and credentials. This tells you which object storage instance you need to access to delete the problematic blocks.
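The values under `data:` in the Secret come back base64-encoded, so decode them before reading. A minimal sketch; the key name `objstore.yml` and the sample value are illustrative assumptions, not taken from the actual chart:

```shell
# In a real cluster the encoded value would come from something like:
#   kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.objstore\.yml}'
# Here a hardcoded sample value stands in for it.
encoded='dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogdGhhbm9z'

echo "$encoded" | base64 -d
# prints:
# type: S3
# config:
#   bucket: thanos
```

The decoded YAML names the storage type and bucket, which is exactly what you need for step 3.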
#### 3. Remove Faulty Blocks

- Verify which object store bucket is used by your Thanos instance.
- Navigate to your object store UI.
- Search for and delete the block IDs causing the faults.
- The faulty block's folder is usually empty anyway; it is also safe to delete if it contains no `chunks` folder.
- **Warning:** This permanently deletes the specified blocks. Always back up your data before proceeding.

#### 4. Kick the pod

```bash
kubectl delete pod thanos-kubernetes-compactor-....
```

thanos/plugindefinition.yaml

Lines changed: 2 additions & 2 deletions

@@ -6,12 +6,12 @@ kind: PluginDefinition
 metadata:
 name: thanos
 spec:
-version: 0.5.36
+version: 0.5.37
 description: thanos
 helmChart:
 name: thanos
 repository: "oci://ghcr.io/cloudoperators/greenhouse-extensions/charts"
-version: 0.5.32
+version: 0.5.33
 options:
 - default: null
 description: CLI param for Thanos Query
