Commit fcaee84

Merge branch 'claude/dblab-prometheus-exporter-sM0EJ' into 'master'

feat: add Prometheus exporter for DBLab metrics

Closes #668
See merge request postgres-ai/database-lab!1087

2 parents (cb565b1 + 0956032), commit fcaee84

File tree

12 files changed: +2159 −34 lines

PROMETHEUS.md

Lines changed: 256 additions & 0 deletions
@@ -0,0 +1,256 @@
# Prometheus Metrics

DBLab Engine exposes Prometheus metrics via the `/metrics` endpoint. These metrics can be used to monitor the health and performance of the DBLab instance.

## Endpoint

```
GET /metrics
```

The endpoint is publicly accessible (no authentication required) and returns metrics in Prometheus text format.
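
To verify the endpoint manually, you can scrape it once with `curl`. A minimal check, assuming the DBLab API is reachable at `localhost:2345` (the port used in the collector examples later in this document); adjust the host and port to your deployment:

```bash
# Print the first few metric lines in Prometheus text format
curl -s http://localhost:2345/metrics | head -n 20
```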

## Available Metrics

### Instance Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_instance_info` | Gauge | `instance_id`, `version`, `edition` | Information about the DBLab instance (always 1) |
| `dblab_instance_uptime_seconds` | Gauge | - | Time in seconds since the DBLab instance started |
| `dblab_instance_status_code` | Gauge | - | Status code of the DBLab instance (0=OK, 1=Warning, 2=Bad) |
| `dblab_retrieval_status` | Gauge | `mode`, `status` | Status of data retrieval (the series whose `status` label matches the current state has value 1) |
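
Since `dblab_retrieval_status` encodes state in its labels, a common pattern is to select the series whose value is 1 to see the currently active retrieval mode and status:

```promql
dblab_retrieval_status == 1
```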

### Disk/Pool Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_disk_total_bytes` | Gauge | `pool` | Total disk space in bytes |
| `dblab_disk_free_bytes` | Gauge | `pool` | Free disk space in bytes |
| `dblab_disk_used_bytes` | Gauge | `pool` | Used disk space in bytes |
| `dblab_disk_used_by_snapshots_bytes` | Gauge | `pool` | Disk space used by snapshots in bytes |
| `dblab_disk_used_by_clones_bytes` | Gauge | `pool` | Disk space used by clones in bytes |
| `dblab_disk_data_size_bytes` | Gauge | `pool` | Size of the data directory in bytes |
| `dblab_disk_compress_ratio` | Gauge | `pool` | Compression ratio of the filesystem (ZFS) |
| `dblab_pool_status` | Gauge | `pool`, `mode`, `status` | Status of the pool (the series whose `status` label matches the current state has value 1) |
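
As an illustration of how these gauges combine, the share of each pool consumed by clones (matched on the `pool` label) can be computed as:

```promql
100 * dblab_disk_used_by_clones_bytes / dblab_disk_total_bytes
```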

### Clone Metrics (Aggregate)

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_clones_total` | Gauge | - | Total number of clones |
| `dblab_clones_by_status` | Gauge | `status` | Number of clones by status |
| `dblab_clone_max_age_seconds` | Gauge | - | Maximum age of any clone in seconds |
| `dblab_clone_total_diff_size_bytes` | Gauge | - | Total extra disk space used by all clones (sum of diffs from snapshots) |
| `dblab_clone_total_logical_size_bytes` | Gauge | - | Total logical size of all clone data |
| `dblab_clone_total_cpu_usage_percent` | Gauge | - | Total CPU usage percentage across all clone containers |
| `dblab_clone_avg_cpu_usage_percent` | Gauge | - | Average CPU usage percentage across all clone containers with valid data |
| `dblab_clone_total_memory_usage_bytes` | Gauge | - | Total memory usage in bytes across all clone containers |
| `dblab_clone_total_memory_limit_bytes` | Gauge | - | Total memory limit in bytes across all clone containers |
| `dblab_clone_protected_count` | Gauge | - | Number of protected clones |
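
One derived figure these gauges make easy is the number of clones that are not marked as protected:

```promql
dblab_clones_total - dblab_clone_protected_count
```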
51+
52+
### Snapshot Metrics (Aggregate)
53+
54+
| Metric Name | Type | Labels | Description |
55+
|-------------|------|--------|-------------|
56+
| `dblab_snapshots_total` | Gauge | - | Total number of snapshots |
57+
| `dblab_snapshots_by_pool` | Gauge | `pool` | Number of snapshots by pool |
58+
| `dblab_snapshot_max_age_seconds` | Gauge | - | Maximum age of any snapshot in seconds |
59+
| `dblab_snapshot_total_physical_size_bytes` | Gauge | - | Total physical disk space used by all snapshots |
60+
| `dblab_snapshot_total_logical_size_bytes` | Gauge | - | Total logical size of all snapshot data |
61+
| `dblab_snapshot_max_data_lag_seconds` | Gauge | - | Maximum data lag of any snapshot in seconds |
62+
| `dblab_snapshot_total_num_clones` | Gauge | - | Total number of clones across all snapshots |
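
A useful derived ratio is the average number of clones per snapshot (the expression is undefined when there are no snapshots):

```promql
dblab_snapshot_total_num_clones / dblab_snapshots_total
```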

### Branch Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_branches_total` | Gauge | - | Total number of branches |

### Dataset Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| `dblab_datasets_total` | Gauge | `pool` | Total number of datasets (slots) in the pool |
| `dblab_datasets_available` | Gauge | `pool` | Number of available (non-busy) dataset slots for reuse |
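
To watch pool capacity for new clones, the fraction of dataset slots still available per pool can be expressed as:

```promql
dblab_datasets_available / dblab_datasets_total
```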
76+
77+
### Observability Metrics
78+
79+
These metrics help monitor the health of the metrics collection system itself.
80+
81+
| Metric Name | Type | Labels | Description |
82+
|-------------|------|--------|-------------|
83+
| `dblab_scrape_success_timestamp` | Gauge | - | Unix timestamp of last successful metrics collection |
84+
| `dblab_scrape_duration_seconds` | Gauge | - | Duration of last metrics collection in seconds |
85+
| `dblab_scrape_errors_total` | Counter | - | Total number of errors during metrics collection |
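
Since `dblab_scrape_errors_total` is a counter, rate-style functions apply; for example, the number of collection errors over the last hour:

```promql
increase(dblab_scrape_errors_total[1h])
```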
86+
87+
## Prometheus Configuration
88+
89+
Add the following to your `prometheus.yml`:
90+
91+
```yaml
92+
scrape_configs:
93+
- job_name: 'dblab'
94+
static_configs:
95+
- targets: ['<dblab-host>:<dblab-port>']
96+
metrics_path: /metrics
97+
```
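
If you scrape several DBLab instances, you may want an explicit scrape interval and per-target labels. A sketch (the target host and the `environment` value below are placeholders, not defaults):

```yaml
scrape_configs:
  - job_name: 'dblab'
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['dblab-staging.example.com:2345']  # placeholder host
        labels:
          environment: staging                       # arbitrary label
```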

## Example Queries

### Free Disk Space Percentage

```promql
100 * dblab_disk_free_bytes / dblab_disk_total_bytes
```

### Number of Active Clones

```promql
dblab_clones_total
```

### Maximum Clone Age in Hours

```promql
dblab_clone_max_age_seconds / 3600
```

### Data Freshness (maximum snapshot data lag, in minutes)

```promql
dblab_snapshot_max_data_lag_seconds / 60
```

### Total Memory Usage Across All Clones

```promql
dblab_clone_total_memory_usage_bytes
```

### Average CPU Usage Across All Clones

```promql
dblab_clone_avg_cpu_usage_percent
```

### Clones by Status

```promql
dblab_clones_by_status
```

### Metrics Collection Health (seconds since the last successful collection)

```promql
time() - dblab_scrape_success_timestamp
```

## Alerting Examples

### Low Disk Space Alert

```yaml
- alert: DBLabLowDiskSpace
  expr: (dblab_disk_free_bytes / dblab_disk_total_bytes) * 100 < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "DBLab low disk space"
    description: "DBLab pool {{ $labels.pool }} has less than 20% free disk space"
```

### Stale Snapshot Alert

```yaml
- alert: DBLabStaleSnapshot
  expr: dblab_snapshot_max_data_lag_seconds > 86400
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "DBLab snapshot data is stale"
    description: "DBLab snapshot data is more than 24 hours old"
```

### High Clone Count Alert

```yaml
- alert: DBLabHighCloneCount
  expr: dblab_clones_total > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "DBLab has many clones"
    description: "DBLab has {{ $value }} clones running"
```

### Metrics Collection Stale Alert

```yaml
- alert: DBLabMetricsStale
  expr: time() - dblab_scrape_success_timestamp > 300
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "DBLab metrics collection is stale"
    description: "DBLab metrics have not been updated for more than 5 minutes"
```
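
These snippets are individual alerting rules; in Prometheus they belong inside a rule group in a rules file that `prometheus.yml` references via `rule_files`. A minimal sketch (the file name and group name below are arbitrary):

```yaml
# dblab-alerts.yml
groups:
  - name: dblab
    rules:
      - alert: DBLabLowDiskSpace
        expr: (dblab_disk_free_bytes / dblab_disk_total_bytes) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DBLab low disk space"
          description: "DBLab pool {{ $labels.pool }} has less than 20% free disk space"

# In prometheus.yml:
# rule_files:
#   - dblab-alerts.yml
```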

## OpenTelemetry Integration

DBLab metrics can be exported to OpenTelemetry-compatible backends using the OpenTelemetry Collector. This allows you to send metrics to Grafana Cloud, Datadog, New Relic, and other observability platforms.

### Quick Start

1. Install the OpenTelemetry Collector:
   ```bash
   # Using Docker
   docker pull otel/opentelemetry-collector-contrib:latest
   ```

2. Copy the example configuration:
   ```bash
   cp engine/configs/otel-collector.example.yml otel-collector.yml
   ```

3. Edit `otel-collector.yml` to configure your backend:
   ```yaml
   exporters:
     otlp:
       endpoint: "your-otlp-endpoint:4317"
       headers:
         Authorization: "Bearer <your-token>"
   ```

4. Run the collector:
   ```bash
   docker run -v $(pwd)/otel-collector.yml:/etc/otelcol/config.yaml \
     -p 4317:4317 -p 8889:8889 \
     otel/opentelemetry-collector-contrib:latest
   ```
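
If you prefer Docker Compose over the `docker run` command in step 4, an equivalent sketch (service name and file paths are illustrative; the published ports mirror the command above):

```yaml
# docker-compose.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-collector.yml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "8889:8889"   # Prometheus exporter (if enabled in the collector pipeline)
```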
235+
236+
### Architecture
237+
238+
```
239+
┌─────────────┐ scrape ┌──────────────────┐ OTLP ┌─────────────┐
240+
│ DBLab │ ──────────────► │ OTel Collector │ ──────────────► │ Backend │
241+
│ /metrics │ :2345 │ │ :4317 │ (Grafana, │
242+
└─────────────┘ └──────────────────┘ │ Datadog) │
243+
└─────────────┘
244+
```

### Supported Backends

The OTel Collector can export to:
- **Grafana Cloud** - Use OTLP exporter with Grafana Cloud endpoint
- **Datadog** - Use the datadog exporter
- **New Relic** - Use OTLP exporter with New Relic endpoint
- **Prometheus Remote Write** - Use prometheusremotewrite exporter
- **AWS CloudWatch** - Use awsemf exporter
- **Any OTLP-compatible backend**

See `engine/configs/otel-collector.example.yml` for a complete configuration example.

README.md

Lines changed: 1 addition & 0 deletions
@@ -119,6 +119,7 @@ Read more:
 - Resource quotas: CPU, RAM
 - Monitoring & security
 - `/healthz` API endpoint (no auth), extended `/status` endpoint ([API docs](https://api.dblab.dev))
+- Prometheus metrics endpoint (`/metrics`) for monitoring
 - Netdata module for insights

 ## How to contribute

engine/configs/otel-collector.example.yml

Lines changed: 115 additions & 0 deletions

@@ -0,0 +1,115 @@
```yaml
# OpenTelemetry Collector Configuration for Database Lab Engine
#
# This configuration scrapes Prometheus metrics from DBLab's /metrics endpoint
# and exports them via OTLP protocol to observability backends.
#
# Usage:
#   1. Install the OpenTelemetry Collector: https://opentelemetry.io/docs/collector/installation/
#   2. Copy this file and adjust the configuration for your environment
#   3. Run: otelcol --config otel-collector.yml
#
# For Docker:
#   docker run -v $(pwd)/otel-collector.yml:/etc/otelcol/config.yaml \
#     otel/opentelemetry-collector-contrib:latest

receivers:
  # Scrape Prometheus metrics from DBLab Engine
  prometheus:
    config:
      scrape_configs:
        - job_name: 'dblab'
          scrape_interval: 15s
          scrape_timeout: 10s
          static_configs:
            - targets: ['localhost:2345']
          # Optional: Add labels to identify this instance
          relabel_configs:
            - source_labels: []
              target_label: environment
              replacement: 'production'

  # Optional: Collect host metrics
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network:

processors:
  # Batch metrics for efficient export
  batch:
    timeout: 10s
    send_batch_size: 1000

  # Add resource attributes
  resource:
    attributes:
      - key: service.name
        value: dblab-engine
        action: upsert
      - key: service.version
        value: "3.0"
        action: upsert

  # Optional: Filter or transform metrics
  filter:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - dblab_.*

exporters:
  # Export to OTLP-compatible backends (Grafana Cloud, Datadog, etc.)
  otlp:
    endpoint: "your-otlp-endpoint:4317"
    tls:
      insecure: false
    headers:
      # Add authentication headers as needed
      # Authorization: "Bearer <token>"

  # Alternative: Export to Prometheus remote write
  prometheusremotewrite:
    endpoint: "https://prometheus.example.com/api/v1/write"
    # tls:
    #   ca_file: /path/to/ca.crt
    # headers:
    #   Authorization: "Bearer <token>"

  # Debug: Log metrics to console (useful for testing)
  debug:
    verbosity: detailed

  # Alternative: Keep Prometheus format for local scraping
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: dblab

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch, resource]
      # Choose your exporter(s):
      # - Use 'otlp' for OTLP-compatible backends
      # - Use 'prometheusremotewrite' for Prometheus remote write
      # - Use 'debug' for testing
      exporters: [debug]

  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
```
