A highly extensible Kubernetes metrics collector that provides Prometheus-compatible monitoring for cluster resources, infrastructure components, and external systems. Built with a modular plugin architecture, hot-reload capability, and dynamic CRD monitoring support.
Sealos State Metrics is designed as a modern alternative to kube-state-metrics with significant enhancements:
- Modular Plugin Architecture: Enable only the collectors you need, add new collectors without modifying core code
- Hot Configuration Reload: Update configuration and TLS certificates without pod restarts
- Dynamic CRD Monitoring: Monitor any Custom Resource Definition through runtime configuration, no code changes required
- Unified Management: Replace multiple scattered exporters with a single, consistent monitoring solution
- Flexible Deployment: Support both DaemonSet (node-level) and Deployment (cluster-level) modes
- Leader Election Support: Fine-grained control per collector to avoid duplicate metrics
- External System Monitoring: Built-in support for databases, domains, cloud accounts, and more
Built on a factory-based registration pattern inspired by Prometheus node_exporter:
- 8+ Built-in Collectors: Domain health, node conditions, database connectivity, LVM storage, zombie processes, image pull tracking, cloud balances, and more
- Easy to Extend: Add custom collectors by implementing a simple interface
- Lazy Initialization: Collectors are only instantiated when enabled
- Lifecycle Management: Unified start/stop/health-check interfaces (see the sketch below)
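A hedged sketch of what that lifecycle contract could look like; the names and signatures are assumptions, not the project's actual API:

```go
package collector

import "context"

// Sketch of the unified lifecycle contract described above; the real
// interface in pkg/collector may differ in names and signatures.
type Collector interface {
	Name() string
	Start(ctx context.Context) error // begin polling or informer watching
	Stop() error                     // release resources on shutdown or reload
	Healthy() bool                   // feeds the /health endpoint
}
```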
Zero-downtime configuration updates:
- File-based Configuration: YAML configuration with automatic reload on changes
- Kubernetes ConfigMap Support: Detects ConfigMap updates via symlink changes
- TLS Certificate Reload: Automatically picks up cert-manager certificate rotations
- Debouncing: A 3-second delay absorbs the burst of file events from Kubernetes' atomic ConfigMap updates (see the sketch after this list)
- Partial Reload: Some settings (logging, debug server, pprof) reload without stopping collectors
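A minimal sketch of the watch-and-debounce pattern, built on github.com/fsnotify/fsnotify; function and variable names are illustrative, not the project's internals:

```go
package hotreload

import (
	"log"
	"time"

	"github.com/fsnotify/fsnotify"
)

// watchConfig reloads once per burst of file events. Kubernetes updates
// ConfigMaps atomically via symlink swaps, which can emit several events
// in quick succession, so a 3-second debounce collapses them into a
// single reload.
func watchConfig(path string, reload func()) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()
	if err := w.Add(path); err != nil {
		return err
	}
	var debounce *time.Timer
	for {
		select {
		case ev, ok := <-w.Events:
			if !ok {
				return nil
			}
			log.Printf("config event: %s", ev)
			if debounce != nil {
				debounce.Stop()
			}
			debounce = time.AfterFunc(3*time.Second, reload)
		case err, ok := <-w.Errors:
			if !ok {
				return nil
			}
			log.Printf("watch error: %v", err)
		}
	}
}
```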
Monitor any Custom Resource Definition without code changes:
collectors:
  crds:
    crds:
      - name: "my-application"
        gvr:
          group: "apps.example.com"
          version: "v1"
          resource: "applications"
        metrics:
          - type: "gauge"
            name: "app_replicas"
            path: "spec.replicas"
          - type: "conditions"
            name: "app_condition"
            path: "status.conditions"

Supported Metric Types:
- info: metadata labels (value is always 1)
- gauge: numeric values from resource fields
- count: aggregate counts by field value
- string_state: current state as a label
- map_state: state metrics for map entries
- map_gauge: numeric values from maps
- conditions: Kubernetes-style condition arrays
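For the example configuration above, the exporter would emit series along these lines (a hypothetical sample; the actual label names depend on the collector):

```
sealos_app_replicas{namespace="default",name="my-application"} 3
sealos_app_condition{namespace="default",name="my-application",condition="Ready",status="True"} 1
```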
Three-layer configuration priority system:
- Defaults: Hard-coded in each collector
- YAML File: Loaded at startup and on reload
- Environment Variables: Highest priority, perfect for containerized environments (see the sketch below)
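A minimal sketch of how this precedence could resolve for a single setting, using illustrative names rather than the real ConfigLoader:

```go
package config

import "os"

// resolveLogLevel applies the three-layer precedence: a hard-coded
// default is overridden by the YAML value, which is in turn overridden
// by an environment variable.
func resolveLogLevel(fromYAML string) string {
	level := "info" // 1. hard-coded default
	if fromYAML != "" {
		level = fromYAML // 2. YAML file
	}
	if env := os.Getenv("LOGGING_LEVEL"); env != "" {
		level = env // 3. environment variable wins
	}
	return level
}
```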
# config.yaml
server:
address: ":9090"
logging:
level: "info"
format: "json"
leaderElection:
enabled: true
leaseDuration: "15s"
collectors:
domain:
domains:
- example.com
    checkInterval: "5m"

Deployment Modes:
DaemonSet Mode (recommended for node-level monitoring):
# Each node runs collectors independently
# Non-leader collectors: lvm, zombie (node-specific)
# Leader collectors: domain, database, node (cluster-wide)

Deployment Mode (multiple replicas with leader election):
# All replicas run the same collectors
# Leader election ensures only one instance collects metrics
# Better for pure cluster-level monitoring

Leader Election:
- Granular control: each collector declares whether it needs leader election (sketched below)
- Prevents duplicate metrics for cluster-wide resources
- Automatic failover when leader pod dies
- Configurable lease duration and renewal
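A hedged sketch of how a registry could gate collectors on leadership; the interface and function names are assumptions, not the project's actual API:

```go
package registry

import "context"

// Sketch only: names here are assumptions.
type Collector interface {
	Start(ctx context.Context) error
	NeedsLeaderElection() bool
}

// startEligible starts node-local collectors on every replica, but
// cluster-wide collectors only on the elected leader, so metrics are
// never emitted twice.
func startEligible(ctx context.Context, collectors []Collector, isLeader bool) {
	for _, c := range collectors {
		if c.NeedsLeaderElection() && !isLeader {
			continue
		}
		go c.Start(ctx)
	}
}
```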
| Collector | Type | Leader Election | Description |
|---|---|---|---|
| domain | Polling | Yes | Domain health checks, TLS certificate expiry, HTTP connectivity, DNS resolution |
| node | Informer | Yes | Kubernetes node conditions (Ready, MemoryPressure, DiskPressure, etc.) |
| database | Polling | Yes | Database connectivity monitoring (MySQL, PostgreSQL, MongoDB, Redis) via KubeBlocks |
| imagepull | Informer | Yes | Container image pull performance, slow pull detection, pull failure tracking |
| zombie | Polling | No | Zombie (defunct) process detection in containers (node-level) |
| lvm | Polling | No | LVM storage metrics, volume group capacity and usage (node-level) |
| cloudbalance | Polling | Yes | Cloud provider account balance (Alibaba Cloud, Tencent Cloud, VolcEngine) |
| userbalance | Polling | Yes | Sealos user account balance from PostgreSQL database |
| crds | Informer | Yes | Dynamic monitoring of any Custom Resource Definition |
💡 New collectors are easy to add! See Creating Custom Collectors below.
# Add Helm repository (if available)
helm repo add sealos https://charts.sealos.io
helm repo update
# Install with default configuration
helm install sealos-state-metrics sealos/sealos-state-metrics \
--namespace monitoring \
--create-namespace
# Install with custom collectors
helm install sealos-state-metrics sealos/sealos-state-metrics \
--namespace monitoring \
--create-namespace \
  --set enabledCollectors="{domain,node,database,lvm}"

# Clone the repository
git clone https://github.com/labring/sealos-state-metrics.git
cd sealos-state-metrics
# Install using local Helm chart
helm install sealos-state-metrics ./deploy/charts/sealos-state-metrics \
--namespace monitoring \
--create-namespace \
  --values values-custom.yaml

# Run locally (requires kubeconfig)
docker run -d \
--name sealos-state-metrics \
-p 9090:9090 \
-v ~/.kube/config:/root/.kube/config:ro \
-v $(pwd)/config.yaml:/etc/sealos-state-metrics/config.yaml:ro \
ghcr.io/labring/sealos-state-metrics:latest \
  --config=/etc/sealos-state-metrics/config.yaml

Create a config.yaml file:
# Server configuration
server:
address: ":9090"
metricsPath: "/metrics"
healthPath: "/health"
# Logging
logging:
level: "info" # debug, info, warn, error
format: "json" # json, text
# Leader election (required for cluster-level collectors)
leaderElection:
enabled: true
leaseName: "sealos-state-metrics"
leaseDuration: "15s"
renewDeadline: "10s"
retryPeriod: "2s"
# Metrics namespace (prefix for all metrics)
metrics:
namespace: "sealos"
# Enable collectors
enabledCollectors:
- domain
- node
- database
- lvm
# Collector-specific configuration
collectors:
domain:
domains:
- example.com
- api.example.com
checkInterval: "5m"
checkTimeout: "5s"
includeCertCheck: true
includeHTTPCheck: true
database:
checkInterval: "5m"
checkTimeout: "10s"
namespaces: [] # Empty = all namespaces
lvm:
    updateInterval: "10s"

All configuration can be overridden using environment variables:
# Global settings
export SERVER_ADDRESS=":8080"
export LOGGING_LEVEL="debug"
export LEADER_ELECTION_ENABLED="false"
# Collector settings
export COLLECTORS_DOMAIN_CHECK_INTERVAL="10m"
export COLLECTORS_DOMAIN_DOMAINS="example.com,test.com"
export COLLECTORS_DATABASE_NAMESPACES="default,production"
# Arrays use comma-separated values
export ENABLED_COLLECTORS="domain,node,lvm"

What can be reloaded:
- Logging configuration (level, format)
- Debug server (enable/disable, port)
- Pprof server (enable/disable, port)
- All collector configurations
- Enabled collectors list
What requires restart:
- Main server address and port
- TLS configuration
- Authentication settings
Trigger reload:
# Update ConfigMap (Kubernetes will trigger reload automatically)
kubectl edit configmap sealos-state-metrics-config
# Or replace the config file and wait 3 seconds
kubectl create configmap sealos-state-metrics-config \
--from-file=config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

Enable ServiceMonitor in Helm values:
serviceMonitor:
enabled: true
namespace: monitoring
interval: 30s
scrapeTimeout: 10s
labels:
    prometheus: kube-prometheus

Enable VMServiceScrape in Helm values:
vmServiceScrape:
enabled: true
namespace: monitoring
interval: 30s
  scrapeTimeout: 10s

For a plain Prometheus setup without an operator, add a static scrape config:
scrape_configs:
- job_name: 'sealos-state-metrics'
static_configs:
- targets: ['sealos-state-metrics.monitoring.svc:9090']
scrape_interval: 30s
    scrape_timeout: 10s

All collectors expose self-monitoring metrics:
# Collector execution duration
state_metric_collector_duration_seconds{collector="domain"} 0.152
# Collector success status (1=success, 0=failure)
state_metric_collector_success{collector="database"} 1
# Last collection timestamp
state_metric_collector_last_collection_timestamp{collector="node"} 1699000000
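These framework metrics also support simple alerting; for example, a PromQL expression that fires whenever a collector's last run failed:

```
state_metric_collector_success == 0
```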
# Domain health status
sealos_domain_health{domain="example.com",type="resolve"} 1
sealos_domain_health{domain="example.com",type="healthy_ips"} 2
# Certificate expiry (seconds until expiration)
sealos_domain_cert_expiry_seconds{domain="example.com",ip="1.2.3.4"} 2592000
# Response time
sealos_domain_response_time_seconds{domain="example.com",ip="1.2.3.4"} 0.125
# Database connectivity (1=connected, 0=disconnected)
sealos_database_connectivity{namespace="default",database="mysql1",type="mysql"} 1
# Connection response time
sealos_database_response_time_seconds{namespace="default",database="mysql1",type="mysql"} 0.089
# Total LVM capacity per node
sealos_lvm_vgs_total_capacity{node="worker-1"} 1099511627776
# Total free space per node
sealos_lvm_vgs_total_free{node="worker-1"} 549755813888
# Storage utilization
(sealos_lvm_vgs_total_capacity - sealos_lvm_vgs_total_free) / sealos_lvm_vgs_total_capacity * 100
Adding a new collector is straightforward. Here's a minimal example:
pkg/collector/mycollector/
├── config.go
├── factory.go
└── mycollector.go
// config.go
package mycollector
import "time"
type Config struct {
CheckInterval time.Duration `yaml:"checkInterval" env:"CHECK_INTERVAL"`
Enabled bool `yaml:"enabled" env:"ENABLED"`
}
func NewDefaultConfig() *Config {
return &Config{
CheckInterval: 30 * time.Second,
Enabled: true,
}
}

// mycollector.go
package mycollector
import (
"context"
"github.com/labring/sealos-state-metrics/pkg/collector/base"
"github.com/prometheus/client_golang/prometheus"
)
type Collector struct {
*base.BaseCollector
config *Config
myMetric *prometheus.Desc
}
func (c *Collector) Poll(ctx context.Context) error {
// Fetch data and update metrics
return nil
}

// factory.go
package mycollector
import (
"github.com/labring/sealos-state-metrics/pkg/collector"
"github.com/labring/sealos-state-metrics/pkg/registry"
)
func init() {
registry.MustRegister("mycollector", NewCollector)
}
func NewCollector(ctx *collector.FactoryContext) (collector.Collector, error) {
cfg := NewDefaultConfig()
ctx.ConfigLoader.LoadModuleConfig("collectors.mycollector", cfg)
// Create and configure collector
c := &Collector{
BaseCollector: base.NewBaseCollector("mycollector", ctx.Logger),
config: cfg,
}
return c, nil
}

// pkg/collector/all/all.go
import (
_ "github.com/labring/sealos-state-metrics/pkg/collector/mycollector"
)

That's it! Your collector is now available. See existing collectors for more examples:
- Simple polling: LVM Collector
- Informer-based: Node Collector
- Complex polling: Database Collector
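Once registered, enable the collector in config.yaml like any built-in one:

```yaml
enabledCollectors:
  - mycollector
```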
- Go 1.23+
- Docker (for building images)
- Kubernetes cluster (for testing)
- Helm 3+ (for chart installation)
# Build binary
make build
# Build Docker image
make docker-build
# Build and push Docker image
make docker-push REGISTRY=ghcr.io/yourusername
# Run tests
make test
# Run linters
make lint

# Run locally with kubeconfig
go run main.go \
--config=config.example.yaml \
--log-level=debug
# Build and run
make build
./bin/sealos-state-metrics --config=config.yaml

# Install from local chart
helm install sealos-state-metrics ./deploy/charts/sealos-state-metrics \
--set image.repository=localhost:5000/sealos-state-metrics \
--set image.tag=dev \
--set image.pullPolicy=Always
# Watch logs
kubectl logs -f -l app.kubernetes.io/name=sealos-state-metrics
# Port forward for local testing
kubectl port-forward svc/sealos-state-metrics 9090:9090
# Test metrics endpoint
curl http://localhost:9090/metrics

┌─────────────────────────────────────────────────────────────┐
│ Main Server │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ HTTP Server │ │ TLS Handler │ │ Auth Handler │ │
│ │ :9090 │ │ (cert-mgr) │ │ (bearer) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌──────────┴──────────┐
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ Debug Server│ │ Pprof Server│
│ (localhost) │ │ (localhost) │
└─────────────┘ └─────────────┘
│
┌─────────────────┴─────────────────┐
│ Collector Registry │
│ (Manages all collectors) │
└───────────────┬───────────────────┘
│
        ┌───────────────┼───────────────┐
        │               │               │
  ┌─────▼────┐    ┌─────▼────┐    ┌─────▼────┐
  │  Leader  │    │Non-Leader│    │ Informer │
  │Collectors│    │Collectors│    │  Cache   │
  │ (domain, │    │  (lvm,   │    │          │
  │ database)│    │ zombie)  │    │          │
  └──────────┘    └──────────┘    └──────────┘
Priority: Defaults → YAML File → Environment Variables
┌──────────────┐
│ Hard-coded │
│ Defaults │──┐
└──────────────┘ │
▼
┌──────────────┐ ┌──────────────┐
│ YAML Config │─→│ ConfigLoader │
│ File │ │ (composite) │
└──────────────┘ └──────┬───────┘
│
┌──────────────┐ │
│ Environment │─────────┤
│ Variables │ │
└──────────────┘ ▼
┌──────────────┐
│ Final Config │
└──────────────┘
ConfigMap Update
│
▼
┌──────────────┐ ┌──────────────┐
│ fsnotify │───→│ Debouncer │
│ Watcher │ │ (3 seconds) │
└──────────────┘ └──────┬───────┘
│
▼
┌──────────────┐
│ Validate New │
│ Config │
└──────┬───────┘
│
┌──────▼───────┐
│ Stop All │
│ Collectors │
└──────┬───────┘
│
┌──────▼───────┐
│ Reinitialize │
│ Collectors │
└──────┬───────┘
│
┌──────▼───────┐
│ Start │
│ Collectors │
└──────────────┘
| Feature | kube-state-metrics | Sealos State Metrics |
|---|---|---|
| Kubernetes Resources | ✅ All built-in resources | ✅ Informer-based collectors |
| Custom Resources (CRD) | ⚠️ Static config file | ✅ Dynamic (runtime config) |
| Configuration | ⚠️ CLI flags | ✅ YAML + Env vars + Hot reload |
| Hot Reload | ❌ Not supported | ✅ Full support |
| External Systems | ❌ Kubernetes only | ✅ Databases, Domains, Cloud |
| Plugin Architecture | ❌ Monolithic | ✅ Modular collectors |
| Deployment Modes | ⚠️ Deployment (sharding) | ✅ DaemonSet + Deployment |
| Leader Election | ❌ Not supported | ✅ Per-collector control |
| Memory Optimization | ✅ Built-in | ✅ Transform support |
kubectl get pods -n monitoring -l app.kubernetes.io/name=sealos-state-metrics
kubectl describe pod -n monitoring -l app.kubernetes.io/name=sealos-state-metrics

# Follow logs
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics -f
# Search for errors
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics | grep -i error
# View specific collector logs
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics | grep "collector=domain"

# Port forward
kubectl port-forward -n monitoring svc/sealos-state-metrics 9090:9090
# Fetch metrics
curl http://localhost:9090/metrics
# Check specific collector metrics
curl http://localhost:9090/metrics | grep sealos_domain
# Check framework metrics
curl http://localhost:9090/metrics | grep state_metric_collector

# Check health endpoint
curl http://localhost:9090/health
# Expected response:
# {"status":"ok","collectors":{"domain":"healthy","node":"healthy"}}# Check lease
kubectl get lease -n monitoring sealos-state-metrics -o yaml
# Check holder identity
kubectl get lease -n monitoring sealos-state-metrics \
  -o jsonpath='{.spec.holderIdentity}'

Issue: Collector not starting
# Check RBAC permissions
kubectl auth can-i get nodes --as=system:serviceaccount:monitoring:sealos-state-metrics
# View collector-specific logs
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics | grep "collector=yourname"

Issue: No metrics for a collector
# Verify collector is enabled
kubectl get configmap -n monitoring sealos-state-metrics-config -o yaml
# Check collector health
curl http://localhost:9090/health

Issue: Configuration not reloading
# Check file watcher logs
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics | grep "reload"
# Trigger manual restart (if needed)
kubectl rollout restart deployment/sealos-state-metrics -n monitoring

The application requires the following Kubernetes permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: sealos-state-metrics
rules:
# Core resources
- apiGroups: [""]
resources: ["nodes", "pods", "services", "secrets", "namespaces"]
verbs: ["get", "list", "watch"]
# Events (for imagepull collector)
- apiGroups: [""]
resources: ["events"]
verbs: ["get", "list", "watch"]
# Leader election
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "create", "update"]Some collectors (database, cloudbalance) require access to Secrets:
- Use namespace-based RBAC to limit secret access
- Consider using external secret managers (Vault, External Secrets Operator)
- Rotate credentials regularly
- Monitor secret access via audit logs
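If you enable the crds collector, the ClusterRole above must also grant read access to each monitored custom resource; for the earlier apps.example.com example that would be:

```yaml
# Hypothetical rule for the apps.example.com example; adjust the group
# and resource to the CRDs you actually monitor.
- apiGroups: ["apps.example.com"]
  resources: ["applications"]
  verbs: ["get", "list", "watch"]
```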
If using network policies, allow:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: sealos-state-metrics
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: sealos-state-metrics
policyTypes:
- Ingress
- Egress
ingress:
# Allow Prometheus to scrape metrics
- from:
- namespaceSelector:
matchLabels:
name: monitoring
ports:
- protocol: TCP
port: 9090
egress:
# Allow Kubernetes API access
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
component: apiserver
ports:
- protocol: TCP
port: 6443
# Allow DNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
      port: 53

Small Clusters (< 50 nodes):
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 500m
    memory: 256Mi

Medium Clusters (50-200 nodes):
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 1000m
    memory: 512Mi

Large Clusters (> 200 nodes):
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 2000m
    memory: 1Gi

# Reduce Kubernetes API load
kubernetes:
qps: 50 # Increase for large clusters
burst: 100 # Burst capacity
# Adjust collector intervals
collectors:
domain:
checkInterval: "10m" # Increase for less critical checks
database:
checkInterval: "5m" # Balance between freshness and load
checkTimeout: "10s" # Adjust based on network latency
zombie:
checkInterval: "30s" # Frequent for quick detectionWe welcome contributions! Please see CONTRIBUTING.md for guidelines.
- Report Bugs: Open an issue with reproduction steps
- Suggest Features: Describe your use case and proposed solution
- Submit PRs: Fork, create a feature branch, and submit a pull request
- Add Collectors: Create new collectors for community benefit
- Improve Docs: Fix typos, add examples, clarify explanations
# Fork and clone
git clone https://github.com/yourusername/sealos-state-metrics.git
cd sealos-state-metrics
# Create feature branch
git checkout -b feature/my-collector
# Make changes and test
make test
make lint
# Commit and push
git commit -m "feat: add my-collector for monitoring X"
git push origin feature/my-collector
# Open pull request

Licensed under the Apache License, Version 2.0. See LICENSE for full text.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Community: Sealos Community
- Documentation: Full Documentation
- Sealos - Cloud operating system based on Kubernetes
- kube-state-metrics - Original inspiration
- Prometheus - Metrics collection and alerting
- VictoriaMetrics - Time series database
- Cardinality control and metrics filtering
- Plugin discovery and dynamic loading
- Multi-cluster support
- OpenTelemetry support (OTLP exporter)
- Built-in recording rules (VMRule integration)
- Alerting integration (AlertManager direct support)
- Stability guarantees
- Performance benchmarks
- Production hardening
Made with ❤️ by the Sealos team