Sealos State Metrics

A highly extensible Kubernetes metrics collector that provides Prometheus-compatible monitoring for cluster resources, infrastructure components, and external systems. Built with a modular plugin architecture, hot-reload capability, and dynamic CRD monitoring support.

Overview

Sealos State Metrics is designed as a modern alternative to kube-state-metrics with significant enhancements:

Modular Plugin Architecture: Enable only the collectors you need, add new collectors without modifying core code
Hot Configuration Reload: Update configuration and TLS certificates without pod restarts
Dynamic CRD Monitoring: Monitor any Custom Resource Definition through runtime configuration, no code changes required
Unified Management: Replace multiple scattered exporters with a single, consistent monitoring solution
Flexible Deployment: Support both DaemonSet (node-level) and Deployment (cluster-level) modes
Leader Election Support: Fine-grained control per collector to avoid duplicate metrics
External System Monitoring: Built-in support for databases, domains, cloud accounts, and more

Key Features

🔌 Extensible Collector System

Built on a factory-based registration pattern inspired by Prometheus node_exporter:

8+ Built-in Collectors: Domain health, node conditions, database connectivity, LVM storage, zombie processes, image pull tracking, cloud balances, and more
Easy to Extend: Add custom collectors by implementing a simple interface
Lazy Initialization: Collectors are only instantiated when enabled
Lifecycle Management: Unified start/stop/health check interfaces

🔥 Hot Configuration Reload

Zero-downtime configuration updates:

File-based Configuration: YAML configuration with automatic reload on changes
Kubernetes ConfigMap Support: Detects ConfigMap updates via symlink changes
TLS Certificate Reload: Automatically picks up cert-manager certificate rotations
Debouncing: Intelligent 3-second delay to handle Kubernetes atomic updates
Partial Reload: Some settings (logging, debug server, pprof) reload without stopping collectors

🎯 Dynamic CRD Monitoring

Monitor any Custom Resource Definition without code changes:

collectors:
  crds:
    crds:
      - name: "my-application"
        gvr:
          group: "apps.example.com"
          version: "v1"
          resource: "applications"
        metrics:
          - type: "gauge"
            name: "app_replicas"
            path: "spec.replicas"
          - type: "conditions"
            name: "app_condition"
            path: "status.conditions"

Supported Metric Types:

info: Metadata labels (value always 1)
gauge: Numeric values from resource fields
count: Aggregate counts by field value
string_state: Current state as a label
map_state: State metrics for map entries
map_gauge: Numeric values from maps
conditions: Kubernetes-style condition arrays

🎛️ Flexible Configuration

Three-layer configuration priority system:

Defaults: Hard-coded in each collector
YAML File: Loaded at startup and on reload
Environment Variables: Highest priority, perfect for containerized environments

# config.yaml
server:
  address: ":9090"

logging:
  level: "info"
  format: "json"

leaderElection:
  enabled: true
  leaseDuration: "15s"

collectors:
  domain:
    domains:
      - example.com
    checkInterval: "5m"

🏗️ Production-Ready Architecture

Deployment Modes:

DaemonSet Mode (recommended for node-level monitoring):

# Each node runs collectors independently
# Non-leader collectors: lvm, zombie (node-specific)
# Leader collectors: domain, database, node (cluster-wide)

Deployment Mode (multiple replicas with leader election):

# All replicas run the same collectors
# Leader election ensures only one instance collects metrics
# Better for pure cluster-level monitoring

Leader Election:

Granular control: each collector declares if it needs leader election
Prevents duplicate metrics for cluster-wide resources
Automatic failover when leader pod dies
Configurable lease duration and renewal

Available Collectors

Collector	Type	Leader Election	Description
domain	Polling	Yes	Domain health checks, TLS certificate expiry, HTTP connectivity, DNS resolution
node	Informer	Yes	Kubernetes node conditions (Ready, MemoryPressure, DiskPressure, etc.)
database	Polling	Yes	Database connectivity monitoring (MySQL, PostgreSQL, MongoDB, Redis) via KubeBlocks
imagepull	Informer	Yes	Container image pull performance, slow pull detection, pull failure tracking
zombie	Polling	Yes	Zombie (defunct) process detection in containers
lvm	Polling	No	LVM storage metrics, volume group capacity and usage (node-level)
cloudbalance	Polling	Yes	Cloud provider account balance (Alibaba Cloud, Tencent Cloud, VolcEngine)
userbalance	Polling	Yes	Sealos user account balance from PostgreSQL database
crds	Informer	Yes	Dynamic monitoring of any Custom Resource Definition

💡 New collectors are easy to add! See Creating Custom Collectors below.

Quick Start

Installation via Helm

# Add Helm repository (if available)
helm repo add sealos https://charts.sealos.io
helm repo update

# Install with default configuration
helm install sealos-state-metrics sealos/sealos-state-metrics \
  --namespace monitoring \
  --create-namespace

# Install with custom collectors
helm install sealos-state-metrics sealos/sealos-state-metrics \
  --namespace monitoring \
  --create-namespace \
  --set enabledCollectors="{domain,node,database,lvm}"

Installation from Source

# Clone the repository
git clone https://github.com/labring/sealos-state-metrics.git
cd sealos-state-metrics

# Install using local Helm chart
helm install sealos-state-metrics ./deploy/charts/sealos-state-metrics \
  --namespace monitoring \
  --create-namespace \
  --values values-custom.yaml

Docker Deployment

# Run locally (requires kubeconfig)
docker run -d \
  --name sealos-state-metrics \
  -p 9090:9090 \
  -v ~/.kube/config:/root/.kube/config:ro \
  -v $(pwd)/config.yaml:/etc/sealos-state-metrics/config.yaml:ro \
  ghcr.io/labring/sealos-state-metrics:latest \
  --config=/etc/sealos-state-metrics/config.yaml

Configuration

Basic Configuration

Create a config.yaml file:

# Server configuration
server:
  address: ":9090"
  metricsPath: "/metrics"
  healthPath: "/health"

# Logging
logging:
  level: "info"  # debug, info, warn, error
  format: "json" # json, text

# Leader election (required for cluster-level collectors)
leaderElection:
  enabled: true
  leaseName: "sealos-state-metrics"
  leaseDuration: "15s"
  renewDeadline: "10s"
  retryPeriod: "2s"

# Metrics namespace (prefix for all metrics)
metrics:
  namespace: "sealos"

# Enable collectors
enabledCollectors:
  - domain
  - node
  - database
  - lvm

# Collector-specific configuration
collectors:
  domain:
    domains:
      - example.com
      - api.example.com
    checkInterval: "5m"
    checkTimeout: "5s"
    includeCertCheck: true
    includeHTTPCheck: true

  database:
    checkInterval: "5m"
    checkTimeout: "10s"
    namespaces: []  # Empty = all namespaces

  lvm:
    updateInterval: "10s"

Environment Variable Overrides

All configuration can be overridden using environment variables:

# Global settings
export SERVER_ADDRESS=":8080"
export LOGGING_LEVEL="debug"
export LEADER_ELECTION_ENABLED="false"

# Collector settings
export COLLECTORS_DOMAIN_CHECK_INTERVAL="10m"
export COLLECTORS_DOMAIN_DOMAINS="example.com,test.com"
export COLLECTORS_DATABASE_NAMESPACES="default,production"

# Arrays use comma-separated values
export ENABLED_COLLECTORS="domain,node,lvm"

Configuration Hot Reload

What can be reloaded:

Logging configuration (level, format)
Debug server (enable/disable, port)
Pprof server (enable/disable, port)
All collector configurations
Enabled collectors list

What requires restart:

Main server address and port
TLS configuration
Authentication settings

Trigger reload:

# Update ConfigMap (Kubernetes will trigger reload automatically)
kubectl edit configmap sealos-state-metrics-config

# Or replace the config file and wait 3 seconds
kubectl create configmap sealos-state-metrics-config \
  --from-file=config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

Monitoring Integration

Prometheus Operator

Enable ServiceMonitor in Helm values:

serviceMonitor:
  enabled: true
  namespace: monitoring
  interval: 30s
  scrapeTimeout: 10s
  labels:
    prometheus: kube-prometheus

VictoriaMetrics Operator

Enable VMServiceScrape in Helm values:

vmServiceScrape:
  enabled: true
  namespace: monitoring
  interval: 30s
  scrapeTimeout: 10s

Manual Prometheus Configuration

scrape_configs:
  - job_name: 'sealos-state-metrics'
    static_configs:
      - targets: ['sealos-state-metrics.monitoring.svc:9090']
    scrape_interval: 30s
    scrape_timeout: 10s

Metrics Examples

Framework Metrics

All collectors expose self-monitoring metrics:

# Collector execution duration
state_metric_collector_duration_seconds{collector="domain"} 0.152

# Collector success status (1=success, 0=failure)
state_metric_collector_success{collector="database"} 1

# Last collection timestamp
state_metric_collector_last_collection_timestamp{collector="node"} 1699000000

Domain Collector Metrics

# Domain health status
sealos_domain_health{domain="example.com",type="resolve"} 1
sealos_domain_health{domain="example.com",type="healthy_ips"} 2

# Certificate expiry (seconds until expiration)
sealos_domain_cert_expiry_seconds{domain="example.com",ip="1.2.3.4"} 2592000

# Response time
sealos_domain_response_time_seconds{domain="example.com",ip="1.2.3.4"} 0.125

Database Collector Metrics

# Database connectivity (1=connected, 0=disconnected)
sealos_database_connectivity{namespace="default",database="mysql1",type="mysql"} 1

# Connection response time
sealos_database_response_time_seconds{namespace="default",database="mysql1",type="mysql"} 0.089

LVM Collector Metrics

# Total LVM capacity per node
sealos_lvm_vgs_total_capacity{node="worker-1"} 1099511627776

# Total free space per node
sealos_lvm_vgs_total_free{node="worker-1"} 549755813888

# Storage utilization
(sealos_lvm_vgs_total_capacity - sealos_lvm_vgs_total_free) / sealos_lvm_vgs_total_capacity * 100

Creating Custom Collectors

Adding a new collector is straightforward. Here's a minimal example:

1. Create Directory Structure

pkg/collector/mycollector/
├── config.go
├── factory.go
└── mycollector.go

2. Define Configuration

// config.go
package mycollector

import "time"

type Config struct {
    CheckInterval time.Duration `yaml:"checkInterval" env:"CHECK_INTERVAL"`
    Enabled       bool          `yaml:"enabled" env:"ENABLED"`
}

func NewDefaultConfig() *Config {
    return &Config{
        CheckInterval: 30 * time.Second,
        Enabled:       true,
    }
}

3. Implement Collector

// mycollector.go
package mycollector

import (
    "context"
    "github.com/labring/sealos-state-metrics/pkg/collector/base"
    "github.com/prometheus/client_golang/prometheus"
)

type Collector struct {
    *base.BaseCollector
    config *Config
    myMetric *prometheus.Desc
}

func (c *Collector) Poll(ctx context.Context) error {
    // Fetch data and update metrics
    return nil
}

4. Create Factory

// factory.go
package mycollector

import (
    "github.com/labring/sealos-state-metrics/pkg/collector"
    "github.com/labring/sealos-state-metrics/pkg/registry"
)

func init() {
    registry.MustRegister("mycollector", NewCollector)
}

func NewCollector(ctx *collector.FactoryContext) (collector.Collector, error) {
    cfg := NewDefaultConfig()
    ctx.ConfigLoader.LoadModuleConfig("collectors.mycollector", cfg)

    // Create and configure collector
    c := &Collector{
        BaseCollector: base.NewBaseCollector("mycollector", ctx.Logger),
        config: cfg,
    }

    return c, nil
}

5. Register in all.go

// pkg/collector/all/all.go
import (
    _ "github.com/labring/sealos-state-metrics/pkg/collector/mycollector"
)

That's it! Your collector is now available. See existing collectors for more examples:

Simple polling: LVM Collector
Informer-based: Node Collector
Complex polling: Database Collector

Development

Prerequisites

Go 1.23+
Docker (for building images)
Kubernetes cluster (for testing)
Helm 3+ (for chart installation)

Building

# Build binary
make build

# Build Docker image
make docker-build

# Build and push Docker image
make docker-push REGISTRY=ghcr.io/yourusername

# Run tests
make test

# Run linters
make lint

Local Development

# Run locally with kubeconfig
go run main.go \
  --config=config.example.yaml \
  --log-level=debug

# Build and run
make build
./bin/sealos-state-metrics --config=config.yaml

Testing in Kubernetes

# Install from local chart
helm install sealos-state-metrics ./deploy/charts/sealos-state-metrics \
  --set image.repository=localhost:5000/sealos-state-metrics \
  --set image.tag=dev \
  --set image.pullPolicy=Always

# Watch logs
kubectl logs -f -l app.kubernetes.io/name=sealos-state-metrics

# Port forward for local testing
kubectl port-forward svc/sealos-state-metrics 9090:9090

# Test metrics endpoint
curl http://localhost:9090/metrics

Architecture

Component Overview

┌─────────────────────────────────────────────────────────────┐
│                     Main Server                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │ HTTP Server  │  │ TLS Handler  │  │ Auth Handler │      │
│  │ :9090        │  │ (cert-mgr)   │  │ (bearer)     │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                           │
                ┌──────────┴──────────┐
                │                     │
         ┌──────▼──────┐      ┌──────▼──────┐
         │ Debug Server│      │ Pprof Server│
         │ (localhost) │      │ (localhost) │
         └─────────────┘      └─────────────┘
                           │
         ┌─────────────────┴─────────────────┐
         │     Collector Registry            │
         │  (Manages all collectors)         │
         └───────────────┬───────────────────┘
                         │
         ┌───────────────┼───────────────┐
         │               │               │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │ Leader  │    │Non-Leader│   │ Informer│
    │Collectors│    │Collectors│   │  Cache  │
    │(domain, │    │  (lvm,   │   │         │
    │database)│    │ zombie)  │   │         │
    └─────────┘    └─────────┘    └─────────┘

Configuration Flow

Priority: Defaults → YAML File → Environment Variables

┌──────────────┐
│ Hard-coded   │
│  Defaults    │──┐
└──────────────┘  │
                  ▼
┌──────────────┐  ┌──────────────┐
│ YAML Config  │─→│ ConfigLoader │
│   File       │  │ (composite)  │
└──────────────┘  └──────┬───────┘
                         │
┌──────────────┐         │
│ Environment  │─────────┤
│  Variables   │         │
└──────────────┘         ▼
                  ┌──────────────┐
                  │ Final Config │
                  └──────────────┘

Hot Reload Flow

ConfigMap Update
      │
      ▼
┌──────────────┐    ┌──────────────┐
│  fsnotify    │───→│  Debouncer   │
│  Watcher     │    │  (3 seconds) │
└──────────────┘    └──────┬───────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ Validate New │
                    │   Config     │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ Stop All     │
                    │ Collectors   │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ Reinitialize │
                    │ Collectors   │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ Start        │
                    │ Collectors   │
                    └──────────────┘

Comparison with kube-state-metrics

Feature	kube-state-metrics	Sealos State Metrics
Kubernetes Resources	✅ All built-in resources	✅ Informer-based collectors
Custom Resources (CRD)	⚠️ Static (compile-time)	✅ Dynamic (runtime config)
Configuration	⚠️ CLI flags only	✅ YAML + Env vars + Hot reload
Hot Reload	❌ Not supported	✅ Full support
External Systems	❌ Kubernetes only	✅ Databases, Domains, Cloud
Plugin Architecture	❌ Monolithic	✅ Modular collectors
Deployment Modes	⚠️ Single mode	✅ DaemonSet + Deployment
Leader Election	⚠️ All-or-nothing	✅ Per-collector control
Memory Optimization	✅ Built-in	✅ Transform support

Troubleshooting

Check Pod Status

kubectl get pods -n monitoring -l app.kubernetes.io/name=sealos-state-metrics
kubectl describe pod -n monitoring -l app.kubernetes.io/name=sealos-state-metrics

View Logs

# Follow logs
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics -f

# Search for errors
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics | grep -i error

# View specific collector logs
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics | grep "collector=domain"

Access Metrics Endpoint

# Port forward
kubectl port-forward -n monitoring svc/sealos-state-metrics 9090:9090

# Fetch metrics
curl http://localhost:9090/metrics

# Check specific collector metrics
curl http://localhost:9090/metrics | grep sealos_domain

# Check framework metrics
curl http://localhost:9090/metrics | grep state_metric_collector

Health Check

# Check health endpoint
curl http://localhost:9090/health

# Expected response:
# {"status":"ok","collectors":{"domain":"healthy","node":"healthy"}}

Leader Election Status

# Check lease
kubectl get lease -n monitoring sealos-state-metrics -o yaml

# Check holder identity
kubectl get lease -n monitoring sealos-state-metrics \
  -o jsonpath='{.spec.holderIdentity}'

Common Issues

Issue: Collector not starting

# Check RBAC permissions
kubectl auth can-i get nodes --as=system:serviceaccount:monitoring:sealos-state-metrics

# View collector-specific logs
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics | grep "collector=yourname"

Issue: No metrics for a collector

# Verify collector is enabled
kubectl get configmap -n monitoring sealos-state-metrics-config -o yaml

# Check collector health
curl http://localhost:9090/health

Issue: Configuration not reloading

# Check file watcher logs
kubectl logs -n monitoring -l app.kubernetes.io/name=sealos-state-metrics | grep "reload"

# Trigger manual restart (if needed)
kubectl rollout restart deployment/sealos-state-metrics -n monitoring

Security Considerations

RBAC Permissions

The application requires the following Kubernetes permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sealos-state-metrics
rules:
  # Core resources
  - apiGroups: [""]
    resources: ["nodes", "pods", "services", "secrets", "namespaces"]
    verbs: ["get", "list", "watch"]

  # Events (for imagepull collector)
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["get", "list", "watch"]

  # Leader election
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update"]

Secret Access

Some collectors (database, cloudbalance) require access to Secrets:

Use namespace-based RBAC to limit secret access
Consider using external secret managers (Vault, External Secrets Operator)
Rotate credentials regularly
Monitor secret access via audit logs

Network Policies

If using network policies, allow:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sealos-state-metrics
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: sealos-state-metrics
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow Prometheus to scrape metrics
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 9090
  egress:
    # Allow Kubernetes API access
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              component: apiserver
      ports:
        - protocol: TCP
          port: 6443
    # Allow DNS
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: UDP
          port: 53

Performance Tuning

Resource Recommendations

Small Clusters (< 50 nodes):

resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 500m
    memory: 256Mi

Medium Clusters (50-200 nodes):

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 1000m
    memory: 512Mi

Large Clusters (> 200 nodes):

resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 2000m
    memory: 1Gi

Configuration Tuning

# Reduce Kubernetes API load
kubernetes:
  qps: 50       # Increase for large clusters
  burst: 100    # Burst capacity

# Adjust collector intervals
collectors:
  domain:
    checkInterval: "10m"  # Increase for less critical checks

  database:
    checkInterval: "5m"   # Balance between freshness and load
    checkTimeout: "10s"   # Adjust based on network latency

  zombie:
    checkInterval: "30s"  # Frequent for quick detection

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

How to Contribute

Report Bugs: Open an issue with reproduction steps
Suggest Features: Describe your use case and proposed solution
Submit PRs: Fork, create a feature branch, and submit a pull request
Add Collectors: Create new collectors for community benefit
Improve Docs: Fix typos, add examples, clarify explanations

Development Workflow

# Fork and clone
git clone https://github.com/yourusername/sealos-state-metrics.git
cd sealos-state-metrics

# Create feature branch
git checkout -b feature/my-collector

# Make changes and test
make test
make lint

# Commit and push
git commit -m "feat: add my-collector for monitoring X"
git push origin feature/my-collector

# Open pull request

License

Licensed under the Apache License, Version 2.0. See LICENSE for full text.

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Community: Sealos Community
Documentation: Full Documentation

Related Projects

Sealos - Cloud operating system based on Kubernetes
kube-state-metrics - Original inspiration
Prometheus - Metrics collection and alerting
VictoriaMetrics - Time series database

Roadmap

v0.2.0

Cardinality control and metrics filtering
Plugin discovery and dynamic loading
Multi-cluster support

v0.3.0

OpenTelemetry support (OTLP exporter)
Built-in recording rules (VMRule integration)
Alerting integration (AlertManager direct support)

v1.0.0

Stability guarantees
Performance benchmarks
Production hardening

Made with ❤️ by the Sealos team

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
.github		.github
deploy		deploy
pkg		pkg
server		server
.gitignore		.gitignore
.golangci.yml		.golangci.yml
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
config.example.yaml		config.example.yaml
go.mod		go.mod
go.sum		go.sum
main.go		main.go

License

labring/sealos-state-metrics

Folders and files

Latest commit

History

Repository files navigation

Sealos State Metrics

Overview

Key Features

🔌 Extensible Collector System

🔥 Hot Configuration Reload

🎯 Dynamic CRD Monitoring

🎛️ Flexible Configuration

🏗️ Production-Ready Architecture

Available Collectors

Quick Start

Installation via Helm

Installation from Source

Docker Deployment

Configuration

Basic Configuration

Environment Variable Overrides

Configuration Hot Reload

Monitoring Integration

Prometheus Operator

VictoriaMetrics Operator

Manual Prometheus Configuration

Metrics Examples

Framework Metrics

Domain Collector Metrics

Database Collector Metrics

LVM Collector Metrics

Creating Custom Collectors

1. Create Directory Structure

2. Define Configuration

3. Implement Collector

4. Create Factory

5. Register in all.go

Development

Prerequisites

Building

Local Development

Testing in Kubernetes

Architecture

Component Overview

Configuration Flow

Hot Reload Flow

Comparison with kube-state-metrics

Troubleshooting

Check Pod Status

View Logs

Access Metrics Endpoint

Health Check

Leader Election Status

Common Issues

Security Considerations

RBAC Permissions

Secret Access

Network Policies

Performance Tuning

Resource Recommendations

Configuration Tuning

Contributing

How to Contribute

Development Workflow

License

Support

Related Projects

Roadmap

v0.2.0

v0.3.0

v1.0.0

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages