
OpenTelemetry Observability

The kubernetes-mcp-server supports distributed tracing and metrics via OpenTelemetry (OTEL). Observability is optional and disabled by default.

What Gets Traced

The server automatically traces all operations through middleware without requiring any code changes to individual tools:

  1. MCP Tool Calls - Every tool invocation with details:

    • Tool name
    • Success/failure status
    • Duration
    • Error details (when applicable)
  2. HTTP Requests - All HTTP endpoints when running in HTTP mode:

    • Request method and path
    • Response status
    • Client information
    • Duration

Note: When running in STDIO mode, only MCP tool calls are traced, since there is no HTTP server.

Metrics

The server collects and exposes metrics through two mechanisms:

  1. Stats Endpoint (/stats) - JSON endpoint for real-time statistics:

    • Tool call counts by name
    • Tool call errors
    • HTTP request counts by method/path/status
    • Server uptime
  2. OTLP Export - When an endpoint is configured, metrics are also exported to your OTLP backend every 30 seconds.

Quick Start

1. Run an OTLP Backend Locally

Option A: Jaeger (traces only)

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.io/jaegertracing/all-in-one:latest

Access the Jaeger UI at http://localhost:16686

Note: Jaeger only supports traces, not metrics. To disable metrics export and avoid warnings about MetricsService being unimplemented, set OTEL_METRICS_EXPORTER=none.

Option B: Grafana LGTM Stack (traces + metrics + logs)

For full observability with metrics support:

docker run -d --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.io/grafana/otel-lgtm:latest

Access Grafana at http://localhost:3000 (default credentials: admin/admin)

2. Enable Tracing

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Run the server
npx -y kubernetes-mcp-server@latest

3. View Traces

Make some tool calls through your MCP client, then view traces in the Jaeger UI.

Example Trace

When you call resources_get for a Pod, you'll see a trace like this in Jaeger:

Trace ID: abc123def456789
Duration: 145ms

└─ tools/call resources_get [145ms]
   ├─ mcp.method.name: tools/call
   ├─ gen_ai.tool.name: resources_get
   ├─ gen_ai.operation.name: execute_tool
   ├─ rpc.jsonrpc.version: 2.0
   ├─ network.transport: pipe
   └─ Status: OK

If the tool call triggers an HTTP request (in HTTP mode), you'll also see:

Trace ID: abc123def456789
Duration: 150ms

├─ POST /message [150ms]
│  ├─ http.request.method: POST
│  ├─ url.path: /message
│  ├─ http.response.status_code: 200
│  ├─ client.address: 192.168.1.100
│  │
│  └─ tools/call resources_get [145ms]
│     ├─ mcp.method.name: tools/call
│     ├─ gen_ai.tool.name: resources_get
│     ├─ gen_ai.operation.name: execute_tool
│     ├─ rpc.jsonrpc.version: 2.0
│     ├─ network.transport: tcp
│     └─ Status: OK

Configuration

OpenTelemetry can be configured via TOML config file or environment variables. Environment variables take precedence over TOML config values.

Note: Telemetry is automatically enabled when an endpoint is configured. Use enabled = false in TOML to explicitly disable it.

Configuration Reference

TOML Field           Environment Variable           Description
enabled              -                              Explicit enable/disable (overrides all)
endpoint             OTEL_EXPORTER_OTLP_ENDPOINT    OTLP endpoint URL
protocol             OTEL_EXPORTER_OTLP_PROTOCOL    Protocol: grpc or http/protobuf
traces_sampler       OTEL_TRACES_SAMPLER            Sampling strategy
traces_sampler_arg   OTEL_TRACES_SAMPLER_ARG        Sampling ratio (0.0-1.0)
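The precedence rules (environment variables over TOML, with an endpoint auto-enabling telemetry unless `enabled = false` is set) can be sketched as follows. This is a minimal illustration of the resolution logic, not the server's actual implementation; `resolve_telemetry_config` is a hypothetical helper:

```python
import os

def resolve_telemetry_config(toml_config: dict) -> dict:
    """Merge TOML telemetry settings with environment variables.

    Environment variables win over TOML values. Telemetry is
    auto-enabled when an endpoint is present, unless `enabled`
    was set explicitly in the TOML config.
    """
    env_map = {
        "endpoint": "OTEL_EXPORTER_OTLP_ENDPOINT",
        "protocol": "OTEL_EXPORTER_OTLP_PROTOCOL",
        "traces_sampler": "OTEL_TRACES_SAMPLER",
        "traces_sampler_arg": "OTEL_TRACES_SAMPLER_ARG",
    }
    resolved = dict(toml_config)
    for field, var in env_map.items():
        if os.environ.get(var):  # env var takes precedence over TOML
            resolved[field] = os.environ[var]
    # Auto-enable when an endpoint is configured, unless explicitly set.
    if "enabled" not in resolved:
        resolved["enabled"] = bool(resolved.get("endpoint"))
    return resolved
```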

TOML Configuration

Add a [telemetry] section to your config file:

[telemetry]
# Optional: explicitly enable/disable (omit to auto-enable when endpoint is set)
enabled = true

endpoint = "http://localhost:4317"

# Protocol: "grpc" (default) or "http/protobuf"
protocol = "grpc"

# Trace sampling strategy
# Options: "always_on", "always_off", "traceidratio", "parentbased_always_on", "parentbased_always_off", "parentbased_traceidratio"
traces_sampler = "traceidratio"

# Sampling ratio for ratio-based samplers (0.0 to 1.0)
traces_sampler_arg = 0.1

TOML Examples

Enable with endpoint:

[telemetry]
endpoint = "http://localhost:4317"

Production with sampling:

[telemetry]
endpoint = "http://tempo-distributor:4317"
traces_sampler = "traceidratio"
traces_sampler_arg = 0.05  # 5% sampling

Explicitly disable:

[telemetry]
enabled = false

Environment Variables

Environment variables take precedence over TOML config. This allows you to override config file settings at runtime.

Endpoint

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

Note: The server gracefully handles failures. If the endpoint is unreachable, the server logs a warning and continues without tracing.

Optional Variables

# Service name (defaults to "kubernetes-mcp-server")
export OTEL_SERVICE_NAME=kubernetes-mcp-server

# Service version (auto-detected from binary, rarely needs manual override)
export OTEL_SERVICE_VERSION=1.0.0

# Additional resource attributes (useful for multi-environment deployments)
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,team=platform"

Endpoint Protocols

The server supports both gRPC and HTTP/protobuf protocols:

# gRPC (default, port 4317)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# HTTP/protobuf (port 4318)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

# Secure endpoints (HTTPS/gRPC with TLS)
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-secure.example.com:4317

# Custom CA certificate (for self-signed certificates)
export OTEL_EXPORTER_OTLP_CERTIFICATE=/path/to/ca.crt

Sampling Configuration

By default, the server uses ParentBased(AlwaysSample) sampling:

  • Root spans (no parent): Always sampled (100%)
  • Child spans: Inherit parent's sampling decision

This is ideal for development but may generate high trace volumes in production.

Production Sampling

For production with high traffic, use ratio-based sampling:

# Sample 10% of traces
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1

Available Samplers

  • always_on - Sample everything (default for root spans)
  • always_off - Disable tracing entirely
  • traceidratio - Sample a percentage (requires OTEL_TRACES_SAMPLER_ARG between 0.0 and 1.0)
  • parentbased_always_on - Respect parent span, default to always_on
  • parentbased_always_off - Respect parent span, default to always_off
  • parentbased_traceidratio - Respect parent span, default to ratio
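What makes traceidratio useful in distributed systems is that the decision is derived from the trace ID itself, so every service seeing the same trace reaches the same verdict. A sketch of the idea (real SDKs differ in exactly which bytes of the ID they compare; this is illustrative, not the server's code):

```python
def trace_id_ratio_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministically decide sampling from a 32-hex-char trace ID.

    A fixed portion of the trace ID is compared against a threshold
    derived from the ratio, so the decision is stable per trace.
    """
    # Interpret the low 8 bytes of the trace ID as a 63-bit value.
    low = int(trace_id_hex[16:], 16) >> 1
    threshold = int(ratio * (1 << 63))
    return low < threshold
```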

Sampling Examples

# Development: Sample everything
export OTEL_TRACES_SAMPLER=always_on

# Production: 5% sampling (good for high-traffic services)
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.05

# Temporarily disable tracing
export OTEL_TRACES_SAMPLER=always_off

# Or just unset the endpoint
unset OTEL_EXPORTER_OTLP_ENDPOINT

Deployment Examples

Claude Code (STDIO Mode)

Add the MCP server to your project's .mcp.json or global ~/.claude/settings.json:

{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "kubernetes-mcp-server@latest"],
      "env": {
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
        "OTEL_TRACES_SAMPLER": "always_on"
      }
    }
  }
}

For Jaeger (traces only): Add "OTEL_METRICS_EXPORTER": "none" to disable metrics export.

Note: In STDIO mode, only MCP tool calls are traced (no HTTP request spans).

Kubernetes Deployment (HTTP Mode)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-mcp-server
spec:
  template:
    spec:
      containers:
      - name: kubernetes-mcp-server
        image: quay.io/containers/kubernetes_mcp_server:latest
        env:
        # OTLP endpoint (required to enable tracing)
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://tempo-distributor.observability:4317"

        # Sampling (recommended for production)
        - name: OTEL_TRACES_SAMPLER
          value: "traceidratio"
        - name: OTEL_TRACES_SAMPLER_ARG
          value: "0.1"  # 10% sampling

        # Resource attributes (helps identify this deployment)
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "deployment.environment=production,k8s.cluster.name=prod-us-west-2"

        # Kubernetes metadata (optional, helps correlate traces with K8s resources)
        - name: KUBERNETES_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName

Note: The Kubernetes metadata environment variables are optional but recommended for production deployments. They help correlate traces with specific pods, namespaces, and nodes.

Docker

docker run \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4317 \
  -e OTEL_TRACES_SAMPLER=always_on \
  quay.io/containers/kubernetes_mcp_server:latest

Trace Attributes

MCP Tool Call Spans

Each tool call creates a span following MCP and OpenTelemetry semantic conventions:

Span Name Format: {mcp.method.name} {target} (e.g., "tools/call resources_get")

Attributes:

  • mcp.method.name - MCP protocol method (e.g., "tools/call") [Required]
  • gen_ai.tool.name - Name of the tool being called (e.g., "resources_get", "helm_install") [Required for tool calls]
  • gen_ai.operation.name - Set to "execute_tool" for tool calls [Recommended]
  • rpc.jsonrpc.version - JSON-RPC version (typically "2.0") [Recommended]
  • network.transport - Transport protocol: "pipe" for STDIO, "tcp" for HTTP [Recommended]
  • error.type - Error classification: "tool_error" for tool failures, "_OTHER" for other errors [Conditional]

HTTP Request Spans

HTTP requests create spans following OpenTelemetry HTTP semantic conventions:

Span Name Format: {METHOD} {path} (e.g., "POST /message")

Attributes:

  • http.request.method - Request method (GET, POST, etc.) [Required]
  • url.path - URL path [Required]
  • url.scheme - URL scheme (http or https) [Required]
  • server.address - Server host [Recommended]
  • network.protocol.name - Protocol name (http) [Recommended]
  • network.protocol.version - Protocol version (HTTP/1.1, HTTP/2) [Recommended]
  • client.address - Client IP address [Recommended]
  • http.route - Normalized route pattern (when different from path) [Conditional]
  • user_agent.original - User agent string (when present) [Conditional]
  • http.request.body.size - Request body size (when present) [Conditional]
  • http.response.status_code - Response status code [Required]
  • error.type - HTTP status code for 4xx/5xx responses [Conditional]

Note: HTTP spans only appear when running in HTTP mode. STDIO mode (Claude Code) only creates MCP tool call spans. The /healthz endpoint is not traced to reduce noise.

Stats Endpoint

When running in HTTP mode, the server exposes a /stats endpoint that returns real-time statistics as JSON:

curl http://localhost:8080/stats

Example response:

{
  "total_tool_calls": 42,
  "tool_call_errors": 2,
  "tool_calls_by_name": {
    "resources_list": 15,
    "pods_get": 12,
    "helm_list": 10,
    "resources_get": 5
  },
  "total_http_requests": 100,
  "http_requests_by_path": {
    "/mcp": 50,
    "/sse": 30,
    "/message": 20
  },
  "uptime_seconds": 3600.5
}

The stats endpoint is useful for:

  • Health monitoring and alerting
  • Quick debugging without a full observability stack
  • Integration with simple monitoring systems

Note: The /stats endpoint is only available in HTTP mode. In STDIO mode, use OTLP export for metrics.
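For simple alerting, the stats payload can be reduced to a few derived numbers. A minimal sketch using the field names from the example response above (`summarize_stats` is a hypothetical helper, not part of the server):

```python
import json

def summarize_stats(payload: str) -> dict:
    """Compute derived health numbers from a /stats JSON payload."""
    stats = json.loads(payload)
    total = stats.get("total_tool_calls", 0)
    errors = stats.get("tool_call_errors", 0)
    by_name = stats.get("tool_calls_by_name", {})
    return {
        "error_rate": errors / total if total else 0.0,
        "calls_per_second": total / stats["uptime_seconds"],
        "busiest_tool": max(by_name, key=by_name.get) if by_name else None,
    }
```

Pipe the output of `curl http://localhost:8080/stats` into a script like this to get a quick error-rate check without a full observability stack.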

Metrics Endpoint

When running in HTTP mode, the server exposes a /metrics endpoint for Prometheus scraping:

curl http://localhost:8080/metrics

This endpoint returns metrics in OpenMetrics/Prometheus text format, suitable for scraping by Prometheus or compatible systems.

Available Metrics

Metric                          Type        Description
k8s_mcp_tool_calls_total        Counter     Total MCP tool calls (labeled by tool_name)
k8s_mcp_tool_errors_total       Counter     Total MCP tool errors (labeled by tool_name)
k8s_mcp_tool_duration_seconds   Histogram   Tool call duration in seconds
k8s_mcp_http_requests_total     Counter     HTTP requests (labeled by http_request_method, url_path, http_response_status_class)
k8s_mcp_server_info             Gauge       Server info (labeled by version, go_version)
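With these metric names, typical Prometheus queries might look like the following. These are illustrative; the exact label values (such as the status-class format) depend on what your scrape actually produces:

```promql
# Tool error rate over the last 5 minutes
sum(rate(k8s_mcp_tool_errors_total[5m])) / sum(rate(k8s_mcp_tool_calls_total[5m]))

# 95th percentile tool call latency
histogram_quantile(0.95, sum(rate(k8s_mcp_tool_duration_seconds_bucket[5m])) by (le))

# HTTP request rate by path
sum(rate(k8s_mcp_http_requests_total[5m])) by (url_path)
```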

Prometheus Scrape Configuration

scrape_configs:
  - job_name: 'kubernetes-mcp-server'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics

Kubernetes ServiceMonitor

When deployed in Kubernetes with the Helm chart, enable the ServiceMonitor:

metrics:
  serviceMonitor:
    enabled: true
    interval: 30s

Note: The /metrics endpoint is only available in HTTP mode.

Troubleshooting

Tracing not working?

  1. Check endpoint is set:

    echo $OTEL_EXPORTER_OTLP_ENDPOINT
  2. Check server logs (increase verbosity):

    # Look for "OpenTelemetry tracing initialized successfully"
    kubernetes-mcp-server -v 2

    If tracing fails to initialize, you'll see:

    Failed to create OTLP exporter, tracing disabled: <error details>
    
  3. Verify OTLP collector is reachable:

    # For gRPC endpoint (port 4317)
    telnet localhost 4317
    
    # For HTTP endpoint (port 4318)
    curl http://localhost:4318/v1/traces

No traces appearing in backend?

  1. Check sampling - you might be sampling at 0% or using always_off:

    echo $OTEL_TRACES_SAMPLER
    echo $OTEL_TRACES_SAMPLER_ARG
  2. Verify service name:

    echo $OTEL_SERVICE_NAME

    Search for this service name in your tracing UI (defaults to "kubernetes-mcp-server").

  3. Check backend configuration - ensure your OTLP collector is forwarding to the right backend.

  4. Verify protocol compatibility:

    • If using HTTP-based backends, ensure you set OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
    • Check if you need port 4317 (gRPC) or 4318 (HTTP)

TLS/Certificate Issues

If using HTTPS/secure endpoints:

  1. Certificate errors:

    # Provide custom CA certificate
    export OTEL_EXPORTER_OTLP_CERTIFICATE=/path/to/ca.crt
  2. Self-signed certificates:

    # For testing only - not recommended for production
    export OTEL_EXPORTER_OTLP_INSECURE=true

Performance Impact

Tracing has minimal performance overhead:

  • Middleware tracing: Typically 1-2ms per tool call
  • Network overhead: Spans are batched and exported every 5 seconds
  • Memory: Approximately 1-5MB for span buffers
  • CPU: Negligible (<1% for most workloads)

For production deployments with high traffic, use ratio-based sampling to reduce costs while maintaining observability.

Advanced Topics

Resource Detection

The OpenTelemetry SDK automatically detects and adds resource attributes from the environment:

  • Host information: hostname, OS, architecture
  • Process information: PID, executable name
  • Container information: container ID (when running in containers)
  • Kubernetes information: pod name, namespace (when K8s env vars are present)

These are merged with any attributes you set via OTEL_RESOURCE_ATTRIBUTES.

Distributed Tracing

When the kubernetes-mcp-server is part of a distributed system:

  1. Parent spans are automatically detected and respected
  2. Trace context is propagated via standard W3C Trace Context headers
  3. Sampling decisions from parent spans are inherited (via ParentBased sampler)

This means traces can span multiple services seamlessly.
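The W3C Trace Context `traceparent` header carries four dash-separated fields: version, trace ID, parent span ID, and trace flags. A small sketch of building and validating one (these helpers are illustrative; real propagation is handled by the OpenTelemetry SDK):

```python
import re
from typing import Optional

# version(2 hex) - trace_id(32 hex) - parent_id(16 hex) - flags(2 hex)
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def build_traceparent(trace_id: str, parent_id: str, sampled: bool) -> str:
    """Assemble a version-00 W3C traceparent header value."""
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header's fields, or None if it is malformed."""
    m = TRACEPARENT_RE.match(header)
    return m.groupdict() if m else None
```

The low bit of the flags field is the sampled flag, which is how a parent's sampling decision reaches downstream services.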

Custom Resource Attributes

Add custom attributes to help identify and filter traces:

export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=staging,team=platform,region=us-west-2,version=v1.2.3"

These attributes appear on all spans from this service instance and are useful for:

  • Filtering traces by environment (prod vs staging)
  • Analyzing performance by region or deployment
  • Tracking issues to specific versions or teams
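The OTEL_RESOURCE_ATTRIBUTES value is a comma-separated list of key=value pairs. Parsing it is straightforward; a sketch (not the SDK's actual parser):

```python
def parse_resource_attributes(value: str) -> dict:
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string into a dict."""
    attrs = {}
    for pair in value.split(","):
        if "=" in pair:
            key, _, val = pair.partition("=")
            attrs[key.strip()] = val.strip()
    return attrs
```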