
OpenTelemetry Observability

The kubernetes-mcp-server supports distributed tracing and metrics via OpenTelemetry (OTEL). Observability is optional and disabled by default.

What Gets Traced

The server automatically traces all operations through middleware without requiring any code changes to individual tools:

  1. MCP Tool Calls - Every tool invocation with details:

    • Tool name
    • Success/failure status
    • Duration
    • Error details (when applicable)
  2. HTTP Requests - All HTTP endpoints when running in HTTP mode:

    • Request method and path
    • Response status
    • Client information
    • Duration

Note: When running in STDIO mode, only MCP tool calls are traced, since there is no HTTP server.

Metrics

The server collects and exposes metrics through two mechanisms:

  1. Stats Endpoint (/stats) - JSON endpoint for real-time statistics:

    • Tool call counts by name
    • Tool call errors
    • HTTP request counts by method/path/status
    • Server uptime
  2. OTLP Export - When an endpoint is configured, metrics are also exported to your OTLP backend every 30 seconds.

Quick Start

1. Run an OTLP Backend Locally

Option A: Jaeger (traces only)

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.io/jaegertracing/all-in-one:latest

Access the Jaeger UI at http://localhost:16686

Note: Jaeger only supports traces, not metrics. To disable metrics export and avoid warnings about MetricsService being unimplemented, set OTEL_METRICS_EXPORTER=none.

Option B: Grafana LGTM Stack (traces + metrics + logs)

For full observability with metrics support:

docker run -d --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  docker.io/grafana/otel-lgtm:latest

Access Grafana at http://localhost:3000 (default credentials: admin/admin)

2. Enable Tracing

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Run the server
npx -y kubernetes-mcp-server@latest

3. View Traces

Make some tool calls through your MCP client, then view traces in the Jaeger UI.

Example Trace

When you call resources_get for a Pod, you'll see a trace like this in Jaeger:

Trace ID: abc123def456789
Duration: 145ms

└─ tools/call resources_get [145ms]
   ├─ mcp.method.name: tools/call
   ├─ gen_ai.tool.name: resources_get
   ├─ gen_ai.operation.name: execute_tool
   ├─ rpc.jsonrpc.version: 2.0
   ├─ network.transport: pipe
   └─ Status: OK

If the tool call triggers an HTTP request (in HTTP mode), you'll also see:

Trace ID: abc123def456789
Duration: 150ms

├─ POST /message [150ms]
│  ├─ http.request.method: POST
│  ├─ url.path: /message
│  ├─ http.response.status_code: 200
│  ├─ client.address: 192.168.1.100
│  │
│  └─ tools/call resources_get [145ms]
│     ├─ mcp.method.name: tools/call
│     ├─ gen_ai.tool.name: resources_get
│     ├─ gen_ai.operation.name: execute_tool
│     ├─ rpc.jsonrpc.version: 2.0
│     ├─ network.transport: tcp
│     └─ Status: OK

Configuration

OpenTelemetry can be configured via TOML config file or environment variables. Environment variables take precedence over TOML config values.

Note: Telemetry is automatically enabled when an endpoint is configured. Use enabled = false in TOML to explicitly disable it.

Configuration Reference

TOML Field           Environment Variable           Description
enabled              -                              Explicit enable/disable (overrides all)
endpoint             OTEL_EXPORTER_OTLP_ENDPOINT    OTLP endpoint URL
protocol             OTEL_EXPORTER_OTLP_PROTOCOL    Protocol: grpc or http/protobuf
traces_sampler       OTEL_TRACES_SAMPLER            Sampling strategy
traces_sampler_arg   OTEL_TRACES_SAMPLER_ARG        Sampling ratio (0.0-1.0)
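The precedence rules (environment variables over TOML, with an endpoint auto-enabling telemetry unless `enabled = false` is set) can be sketched as follows. This is a minimal illustration of the resolution logic, not the server's actual implementation; `resolve_telemetry_config` is a hypothetical helper:

```python
import os

def resolve_telemetry_config(toml_config: dict) -> dict:
    """Merge TOML telemetry settings with environment variables.

    Environment variables win over TOML values. Telemetry is
    auto-enabled when an endpoint is present, unless `enabled`
    was set explicitly in the TOML config.
    """
    env_map = {
        "endpoint": "OTEL_EXPORTER_OTLP_ENDPOINT",
        "protocol": "OTEL_EXPORTER_OTLP_PROTOCOL",
        "traces_sampler": "OTEL_TRACES_SAMPLER",
        "traces_sampler_arg": "OTEL_TRACES_SAMPLER_ARG",
    }
    resolved = dict(toml_config)
    for field, var in env_map.items():
        if os.environ.get(var):  # env var takes precedence over TOML
            resolved[field] = os.environ[var]
    # Auto-enable when an endpoint is configured, unless explicitly set.
    if "enabled" not in resolved:
        resolved["enabled"] = bool(resolved.get("endpoint"))
    return resolved
```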

TOML Configuration

Add a [telemetry] section to your config file:

[telemetry]
# Optional: explicitly enable/disable (omit to auto-enable when endpoint is set)
enabled = true

endpoint = "http://localhost:4317"

# Protocol: "grpc" (default) or "http/protobuf"
protocol = "grpc"

# Trace sampling strategy
# Options: "always_on", "always_off", "traceidratio", "parentbased_always_on", "parentbased_always_off", "parentbased_traceidratio"
traces_sampler = "traceidratio"

# Sampling ratio for ratio-based samplers (0.0 to 1.0)
traces_sampler_arg = 0.1

TOML Examples

Enable with endpoint:

[telemetry]
endpoint = "http://localhost:4317"

Production with sampling:

[telemetry]
endpoint = "http://tempo-distributor:4317"
traces_sampler = "traceidratio"
traces_sampler_arg = 0.05  # 5% sampling

Explicitly disable:

[telemetry]
enabled = false

Environment Variables

Environment variables take precedence over TOML config. This allows you to override config file settings at runtime.

Endpoint

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

Note: The server gracefully handles failures. If the endpoint is unreachable, the server logs a warning and continues without tracing.

Optional Variables

# Service name (defaults to "kubernetes-mcp-server")
export OTEL_SERVICE_NAME=kubernetes-mcp-server

# Service version (auto-detected from binary, rarely needs manual override)
export OTEL_SERVICE_VERSION=1.0.0

# Additional resource attributes (useful for multi-environment deployments)
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,team=platform"

Endpoint Protocols

The server supports both gRPC and HTTP/protobuf protocols:

# gRPC (default, port 4317)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# HTTP/protobuf (port 4318)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

# Secure endpoints (HTTPS/gRPC with TLS)
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-secure.example.com:4317

# Custom CA certificate (for self-signed certificates)
export OTEL_EXPORTER_OTLP_CERTIFICATE=/path/to/ca.crt

Sampling Configuration

By default, the server uses ParentBased(AlwaysSample) sampling:

  • Root spans (no parent): Always sampled (100%)
  • Child spans: Inherit parent's sampling decision

This is ideal for development but may generate high trace volumes in production.

Production Sampling

For production with high traffic, use ratio-based sampling:

# Sample 10% of traces
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1

Available Samplers

  • always_on - Sample everything (default for root spans)
  • always_off - Disable tracing entirely
  • traceidratio - Sample a percentage (requires OTEL_TRACES_SAMPLER_ARG between 0.0 and 1.0)
  • parentbased_always_on - Respect parent span, default to always_on
  • parentbased_always_off - Respect parent span, default to always_off
  • parentbased_traceidratio - Respect parent span, default to ratio
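What makes traceidratio useful in distributed systems is that the decision is derived from the trace ID itself, so every service seeing the same trace reaches the same verdict. A sketch of the idea (real SDKs differ in exactly which bytes of the ID they compare; this is illustrative, not the server's code):

```python
def trace_id_ratio_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministically decide sampling from a 32-hex-char trace ID.

    A fixed portion of the trace ID is compared against a threshold
    derived from the ratio, so the decision is stable per trace.
    """
    # Interpret the low 8 bytes of the trace ID as a 63-bit value.
    low = int(trace_id_hex[16:], 16) >> 1
    threshold = int(ratio * (1 << 63))
    return low < threshold
```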

Sampling Examples

# Development: Sample everything
export OTEL_TRACES_SAMPLER=always_on

# Production: 5% sampling (good for high-traffic services)
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.05

# Temporarily disable tracing
export OTEL_TRACES_SAMPLER=always_off

# Or just unset the endpoint
unset OTEL_EXPORTER_OTLP_ENDPOINT

Deployment Examples

Claude Code (STDIO Mode)

Add the MCP server to your project's .mcp.json or global ~/.claude/settings.json:

{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "kubernetes-mcp-server@latest"],
      "env": {
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
        "OTEL_TRACES_SAMPLER": "always_on"
      }
    }
  }
}

For Jaeger (traces only): Add "OTEL_METRICS_EXPORTER": "none" to disable metrics export.

Note: In STDIO mode, only MCP tool calls are traced (no HTTP request spans).

Kubernetes Deployment (HTTP Mode)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-mcp-server
spec:
  template:
    spec:
      containers:
      - name: kubernetes-mcp-server
        image: quay.io/containers/kubernetes_mcp_server:latest
        env:
        # OTLP endpoint (required to enable tracing)
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://tempo-distributor.observability:4317"

        # Sampling (recommended for production)
        - name: OTEL_TRACES_SAMPLER
          value: "traceidratio"
        - name: OTEL_TRACES_SAMPLER_ARG
          value: "0.1"  # 10% sampling

        # Resource attributes (helps identify this deployment)
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "deployment.environment=production,k8s.cluster.name=prod-us-west-2"

        # Kubernetes metadata (optional, helps correlate traces with K8s resources)
        - name: KUBERNETES_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName

Note: The Kubernetes metadata environment variables are optional but recommended for production deployments. They help correlate traces with specific pods, namespaces, and nodes.

Docker

docker run \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4317 \
  -e OTEL_TRACES_SAMPLER=always_on \
  quay.io/containers/kubernetes_mcp_server:latest

Trace Attributes

MCP Tool Call Spans

Each tool call creates a span following MCP and OpenTelemetry semantic conventions:

Span Name Format: {mcp.method.name} {target} (e.g., "tools/call resources_get")

Attributes:

  • mcp.method.name - MCP protocol method (e.g., "tools/call") [Required]
  • gen_ai.tool.name - Name of the tool being called (e.g., "resources_get", "helm_install") [Required for tool calls]
  • gen_ai.operation.name - Set to "execute_tool" for tool calls [Recommended]
  • rpc.jsonrpc.version - JSON-RPC version (typically "2.0") [Recommended]
  • network.transport - Transport protocol: "pipe" for STDIO, "tcp" for HTTP [Recommended]
  • error.type - Error classification: "tool_error" for tool failures, "_OTHER" for other errors [Conditional]

HTTP Request Spans

HTTP requests create spans following OpenTelemetry HTTP semantic conventions:

Span Name Format: {METHOD} {path} (e.g., "POST /message")

Attributes:

  • http.request.method - Request method (GET, POST, etc.) [Required]
  • url.path - URL path [Required]
  • url.scheme - URL scheme (http or https) [Required]
  • server.address - Server host [Recommended]
  • network.protocol.name - Protocol name (http) [Recommended]
  • network.protocol.version - Protocol version (HTTP/1.1, HTTP/2) [Recommended]
  • client.address - Client IP address [Recommended]
  • http.route - Normalized route pattern (when different from path) [Conditional]
  • user_agent.original - User agent string (when present) [Conditional]
  • http.request.body.size - Request body size (when present) [Conditional]
  • http.response.status_code - Response status code [Required]
  • error.type - HTTP status code for 4xx/5xx responses [Conditional]

Note: HTTP spans only appear when running in HTTP mode. STDIO mode (Claude Code) only creates MCP tool call spans. The /healthz endpoint is not traced to reduce noise.

Stats Endpoint

When running in HTTP mode, the server exposes a /stats endpoint that returns real-time statistics as JSON:

curl http://localhost:8080/stats

Example response:

{
  "total_tool_calls": 42,
  "tool_call_errors": 2,
  "tool_calls_by_name": {
    "resources_list": 15,
    "pods_get": 12,
    "helm_list": 10,
    "resources_get": 5
  },
  "total_http_requests": 100,
  "http_requests_by_path": {
    "/mcp": 50,
    "/sse": 30,
    "/message": 20
  },
  "uptime_seconds": 3600.5
}

The stats endpoint is useful for:

  • Health monitoring and alerting
  • Quick debugging without a full observability stack
  • Integration with simple monitoring systems

Note: The /stats endpoint is only available in HTTP mode. In STDIO mode, use OTLP export for metrics.
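For simple alerting, the stats payload can be reduced to a few derived numbers. A minimal sketch using the field names from the example response above (`summarize_stats` is a hypothetical helper, not part of the server):

```python
import json

def summarize_stats(payload: str) -> dict:
    """Compute derived health numbers from a /stats JSON payload."""
    stats = json.loads(payload)
    total = stats.get("total_tool_calls", 0)
    errors = stats.get("tool_call_errors", 0)
    by_name = stats.get("tool_calls_by_name", {})
    return {
        "error_rate": errors / total if total else 0.0,
        "calls_per_second": total / stats["uptime_seconds"],
        "busiest_tool": max(by_name, key=by_name.get) if by_name else None,
    }
```

Pipe the output of `curl http://localhost:8080/stats` into a script like this to get a quick error-rate check without a full observability stack.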

Metrics Endpoint

When running in HTTP mode, the server exposes a /metrics endpoint for Prometheus scraping:

curl http://localhost:8080/metrics

This endpoint returns metrics in OpenMetrics/Prometheus text format, suitable for scraping by Prometheus or compatible systems.

Available Metrics

Metric                          Type        Description
k8s_mcp_tool_calls_total        Counter     Total MCP tool calls (labeled by tool_name)
k8s_mcp_tool_errors_total       Counter     Total MCP tool errors (labeled by tool_name)
k8s_mcp_tool_duration_seconds   Histogram   Tool call duration in seconds
k8s_mcp_http_requests_total     Counter     HTTP requests (labeled by http_request_method, url_path, http_response_status_class)
k8s_mcp_server_info             Gauge       Server info (labeled by version, go_version)
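With these metric names, typical Prometheus queries might look like the following. These are illustrative; the exact label values (such as the status-class format) depend on what your scrape actually produces:

```promql
# Tool error rate over the last 5 minutes
sum(rate(k8s_mcp_tool_errors_total[5m])) / sum(rate(k8s_mcp_tool_calls_total[5m]))

# 95th percentile tool call latency
histogram_quantile(0.95, sum(rate(k8s_mcp_tool_duration_seconds_bucket[5m])) by (le))

# HTTP request rate by path
sum(rate(k8s_mcp_http_requests_total[5m])) by (url_path)
```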

Prometheus Scrape Configuration

scrape_configs:
  - job_name: 'kubernetes-mcp-server'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics

Kubernetes ServiceMonitor

When deployed in Kubernetes with the Helm chart, enable the ServiceMonitor:

metrics:
  serviceMonitor:
    enabled: true
    interval: 30s

Note: The /metrics endpoint is only available in HTTP mode.

Troubleshooting

Tracing not working?

  1. Check endpoint is set:

    echo $OTEL_EXPORTER_OTLP_ENDPOINT
  2. Check server logs (increase verbosity):

    # Look for "OpenTelemetry tracing initialized successfully"
    kubernetes-mcp-server -v 2

    If tracing fails to initialize, you'll see:

    Failed to create OTLP exporter, tracing disabled: <error details>
    
  3. Verify OTLP collector is reachable:

    # For gRPC endpoint (port 4317)
    telnet localhost 4317
    
    # For HTTP endpoint (port 4318)
    curl http://localhost:4318/v1/traces

No traces appearing in backend?

  1. Check sampling - you might be sampling at 0% or using always_off:

    echo $OTEL_TRACES_SAMPLER
    echo $OTEL_TRACES_SAMPLER_ARG
  2. Verify service name:

    echo $OTEL_SERVICE_NAME

    Search for this service name in your tracing UI (defaults to "kubernetes-mcp-server").

  3. Check backend configuration - ensure your OTLP collector is forwarding to the right backend.

  4. Verify protocol compatibility:

    • If using HTTP-based backends, ensure you set OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
    • Check if you need port 4317 (gRPC) or 4318 (HTTP)

TLS/Certificate Issues

If using HTTPS/secure endpoints:

  1. Certificate errors:

    # Provide custom CA certificate
    export OTEL_EXPORTER_OTLP_CERTIFICATE=/path/to/ca.crt
  2. Self-signed certificates:

    # For testing only - not recommended for production
    export OTEL_EXPORTER_OTLP_INSECURE=true

Performance Impact

Tracing has minimal performance overhead:

  • Middleware tracing: Typically 1-2ms per tool call
  • Network overhead: Spans are batched and exported every 5 seconds
  • Memory: Approximately 1-5MB for span buffers
  • CPU: Negligible (<1% for most workloads)

For production deployments with high traffic, use ratio-based sampling to reduce costs while maintaining observability.

Advanced Topics

Resource Detection

The OpenTelemetry SDK automatically detects and adds resource attributes from the environment:

  • Host information: hostname, OS, architecture
  • Process information: PID, executable name
  • Container information: container ID (when running in containers)
  • Kubernetes information: pod name, namespace (when K8s env vars are present)

These are merged with any attributes you set via OTEL_RESOURCE_ATTRIBUTES.

Distributed Tracing

When the kubernetes-mcp-server is part of a distributed system:

  1. Parent spans are automatically detected and respected
  2. Trace context is propagated via standard W3C Trace Context headers
  3. Sampling decisions from parent spans are inherited (via ParentBased sampler)

This means traces can span multiple services seamlessly.
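The W3C Trace Context `traceparent` header carries four dash-separated fields: version, trace ID, parent span ID, and trace flags. A small sketch of building and validating one (these helpers are illustrative; real propagation is handled by the OpenTelemetry SDK):

```python
import re
from typing import Optional

# version(2 hex) - trace_id(32 hex) - parent_id(16 hex) - flags(2 hex)
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def build_traceparent(trace_id: str, parent_id: str, sampled: bool) -> str:
    """Assemble a version-00 W3C traceparent header value."""
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header's fields, or None if it is malformed."""
    m = TRACEPARENT_RE.match(header)
    return m.groupdict() if m else None
```

The low bit of the flags field is the sampled flag, which is how a parent's sampling decision reaches downstream services.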

Custom Resource Attributes

Add custom attributes to help identify and filter traces:

export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=staging,team=platform,region=us-west-2,version=v1.2.3"

These attributes appear on all spans from this service instance and are useful for:

  • Filtering traces by environment (prod vs staging)
  • Analyzing performance by region or deployment
  • Tracking issues to specific versions or teams
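The OTEL_RESOURCE_ATTRIBUTES value is a comma-separated list of key=value pairs. Parsing it is straightforward; a sketch (not the SDK's actual parser):

```python
def parse_resource_attributes(value: str) -> dict:
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string into a dict."""
    attrs = {}
    for pair in value.split(","):
        if "=" in pair:
            key, _, val = pair.partition("=")
            attrs[key.strip()] = val.strip()
    return attrs
```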