
Telemetry


Runtime Telemetry (v6.4)

CKB integrates with OpenTelemetry to answer the question static analysis can't: "Is this code actually used in production?"

By ingesting runtime metrics, CKB can:

  • Detect dead code with high confidence
  • Show actual call counts for any symbol
  • Enrich impact analysis with observed callers
  • Distinguish between "no static references" and "truly unused"

Quick Start

1. Enable Telemetry

Add to .ckb/config.json:

{
  "telemetry": {
    "enabled": true,
    "serviceMap": {
      "my-api-service": "my-repo"
    }
  }
}

2. Configure Your OpenTelemetry Collector

Point your collector's exporter at CKB:

# otel-collector-config.yaml
exporters:
  otlphttp:
    endpoint: "http://localhost:9120/v1/metrics"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
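
If your services aren't already exporting metrics, they need to send OTLP to the collector. Below is a minimal sketch using the OpenTelemetry Go SDK; it is not CKB-specific and assumes the collector's default OTLP/HTTP receiver on localhost:4318. The service name should match a key in your serviceMap ("my-api-service" here).

package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func main() {
	ctx := context.Background()

	// Export OTLP metrics to the collector's default OTLP/HTTP receiver.
	exp, err := otlpmetrichttp.New(ctx,
		otlpmetrichttp.WithEndpoint("localhost:4318"),
		otlpmetrichttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// service.name is what CKB's serviceMap keys on.
	res, err := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceNameKey.String("my-api-service")),
	)
	if err != nil {
		log.Fatal(err)
	}

	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithResource(res),
		sdkmetric.WithReader(
			sdkmetric.NewPeriodicReader(exp, sdkmetric.WithInterval(30*time.Second)),
		),
	)
	defer mp.Shutdown(ctx)

	// ... otel.SetMeterProvider(mp) and instrument your handlers as usual ...
}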

3. Verify It's Working

ckb telemetry status

You should see:

Telemetry Status
  Enabled: true
  Last Sync: 2 minutes ago

Coverage:
  Symbol: 78% (23,456 of 30,123 symbols have telemetry)
  Service: 100% (3 of 3 repos mapped)

Coverage Level: HIGH

Coverage Levels

CKB requires sufficient telemetry coverage to enable certain features:

Coverage       Symbol %   What's Available
High           ≥ 70%      Full dead code detection, high-confidence verdicts
Medium         40-69%     Dead code detection with caveats, observed usage
Low            10-39%     Basic observed usage only
Insufficient   < 10%      Telemetry features disabled

Check your coverage:

ckb telemetry status
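
Expressed as code, the thresholds above amount to a simple classifier. This is only a sketch; the function is not part of CKB.

// Illustrative only: maps symbol coverage (0.0-1.0) to the documented levels.
func coverageLevel(symbolCoverage float64) string {
	switch {
	case symbolCoverage >= 0.70:
		return "HIGH"
	case symbolCoverage >= 0.40:
		return "MEDIUM"
	case symbolCoverage >= 0.10:
		return "LOW"
	default:
		return "INSUFFICIENT"
	}
}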

Finding Dead Code

With telemetry enabled, you can find code that's never called in production:

# Find dead code candidates (requires medium+ coverage)
ckb dead-code --min-confidence 0.7

# Scope to a specific module
ckb dead-code --scope internal/legacy

# Include low-confidence results
ckb dead-code --min-confidence 0.5

Understanding Confidence

Dead code confidence combines:

  • Static analysis — No references found in code
  • Telemetry — Zero calls observed over the period
  • Match quality — How well telemetry maps to symbols

Confidence   Meaning
0.9+         High confidence dead code — safe to remove
0.7-0.9      Likely dead — verify before removing
0.5-0.7      Possibly dead — investigate further
< 0.5        Uncertain — may have dynamic callers
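
CKB doesn't document the exact weighting, so the sketch below is only a hypothetical illustration of how the three signals might combine into one score; it is not CKB's actual formula.

// Hypothetical illustration; not CKB's actual formula.
type deadCodeEvidence struct {
	StaticRefs    int     // references found by static analysis
	ObservedCalls int64   // calls seen in telemetry over the period
	MatchQuality  float64 // 1.0 = exact, lower for fuzzy matches
}

func deadCodeConfidence(e deadCodeEvidence) float64 {
	if e.StaticRefs > 0 || e.ObservedCalls > 0 {
		return 0 // something references or calls it: not a dead-code candidate
	}
	// No static references and zero observed calls: confidence is capped by
	// how well telemetry could be matched to the symbol in the first place.
	base := 0.6                       // static evidence alone
	return base + 0.4*e.MatchQuality  // an exact match pushes toward 1.0
}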

Checking Symbol Usage

Get observed usage for any symbol:

ckb telemetry usage --symbol "internal/api/handler.go:HandleRequest"

Output:

Symbol: HandleRequest
Period: Last 90 days

Observed Usage:
  Total Calls: 1,247,832
  Daily Average: 13,864
  Trend: stable

Match Quality: exact
Last Seen: 2 hours ago

Match Quality Levels

Quality   Meaning
exact     Symbol name matches telemetry span exactly
strong    High-confidence fuzzy match
weak      Low-confidence match — verify manually

Service Mapping

CKB needs to know which telemetry service corresponds to which repository.

Explicit Mapping (Recommended)

{
  "telemetry": {
    "serviceMap": {
      "api-gateway": "api-repo",
      "user-service": "users-repo",
      "payment-worker": "payments-repo"
    }
  }
}

Resolution Order

  1. Explicit serviceMap entry
  2. ckb_repo_id attribute in telemetry payload
  3. Service name matches repo name exactly
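
The same order, sketched in Go (function and parameter names are illustrative, not CKB internals):

// Illustrative sketch of the resolution order described above.
// serviceMap comes from .ckb/config.json; attrs are the resource attributes
// on the incoming telemetry payload; repoNames are the known repo IDs.
func resolveRepo(service string, attrs map[string]string,
	serviceMap map[string]string, repoNames map[string]bool) (string, bool) {

	// 1. Explicit serviceMap entry
	if repo, ok := serviceMap[service]; ok {
		return repo, true
	}
	// 2. ckb_repo_id attribute in the telemetry payload
	if repo, ok := attrs["ckb_repo_id"]; ok {
		return repo, true
	}
	// 3. Service name matches a repo name exactly
	if repoNames[service] {
		return service, true
	}
	return "", false // unmapped; will show up in `ckb telemetry unmapped`
}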

Debugging Unmapped Services

# See which services aren't mapped
ckb telemetry unmapped

# Test if a service name would map
ckb telemetry test-map "my-service-name"

Telemetry-Enhanced Impact Analysis

When telemetry is enabled, analyzeImpact includes observed callers:

# Via CLI
ckb impact <symbol-id> --include-telemetry

# Via MCP
analyzeImpact({ symbolId: "...", includeTelemetry: true })

The response includes:

  • observedCallers — Services that call this symbol at runtime
  • blendedConfidence — Combines static + observed confidence
  • observedOnly — Callers found only via telemetry (not in code)

This catches cases where static analysis misses dynamic dispatch or reflection.
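
As an illustration, the telemetry portion of a response could look like the fragment below. The three field names come from the list above; the values and nested structure are invented for this example and may differ from CKB's actual output.

{
  "observedCallers": [
    { "service": "api-gateway", "callCount": 52310 }
  ],
  "blendedConfidence": 0.91,
  "observedOnly": [
    { "service": "payment-worker" }
  ]
}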


Configuration Reference

Full configuration options:

{
  "telemetry": {
    "enabled": true,

    "serviceMap": {
      "service-name": "repo-id"
    },

    "storage": {
      "retention_days": 365,
      "aggregation_interval": "1h",
      "max_symbols_per_service": 100000
    },

    "privacy": {
      "hash_service_names": false,
      "redact_symbol_names": false
    }
  }
}

Setting                       Default   Description
enabled                       false     Enable telemetry features
serviceMap                    {}        Maps service names to repo IDs
storage.retention_days        365       Days to retain telemetry data
storage.aggregation_interval  "1h"      How often to aggregate metrics
privacy.hash_service_names    false     Hash service names for privacy

CLI Commands

# Check status and coverage
ckb telemetry status

# Get usage for a symbol
ckb telemetry usage --symbol "pkg/handler.go:HandleRequest"

# List unmapped services
ckb telemetry unmapped

# Test service name mapping
ckb telemetry test-map "my-service"

# Find dead code
ckb dead-code [--min-confidence 0.7] [--scope module]

MCP Tools

Tool                    Purpose
getTelemetryStatus      Coverage metrics and sync status
getObservedUsage        Runtime usage for a symbol
findDeadCodeCandidates  Symbols with zero runtime calls

Enhanced tools:

  • analyzeImpact — Add includeTelemetry: true for observed callers
  • getHotspots — Includes observedUsage when telemetry enabled

HTTP API

# Get status
curl http://localhost:8080/telemetry/status

# Get symbol usage
curl "http://localhost:8080/telemetry/usage/SYMBOL_ID?period=30d"

# Find dead code
curl "http://localhost:8080/telemetry/dead-code?minConfidence=0.7"

# List unmapped services
curl http://localhost:8080/telemetry/unmapped

# OTLP ingest endpoint (for collectors)
POST http://localhost:9120/v1/metrics

Troubleshooting

"Telemetry not enabled"

Add to .ckb/config.json:

{ "telemetry": { "enabled": true } }

"Coverage insufficient"

Your instrumentation may not cover enough symbols. Check:

  • Are all services sending telemetry?
  • Is serviceMap configured correctly?
  • Run ckb telemetry unmapped to find gaps

"No data for symbol"

Possible causes:

  • Symbol isn't called at runtime (it may actually be dead)
  • Service mapping is wrong
  • Telemetry span names don't match symbol names

Debug with:

ckb telemetry test-map "your-service-name"

High latency on telemetry queries

Reduce the retention window or use a coarser aggregation interval:

{
  "telemetry": {
    "storage": {
      "retention_days": 90,
      "aggregation_interval": "1d"
    }
  }
}

Best Practices

  1. Start with explicit serviceMap — Don't rely on auto-detection
  2. Check coverage before trusting dead-code — Medium+ coverage required
  3. Use 90-day periods — Catches infrequent code paths (monthly jobs, etc.)
  4. Verify before deleting — Even high-confidence dead code should be reviewed
  5. Monitor unmapped services — New services need to be added to serviceMap

Wide-Result Metrics (v7.4)

In addition to runtime telemetry, CKB tracks internal metrics for MCP wide-result tools. This helps identify which tools experience heavy truncation and may benefit from Frontier mode.

What's Tracked

For each wide-result tool invocation:

  • Tool name — findReferences, searchSymbols, analyzeImpact, getCallGraph, getHotspots, summarizePr
  • Total results — How many results were found
  • Returned results — How many were returned after truncation
  • Truncation count — How many were dropped
  • Response bytes — Actual JSON response size in bytes
  • Estimated tokens — Approximate token cost (~4 bytes per token)
  • Execution time — Latency in milliseconds

Metrics are stored in SQLite (.ckb/ckb.db) and persist across MCP sessions.
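
As a sketch, a single invocation record and the rule behind the needsFrontier flag (shown in the example output below) might look like this. Field and type names are illustrative, not CKB's actual schema.

package metrics

import "time"

// Illustrative record for one wide-result tool invocation (not CKB's schema).
type wideResultRecord struct {
	Tool       string    // e.g. "searchSymbols"
	Total      int       // results found
	Returned   int       // results returned after truncation
	Truncated  int       // Total - Returned
	RespBytes  int       // size of the marshaled JSON response
	EstTokens  int       // RespBytes / 4 (~4 bytes per token)
	LatencyMs  int64     // execution time in milliseconds
	RecordedAt time.Time // when the invocation was recorded
}

// needsFrontier mirrors the documented rule: flag a tool once more than
// 30% of its results are being truncated.
func needsFrontier(totalTruncated, totalResults int) bool {
	return totalResults > 0 && float64(totalTruncated)/float64(totalResults) > 0.30
}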

CLI: ckb metrics

# Last 7 days (default)
ckb metrics

# Last 30 days
ckb metrics --days=30

# Filter to specific tool
ckb metrics --tool=findReferences

# Human-readable format
ckb metrics --format=human

Example output:

{
  "period": "last 7 days",
  "since": "2025-12-16",
  "totalRecords": 847,
  "tools": [
    {
      "name": "searchSymbols",
      "queryCount": 312,
      "totalResults": 15234,
      "totalReturned": 8456,
      "totalTruncated": 6778,
      "truncationRate": 0.445,
      "totalBytes": 4780000,
      "avgBytes": 15321,
      "avgTokens": 3830,
      "avgLatencyMs": 125,
      "needsFrontier": true
    },
    {
      "name": "getCallGraph",
      "queryCount": 189,
      "totalResults": 2341,
      "totalReturned": 2341,
      "totalTruncated": 0,
      "truncationRate": 0,
      "totalBytes": 890000,
      "avgBytes": 4708,
      "avgTokens": 1177,
      "avgLatencyMs": 32,
      "needsFrontier": false
    }
  ]
}

The needsFrontier flag is true when truncation rate exceeds 30%.

Byte Tracking

Response bytes are measured by JSON-marshaling the response data before sending. This captures the actual payload size consumed by the LLM context window.
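
A sketch of that measurement, using the ~4 bytes per token approximation noted earlier (the function name is illustrative, not CKB's internal API):

package metrics

import "encoding/json"

// measureResponse marshals the response exactly as it will be sent and
// derives the payload size plus an estimated token cost (~4 bytes per token).
func measureResponse(resp any) (respBytes, estTokens int, err error) {
	payload, err := json.Marshal(resp)
	if err != nil {
		return 0, 0, err
	}
	respBytes = len(payload)
	estTokens = respBytes / 4
	return respBytes, estTokens, nil
}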

Typical response sizes observed:

Tool            Avg Response   Avg Tokens
searchSymbols   15-43 KB       4,000-11,000
getHotspots     20-40 KB       5,000-10,000
findReferences  8-15 KB        2,000-4,000
getCallGraph    5-10 KB        1,250-2,500
analyzeImpact   2-5 KB         500-1,250

This data helps measure the actual impact of Frontier mode by comparing bytes before/after pagination.

MCP Tool

getWideResultMetrics

Returns the same aggregated metrics via MCP. Useful for AI-driven analysis of tool performance.

Interpreting Results

Truncation Rate   Recommendation
< 10%             Tool is performing well, no action needed
10-30%            Monitor usage patterns
> 30%             Consider Frontier mode for this tool
> 50%             Frontier mode strongly recommended

Data Retention

Metrics are retained for 90 days by default. Old records are cleaned up automatically.

Architecture

RecordWideResult() → In-memory aggregator + SQLite persistence
                                    ↓
          ckb metrics CLI ← GetWideResultAggregates()
                                    ↓
       getWideResultMetrics MCP ← Same data via MCP

Persistence is non-blocking (async writes) to avoid impacting tool latency.
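
A minimal sketch of that non-blocking pattern, reusing the wideResultRecord type sketched earlier (illustrative; CKB's actual writer may differ): records are pushed onto a buffered channel and a background goroutine flushes them to SQLite, so tool calls never wait on disk I/O.

// Illustrative sketch of non-blocking persistence: RecordWideResult only
// enqueues; a background goroutine owns the SQLite writes.
type asyncWriter struct {
	queue chan wideResultRecord
}

func newAsyncWriter(persist func(wideResultRecord)) *asyncWriter {
	w := &asyncWriter{queue: make(chan wideResultRecord, 1024)}
	go func() {
		for rec := range w.queue {
			persist(rec) // e.g. an INSERT into .ckb/ckb.db
		}
	}()
	return w
}

func (w *asyncWriter) RecordWideResult(rec wideResultRecord) {
	select {
	case w.queue <- rec:
	default:
		// Queue full: drop the metric rather than block the tool call.
	}
}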

