[exporter/loadbalancing] Add routing key support for logs#46241

Open
szibis wants to merge 12 commits into open-telemetry:main from szibis:feature/logs-hashing-loadbalancer

Conversation


@szibis szibis commented Feb 20, 2026

Description

Implements proper routing_key support for logs in the loadbalancing exporter. Fully backward compatible — existing configurations continue to work without changes.

The problem: The previous log routing implementation used traceID-based hashing regardless of the configured routing_key. Since most logs don't carry trace IDs, this meant routing_key: "service" had no effect — logs were essentially randomly distributed across backends, making stateful downstream processing (throttling, rate-limiting, deduplication) per-service impossible.

The fix: Rewrite the log exporter to support configurable routing modes, mirroring the existing metrics exporter architecture:

| routing_key | Routing behavior |
|---|---|
| `service` (default) | Hash on `service.name` resource attribute |
| `traceID` | Hash on first log record's traceID (backward compatible) |
| `resource` | Hash on full resource identity (all resource attributes) |
| `attributes` | Hash on configurable attributes from resource, scope, or log record levels |

Backward compatibility: The traceID routing key is preserved for logs, so existing configurations with routing_key: "traceID" continue to work identically. The default routing key changes from implicit traceID to explicit service — this is the intended fix, as the old behavior was effectively random for most logs (since they lack traceIDs). Users who explicitly want the old behavior can set routing_key: "traceID".
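As a minimal configuration sketch (endpoint hostnames are placeholders; the rest follows the loadbalancing exporter's existing config schema), routing logs by service looks like this:

```yaml
exporters:
  loadbalancing:
    # New default for logs; set "traceID" to keep the old behavior.
    routing_key: "service"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - collector-0.collector-headless:4317
          - collector-1.collector-headless:4317
```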

Use case — Per-service log processing with stateful backends:

A common production pattern is collecting logs via filelog across nodes, then routing through a loadbalancing exporter to a fleet of stateful backend collectors (e.g. OTel Collector aggregators in a StatefulSet) for per-service processing like rate limiting, log reduction, or tail-based sampling. Without proper routing, the same service's logs scatter across all backend pods, making per-service processing ineffective (you get per-pod limits instead of per-service limits).

```mermaid
flowchart LR
    subgraph K8s Nodes
        FL1[filelog receiver<br/>Node 1]
        FL2[filelog receiver<br/>Node 2]
        FL3[filelog receiver<br/>Node 3]
    end

    subgraph Gateway Collectors
        LB1[loadbalancing exporter<br/>routing_key: service]
        LB2[loadbalancing exporter<br/>routing_key: service]
        LB3[loadbalancing exporter<br/>routing_key: service]
    end

    subgraph Aggregator Collectors - StatefulSet
        A0[collector-0<br/>rate-limit: svc-a, svc-b]
        A1[collector-1<br/>rate-limit: svc-c, svc-d]
        A2[collector-2<br/>rate-limit: svc-e, svc-f]
    end

    FL1 --> LB1
    FL2 --> LB2
    FL3 --> LB3

    LB1 -->|svc-a logs| A0
    LB1 -->|svc-c logs| A1
    LB2 -->|svc-a logs| A0
    LB2 -->|svc-d logs| A1
    LB3 -->|svc-b logs| A0
    LB3 -->|svc-f logs| A2

    style A0 fill:#2d6,stroke:#333,color:#fff
    style A1 fill:#26d,stroke:#333,color:#fff
    style A2 fill:#d62,stroke:#333,color:#fff
```

Key property: All gateway collectors consistently route svc-a logs to the same aggregator pod (via consistent hashing), regardless of which node collected them. This makes per-service rate limiting effective because a single aggregator sees 100% of a given service's logs.

Before this PR: routing_key: "service" was silently ignored for logs — traffic was routed by traceID (effectively random for most logs), causing scattered delivery as reported in the linked issue.

After this PR: Logs are consistently routed by service.name (or resource identity, or custom attributes), enabling reliable per-service stateful processing.


Benchmark Results

All benchmarks run on Apple M3 Max, -benchmem -benchtime=3s.

Log Routing: New (service) vs Old (traceID) Baseline

| Scenario | Version | ns/op | B/op | allocs/op |
|---|---|---|---|---|
| 5E_1RL_100L | Old (traceID) | 11,341 | 17,282 | 225 |
| | New (service) | 10,406 | 16,327 | 219 |
| | Δ | -8.2% | -5.5% | -2.7% |
| 5E_1RL_1000L | Old (traceID) | 92,905 | 162,224 | 2,028 |
| | New (service) | 99,716 | 153,208 | 2,021 |
| | Δ | +7.3% | -5.6% | -0.3% |
| 10E_3RL_333L | Old (traceID) | 93,113 | 174,073 | 2,079 |
| | New (service) | 95,356 | 154,071 | 2,048 |
| | Δ | +2.4% | -11.5% | -1.5% |

Summary: Memory usage is consistently lower across all scenarios (5.5% to 11.5% fewer bytes, up to 2.7% fewer allocations). Throughput is comparable; the small ns/op variations are within benchmark noise.

Optimizations Applied

CPU/memory profiling (-cpuprofile, -memprofile) identified CopyToNewLogRecord as the dominant allocation path (97.7% of heap) with GC consuming 72% of CPU time. Optimizations:

  • Direct batch assignment — first batch for an exporter is assigned directly instead of creating an empty plog.NewLogs() + merge, eliminating one intermediate allocation per routing key
  • Pre-sized maps — all make(map[...]..., capacity) calls use known sizes to avoid rehashing
  • strings.Builder reuse — attribute routing reuses a single builder across loop iterations via Reset()

Cross-Signal Context (not a direct comparison)

Note: These numbers are not directly comparable across signals. Each signal benchmarks different data structures and split granularities:

  • Logs (service routing): splits at ResourceLogs level — 1 CopyTo per resource, all log records travel as one unit
  • Traces (traceID routing): uses batchpersignal.SplitTraces which iterates every individual span, creating a separate ptrace.Traces per traceID — 100 spans = 100 CopyTo operations + 100 hash lookups
  • Metrics (service routing): similar resource-level split as logs, but pmetric.Metrics has a deeper structure (metrics → data points) making CopyTo more expensive per resource

The trace exporter's per-span splitting is architecturally more expensive, which accounts for the apparent ~5× gap. This is an inherent design difference, not a performance issue.

| Signal | Scenario | ns/op | B/op | allocs/op | Split granularity |
|---|---|---|---|---|---|
| Logs | 5E × 1RL × 100 records | 10,406 | 16,327 | 219 | per resource (1 split) |
| Traces | 5E × 100 spans | 56,160 | 90,502 | 1,453 | per span (100 splits) |
| Metrics | 5E × 1RM × 100 metrics | 52,359 | 94,387 | 2,431 | per resource (1 split, deeper structure) |
| Logs | 5E × 1RL × 1000 records | 99,716 | 153,208 | 2,021 | per resource (1 split) |
| Traces | 5E × 1000 spans | 552,894 | 882,992 | 14,068 | per span (1000 splits) |
| Metrics | 5E × 1RM × 1000 metrics | 524,073 | 929,740 | 24,031 | per resource (1 split, deeper structure) |

Link to tracking issue

Fixes #40223

Testing

Unit tests:

  • Split function tests for splitLogsByServiceName, splitLogsByResourceID, splitLogsByAttributes, splitLogsByTraceID with comprehensive edge cases

E2E routing isolation tests:

  • TestE2ERoutingIsolation — 6 services × 20 rounds across 3 endpoints, proves deterministic routing and distribution
  • TestE2EResourceRoutingIsolation — same service, different hosts, proves resource-level isolation

Data integrity tests:

  • TestDataIntegrityThroughRouting — verifies body, severity, timestamps, scope info, and all attributes survive routing without corruption
  • TestDataIntegrityNoCrossContaminationBetweenServices — proves logs from different services never get mixed
  • TestDataIntegrityLogCountPreserved — verifies exact log record count preservation for batch sizes 1-100

Integration and backward compatibility tests:

  • TestConsumeLogsWithTraceIDRouting — traceID routing backward compatibility
  • Consistency tests for both service and resource routing
  • All tests pass with -race flag

Benchmarks:

  • Parameterized across routing keys (service, resource), endpoint counts (1/5/10), and log volumes (100/500/1000)

Documentation

  • Updated README.md routing_key table to show logs support for service, traceID, resource, and attributes
  • Updated routing_key property descriptions with log-specific information
  • Added changelog entry

Assisted-by: Claude Opus 4.6
Support service, resource, and attributes routing keys for logs.
Default routing changed from traceID to service (service.name).
Remove traceID-based log routing as it was not applicable for
most log use cases.

Assisted-by: Claude Opus 4.6
Add unit tests for split functions, integration tests for each routing
mode, consistency tests, and benchmark tests for performance regression
detection.

Assisted-by: Claude Opus 4.6
Update README to reflect that routing_key now works for logs with
service (default), resource, and attributes routing modes.

Assisted-by: Claude Opus 4.6
Add edge case tests for split functions (empty inputs, mixed valid/
invalid service names, data preservation, attribute lookup priority).
Add integration tests for triple-endpoint routing, export failure,
empty batches, partial success with mixed service names, and resource
routing consistency.

Assisted-by: Claude Opus 4.6
@szibis szibis requested a review from a team as a code owner February 20, 2026 18:33
@szibis szibis requested a review from VihasMakwana February 20, 2026 18:33
@github-actions github-actions bot added the first-time contributor PRs made by new contributors label Feb 20, 2026

@szibis szibis force-pushed the feature/logs-hashing-loadbalancer branch from 58d802e to 99ccf46 Compare February 20, 2026 18:38
Add traceID routing support for logs to maintain backward compatibility
with existing configurations. Add e2e routing isolation tests, data
integrity verification tests (no loss/duplication/corruption), and
cross-contamination tests.

Assisted-by: Claude Opus 4.6
- Avoid intermediate plog.NewLogs() allocation on duplicate key merge:
  copy directly into existing result instead of creating temp then merging
- In ConsumeLogs, assign first batch directly to exporter instead of
  creating empty plog.NewLogs() + merge
- Pre-size all maps to avoid rehashing during growth
- Reuse strings.Builder across iterations in splitLogsByAttributes

Assisted-by: Claude Opus 4.6
- Remove unused simpleLogWithID function
- Use assert.Empty/require.Empty instead of Len(x, 0)
- Replace manual map copy loop with maps.Copy
- Fix gofumpt formatting for multi-line slice literal

Assisted-by: Claude Opus 4.6

Development

Successfully merging this pull request may close these issues.

[loadbalancingexporter] Logs - duplicate traffic + routing_key not working
