[exporter/loadbalancing] Add routing key support for logs#46241

Open
szibis wants to merge 12 commits into open-telemetry:main from szibis:feature/logs-hashing-loadbalancer

Conversation


@szibis szibis commented Feb 20, 2026

Description

Implements proper routing_key support for logs in the loadbalancing exporter. Fully backward compatible — existing configurations continue to work without changes.

The problem: The previous log routing implementation used traceID-based hashing regardless of the configured routing_key. Since most logs don't carry trace IDs, this meant routing_key: "service" had no effect — logs were essentially randomly distributed across backends, making stateful downstream processing (throttling, rate-limiting, deduplication) per-service impossible.

The fix: Rewrite the log exporter to support configurable routing modes, mirroring the existing metrics exporter architecture:

| routing_key | Routing behavior |
|---|---|
| `service` (default) | Hash on `service.name` resource attribute |
| `traceID` | Hash on first log record's traceID (backward compatible) |
| `resource` | Hash on full resource identity (all resource attributes) |
| `attributes` | Hash on configurable attributes from resource, scope, or log record levels |

Backward compatibility: The traceID routing key is preserved for logs, so existing configurations with routing_key: "traceID" continue to work identically. The default routing key changes from implicit traceID to explicit service — this is the intended fix, as the old behavior was effectively random for most logs (since they lack traceIDs). Users who explicitly want the old behavior can set routing_key: "traceID".
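As a minimal configuration sketch (endpoint hostnames are placeholders; the rest follows the loadbalancing exporter's existing config schema), routing logs by service looks like this:

```yaml
exporters:
  loadbalancing:
    # New default for logs; set "traceID" to keep the old behavior.
    routing_key: "service"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - collector-0.collector-headless:4317
          - collector-1.collector-headless:4317
```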

Use case — Per-service log processing with stateful backends:

A common production pattern is collecting logs via filelog across nodes, then routing through a loadbalancing exporter to a fleet of stateful backend collectors (e.g. OTel Collector aggregators in a StatefulSet) for per-service processing like rate limiting, log reduction, or tail-based sampling. Without proper routing, the same service's logs scatter across all backend pods, making per-service processing ineffective (you get per-pod limits instead of per-service limits).

```mermaid
flowchart LR
    subgraph K8s Nodes
        FL1[filelog receiver<br/>Node 1]
        FL2[filelog receiver<br/>Node 2]
        FL3[filelog receiver<br/>Node 3]
    end

    subgraph Gateway Collectors
        LB1[loadbalancing exporter<br/>routing_key: service]
        LB2[loadbalancing exporter<br/>routing_key: service]
        LB3[loadbalancing exporter<br/>routing_key: service]
    end

    subgraph Aggregator Collectors - StatefulSet
        A0[collector-0<br/>rate-limit: svc-a, svc-b]
        A1[collector-1<br/>rate-limit: svc-c, svc-d]
        A2[collector-2<br/>rate-limit: svc-e, svc-f]
    end

    FL1 --> LB1
    FL2 --> LB2
    FL3 --> LB3

    LB1 -->|svc-a logs| A0
    LB1 -->|svc-c logs| A1
    LB2 -->|svc-a logs| A0
    LB2 -->|svc-d logs| A1
    LB3 -->|svc-b logs| A0
    LB3 -->|svc-f logs| A2

    style A0 fill:#2d6,stroke:#333,color:#fff
    style A1 fill:#26d,stroke:#333,color:#fff
    style A2 fill:#d62,stroke:#333,color:#fff
```

Key property: All gateway collectors consistently route svc-a logs to the same aggregator pod (via consistent hashing), regardless of which node collected them. This makes per-service rate limiting effective because a single aggregator sees 100% of a given service's logs.

Before this PR: routing_key: "service" was silently ignored for logs — traffic was routed by traceID (effectively random for most logs), causing scattered delivery as reported in the linked issue.

After this PR: Logs are consistently routed by service.name (or resource identity, or custom attributes), enabling reliable per-service stateful processing.


Benchmark Results

All benchmarks run on Apple M3 Max, -benchmem -benchtime=3s.

Log Routing: New (service) vs Old (traceID) Baseline

| Scenario | Version | ns/op | B/op | allocs/op |
|---|---|---|---|---|
| 5E_1RL_100L | Old (traceID) | 11,341 | 17,282 | 225 |
| | New (service) | 10,406 | 16,327 | 219 |
| | Δ | -8.2% | -5.5% | -2.7% |
| 5E_1RL_1000L | Old (traceID) | 92,905 | 162,224 | 2,028 |
| | New (service) | 99,716 | 153,208 | 2,021 |
| | Δ | +7.3% | -5.6% | -0.3% |
| 10E_3RL_333L | Old (traceID) | 93,113 | 174,073 | 2,079 |
| | New (service) | 95,356 | 154,071 | 2,048 |
| | Δ | +2.4% | -11.5% | -1.5% |

Summary: Memory usage is consistently lower across all scenarios (5.5% to 11.5% fewer bytes, up to 2.7% fewer allocations). Throughput is comparable; the small ns/op variations are within benchmark noise.

Optimizations Applied

CPU/memory profiling (-cpuprofile, -memprofile) identified CopyToNewLogRecord as the dominant allocation path (97.7% of heap) with GC consuming 72% of CPU time. Optimizations:

  • Direct batch assignment — first batch for an exporter is assigned directly instead of creating an empty plog.NewLogs() + merge, eliminating one intermediate allocation per routing key
  • Pre-sized maps — all make(map[...]..., capacity) calls use known sizes to avoid rehashing
  • strings.Builder reuse — attribute routing reuses a single builder across loop iterations via Reset()

Cross-Signal Context (not a direct comparison)

Note: These numbers are not directly comparable across signals. Each signal benchmarks different data structures and split granularities:

  • Logs (service routing): splits at ResourceLogs level — 1 CopyTo per resource, all log records travel as one unit
  • Traces (traceID routing): uses batchpersignal.SplitTraces which iterates every individual span, creating a separate ptrace.Traces per traceID — 100 spans = 100 CopyTo operations + 100 hash lookups
  • Metrics (service routing): similar resource-level split as logs, but pmetric.Metrics has a deeper structure (metrics → data points) making CopyTo more expensive per resource

The trace exporter's per-span splitting is architecturally more expensive, which accounts for the apparent ~5× gap. This is an inherent design difference, not a performance issue.

| Signal | Scenario | ns/op | B/op | allocs/op | Split granularity |
|---|---|---|---|---|---|
| Logs | 5E × 1RL × 100 records | 10,406 | 16,327 | 219 | per resource (1 split) |
| Traces | 5E × 100 spans | 56,160 | 90,502 | 1,453 | per span (100 splits) |
| Metrics | 5E × 1RM × 100 metrics | 52,359 | 94,387 | 2,431 | per resource (1 split, deeper structure) |
| Logs | 5E × 1RL × 1000 records | 99,716 | 153,208 | 2,021 | per resource (1 split) |
| Traces | 5E × 1000 spans | 552,894 | 882,992 | 14,068 | per span (1000 splits) |
| Metrics | 5E × 1RM × 1000 metrics | 524,073 | 929,740 | 24,031 | per resource (1 split, deeper structure) |

Link to tracking issue

Fixes #40223

Testing

Unit tests:

  • Split function tests for splitLogsByServiceName, splitLogsByResourceID, splitLogsByAttributes, splitLogsByTraceID with comprehensive edge cases

E2E routing isolation tests:

  • TestE2ERoutingIsolation — 6 services × 20 rounds across 3 endpoints, proves deterministic routing and distribution
  • TestE2EResourceRoutingIsolation — same service, different hosts, proves resource-level isolation

Data integrity tests:

  • TestDataIntegrityThroughRouting — verifies body, severity, timestamps, scope info, and all attributes survive routing without corruption
  • TestDataIntegrityNoCrossContaminationBetweenServices — proves logs from different services never get mixed
  • TestDataIntegrityLogCountPreserved — verifies exact log record count preservation for batch sizes 1-100

Integration and backward compatibility tests:

  • TestConsumeLogsWithTraceIDRouting — traceID routing backward compatibility
  • Consistency tests for both service and resource routing
  • All tests pass with -race flag

Benchmarks:

  • Parameterized across routing keys (service, resource), endpoint counts (1/5/10), and log volumes (100/500/1000)

Documentation

  • Updated README.md routing_key table to show logs support for service, traceID, resource, and attributes
  • Updated routing_key property descriptions with log-specific information
  • Added changelog entry

Assisted-by: Claude Opus 4.6
Support service, resource, and attributes routing keys for logs.
Default routing changed from traceID to service (service.name).
Remove traceID-based log routing as it was not applicable for
most log use cases.

Assisted-by: Claude Opus 4.6
Add unit tests for split functions, integration tests for each routing
mode, consistency tests, and benchmark tests for performance regression
detection.

Assisted-by: Claude Opus 4.6
Update README to reflect that routing_key now works for logs with
service (default), resource, and attributes routing modes.

Assisted-by: Claude Opus 4.6
Add edge case tests for split functions (empty inputs, mixed valid/
invalid service names, data preservation, attribute lookup priority).
Add integration tests for triple-endpoint routing, export failure,
empty batches, partial success with mixed service names, and resource
routing consistency.

Assisted-by: Claude Opus 4.6
@szibis szibis requested a review from a team as a code owner February 20, 2026 18:33
@szibis szibis requested a review from VihasMakwana February 20, 2026 18:33
@github-actions github-actions bot added the first-time contributor PRs made by new contributors label Feb 20, 2026

@szibis szibis force-pushed the feature/logs-hashing-loadbalancer branch from 58d802e to 99ccf46 Compare February 20, 2026 18:38
Add traceID routing support for logs to maintain backward compatibility
with existing configurations. Add e2e routing isolation tests, data
integrity verification tests (no loss/duplication/corruption), and
cross-contamination tests.

Assisted-by: Claude Opus 4.6
- Avoid intermediate plog.NewLogs() allocation on duplicate key merge:
  copy directly into existing result instead of creating temp then merging
- In ConsumeLogs, assign first batch directly to exporter instead of
  creating empty plog.NewLogs() + merge
- Pre-size all maps to avoid rehashing during growth
- Reuse strings.Builder across iterations in splitLogsByAttributes

Assisted-by: Claude Opus 4.6
- Remove unused simpleLogWithID function
- Use assert.Empty/require.Empty instead of Len(x, 0)
- Replace manual map copy loop with maps.Copy
- Fix gofumpt formatting for multi-line slice literal

Assisted-by: Claude Opus 4.6

Development

Successfully merging this pull request may close these issues.

[loadbalancingexporter] Logs - duplicate traffic + routing_key not working
