[exporter/loadbalancing] Add routing key support for logs #46241
Open
szibis wants to merge 12 commits into open-telemetry:main
…g exporter Assisted-by: Claude Opus 4.6
Support service, resource, and attributes routing keys for logs. Default routing changed from traceID to service (service.name). Remove traceID-based log routing as it was not applicable for most log use cases. Assisted-by: Claude Opus 4.6
Add unit tests for split functions, integration tests for each routing mode, consistency tests, and benchmark tests for performance regression detection. Assisted-by: Claude Opus 4.6
Update README to reflect that routing_key now works for logs with service (default), resource, and attributes routing modes. Assisted-by: Claude Opus 4.6
Add edge case tests for split functions (empty inputs, mixed valid/invalid service names, data preservation, attribute lookup priority). Add integration tests for triple-endpoint routing, export failure, empty batches, partial success with mixed service names, and resource routing consistency. Assisted-by: Claude Opus 4.6
Add traceID routing support for logs to maintain backward compatibility with existing configurations. Add e2e routing isolation tests, data integrity verification tests (no loss/duplication/corruption), and cross-contamination tests. Assisted-by: Claude Opus 4.6
- Avoid intermediate plog.NewLogs() allocation on duplicate key merge: copy directly into existing result instead of creating temp then merging
- In ConsumeLogs, assign first batch directly to exporter instead of creating empty plog.NewLogs() + merge
- Pre-size all maps to avoid rehashing during growth
- Reuse strings.Builder across iterations in splitLogsByAttributes
Assisted-by: Claude Opus 4.6
- Remove unused simpleLogWithID function
- Use assert.Empty/require.Empty instead of Len(x, 0)
- Replace manual map copy loop with maps.Copy
- Fix gofumpt formatting for multi-line slice literal
Assisted-by: Claude Opus 4.6
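The allocation-related changes listed in the commits above (pre-sized maps, appending directly into an existing group, a single builder emptied with `Reset()`) follow a common Go pattern. The sketch below illustrates that pattern only; `Record` and `SplitByAttributes` are hypothetical stand-ins, not the exporter's `plog`-based code, and the `|` separator is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Record is a simplified, hypothetical stand-in for a log record.
// The real exporter operates on pdata plog types.
type Record struct {
	Attrs map[string]string
	Body  string
}

// SplitByAttributes groups records by a key built from the named attributes.
func SplitByAttributes(records []Record, names []string) map[string][]Record {
	// Pre-size the map: there can be at most one group per record.
	groups := make(map[string][]Record, len(records))

	// One builder declared outside the loop and emptied on each iteration.
	var sb strings.Builder
	for _, rec := range records {
		sb.Reset()
		for _, n := range names {
			sb.WriteString(rec.Attrs[n])
			sb.WriteByte('|') // separator is an assumption for this sketch
		}
		key := sb.String()

		// Append straight into the existing group instead of building
		// a temporary batch and merging it afterwards.
		groups[key] = append(groups[key], rec)
	}
	return groups
}

func main() {
	recs := []Record{
		{Attrs: map[string]string{"env": "prod", "team": "a"}, Body: "one"},
		{Attrs: map[string]string{"env": "prod", "team": "a"}, Body: "two"},
		{Attrs: map[string]string{"env": "dev", "team": "b"}, Body: "three"},
	}
	groups := SplitByAttributes(recs, []string{"env", "team"})
	fmt.Println(len(groups), len(groups["prod|a|"])) // prints: 2 2
}
```

One caveat worth keeping in mind: `strings.Builder.Reset()` releases the builder's underlying buffer, so any saving from reusing the builder is likely smaller than the saving from pre-sizing maps and skipping the temporary batch.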
Description
Implements proper `routing_key` support for logs in the loadbalancing exporter. Fully backward compatible: existing configurations continue to work without changes.

The problem: The previous log routing implementation used traceID-based hashing regardless of the configured `routing_key`. Since most logs don't carry trace IDs, this meant `routing_key: "service"` had no effect. Logs were essentially randomly distributed across backends, making per-service stateful downstream processing (throttling, rate-limiting, deduplication) impossible.

The fix: Rewrite the log exporter to support configurable routing modes, mirroring the existing metrics exporter architecture. Supported `routing_key` values for logs:

- `service` (default): the `service.name` resource attribute
- `traceID`
- `resource`
- `attributes`

Backward compatibility: The `traceID` routing key is preserved for logs, so existing configurations with `routing_key: "traceID"` continue to work identically. The default routing key changes from implicit traceID to explicit `service`. This is the intended fix, as the old behavior was effectively random for most logs (since they lack trace IDs). Users who explicitly want the old behavior can set `routing_key: "traceID"`.

Use case: per-service log processing with stateful backends
A common production pattern is collecting logs via filelog across nodes, then routing through a loadbalancing exporter to a fleet of stateful backend collectors (e.g. OTel Collector aggregators in a StatefulSet) for per-service processing like rate limiting, log reduction, or tail-based sampling. Without proper routing, the same service's logs scatter across all backend pods, making per-service processing ineffective (you get per-pod limits instead of per-service limits).
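The per-service routing described above depends on every gateway mapping the same service name to the same backend. A minimal sketch of that idea follows, assuming a simple CRC32-based hash ring with virtual nodes. `LogRecord`, `Ring`, `NewRing`, and `RouteByService` are hypothetical names for illustration, and the exporter's actual hashing may differ.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// LogRecord is a simplified, hypothetical stand-in for a log record;
// the real exporter works on pdata plog types.
type LogRecord struct {
	ServiceName string // value of the service.name resource attribute
	Body        string
}

// Ring is a minimal consistent-hash ring (illustrative only).
type Ring struct {
	points    []uint32          // sorted positions on the ring
	endpoints map[uint32]string // ring position -> endpoint
}

// NewRing places each endpoint at several virtual positions on the ring.
// Endpoints are sorted first so the result does not depend on input order.
func NewRing(endpoints []string, replicas int) *Ring {
	sorted := append([]string(nil), endpoints...)
	sort.Strings(sorted)
	r := &Ring{endpoints: make(map[uint32]string, len(sorted)*replicas)}
	for _, ep := range sorted {
		for i := 0; i < replicas; i++ {
			p := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", ep, i)))
			r.points = append(r.points, p)
			r.endpoints[p] = ep
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Endpoint returns the endpoint owning the first ring position at or
// after the key's hash, wrapping around at the end of the ring.
func (r *Ring) Endpoint(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.endpoints[r.points[i]]
}

// RouteByService assigns every record to the endpoint chosen for its service name.
func RouteByService(r *Ring, logs []LogRecord) map[string][]LogRecord {
	out := make(map[string][]LogRecord)
	for _, lr := range logs {
		ep := r.Endpoint(lr.ServiceName)
		out[ep] = append(out[ep], lr)
	}
	return out
}

func main() {
	backends := []string{"collector-0:4317", "collector-1:4317", "collector-2:4317"}
	gatewayA := NewRing(backends, 100) // two independent gateways,
	gatewayB := NewRing(backends, 100) // same backend list
	for _, svc := range []string{"svc-a", "svc-b", "svc-c"} {
		// Both gateways pick the same backend for a given service.
		fmt.Println(svc, gatewayA.Endpoint(svc) == gatewayB.Endpoint(svc))
	}
}
```

Because the endpoint depends only on the service name and the backend list, two gateways that see the same backends agree on the destination without coordinating with each other.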
flowchart LR
    subgraph K8s Nodes
        FL1[filelog receiver<br/>Node 1]
        FL2[filelog receiver<br/>Node 2]
        FL3[filelog receiver<br/>Node 3]
    end
    subgraph Gateway Collectors
        LB1[loadbalancing exporter<br/>routing_key: service]
        LB2[loadbalancing exporter<br/>routing_key: service]
        LB3[loadbalancing exporter<br/>routing_key: service]
    end
    subgraph Aggregator Collectors - StatefulSet
        A0[collector-0<br/>rate-limit: svc-a, svc-b]
        A1[collector-1<br/>rate-limit: svc-c, svc-d]
        A2[collector-2<br/>rate-limit: svc-e, svc-f]
    end
    FL1 --> LB1
    FL2 --> LB2
    FL3 --> LB3
    LB1 -->|svc-a logs| A0
    LB1 -->|svc-c logs| A1
    LB2 -->|svc-a logs| A0
    LB2 -->|svc-d logs| A1
    LB3 -->|svc-b logs| A0
    LB3 -->|svc-f logs| A2
    style A0 fill:#2d6,stroke:#333,color:#fff
    style A1 fill:#26d,stroke:#333,color:#fff
    style A2 fill:#d62,stroke:#333,color:#fff

Key property: All gateway collectors consistently route `svc-a` logs to the same aggregator pod (via consistent hashing), regardless of which node collected them. This makes per-service rate limiting effective because a single aggregator sees 100% of a given service's logs.

Before this PR: `routing_key: "service"` was silently ignored for logs. Traffic was routed by traceID (effectively random for most logs), causing scattered delivery as reported in the linked issue.

After this PR: Logs are consistently routed by `service.name` (or resource identity, or custom attributes), enabling reliable per-service stateful processing.

Benchmark Results
All benchmarks run on Apple M3 Max with `-benchmem -benchtime=3s`.

Log Routing: New (service) vs Old (traceID) Baseline
Summary: Memory usage is consistently lower across all scenarios (5.5% to 11.5% fewer bytes, up to 2.7% fewer allocations). Throughput is comparable; small variations are within benchmark noise.
Optimizations Applied
CPU/memory profiling (`-cpuprofile`, `-memprofile`) identified `CopyTo` → `NewLogRecord` as the dominant allocation path (97.7% of heap), with GC consuming 72% of CPU time. Optimizations:

- Duplicate-key merges copy directly into the existing result instead of creating a temporary `plog.NewLogs()` + merge, eliminating one intermediate allocation per routing key
- `make(map[...]..., capacity)` calls use known sizes to avoid rehashing
- `strings.Builder` reuse: attribute routing reuses a single builder across loop iterations via `Reset()`

Cross-Signal Context (not a direct comparison)
Link to tracking issue
Fixes #40223
Testing
Unit tests: `splitLogsByServiceName`, `splitLogsByResourceID`, `splitLogsByAttributes`, `splitLogsByTraceID` with comprehensive edge cases

E2E routing isolation tests:
- `TestE2ERoutingIsolation`: 6 services × 20 rounds across 3 endpoints, proves deterministic routing and distribution
- `TestE2EResourceRoutingIsolation`: same service, different hosts, proves resource-level isolation

Data integrity tests:
- `TestDataIntegrityThroughRouting`: verifies body, severity, timestamps, scope info, and all attributes survive routing without corruption
- `TestDataIntegrityNoCrossContaminationBetweenServices`: proves logs from different services never get mixed
- `TestDataIntegrityLogCountPreserved`: verifies exact log record count preservation for batch sizes 1-100

Integration and backward compatibility tests:
- `TestConsumeLogsWithTraceIDRouting`: traceID routing backward compatibility
- `-race` flag

Benchmarks:
Documentation
Updated the `README.md` `routing_key` table to show logs support for `service`, `traceID`, `resource`, and `attributes`