feat: add more spans to opentelemetry plugin#12686
feat: add more spans to opentelemetry plugin#12686nic-6443 wants to merge 54 commits intoapache:masterfrom
Conversation
8d48eeb to
359ddc4
Compare
apisix/plugins/opentelemetry.lua
Outdated
| end | ||
|
|
||
| inject_core_spans(ctx, api_ctx, conf) | ||
| span:set_attributes(attr.int("http.status_code", upstream_status)) |
There was a problem hiding this comment.
Use http.response.status_code to comply with the OTEL semantic conventions.
https://opentelemetry.io/docs/specs/semconv/registry/attributes/http/
| -> net.host.name: Str(127.0.0.1) | ||
| -> http.method: Str(GET) | ||
| -> http.scheme: Str(http) | ||
| -> http.target: Str(/anything) | ||
| -> http.user_agent: Str(curl/7.64.1) | ||
| -> http.target: Str(/headers) | ||
| -> http.user_agent: Str(curl/8.16.0) | ||
| -> apisix.route_id: Str(otel-tracing-route) | ||
| -> apisix.route_name: Empty() | ||
| -> http.route: Str(/anything) | ||
| -> http.route: Str(/headers) | ||
| -> http.status_code: Int(200) |
There was a problem hiding this comment.
Ditto. check them again against the conventions https://opentelemetry.io/docs/specs/semconv/registry/attributes/http/
For previously added attributes that do not comply with the convention, you may retain them, but please add a new attribute that complies with the convention. Because those old attributes may have been marked as deprecated.
| -> net.host.name: Str(127.0.0.1) | ||
| -> http.method: Str(GET) | ||
| -> http.scheme: Str(http) | ||
| -> http.target: Str(/headers) | ||
| -> http.user_agent: Str(curl/8.16.0) | ||
| -> apisix.route_id: Str(otel-tracing-route) | ||
| -> apisix.route_name: Empty() | ||
| -> http.route: Str(/headers) | ||
| -> http.status_code: Int(200) |
apisix/ssl/router/radixtree_sni.lua
Outdated
| @@ -177,9 +179,10 @@ function _M.match_and_set(api_ctx, match_only, alt_sni) | |||
| -- with it sometimes | |||
| core.log.error("failed to find any SSL certificate by SNI: ", sni) | |||
| end | |||
| tracer.finish(api_ctx.ngx_ctx, tracer.status.ERROR, "failed match SNI") | |||
| return false | |||
| end | |||
|
|
|||
| tracer.finish(api_ctx.ngx_ctx) | |||
There was a problem hiding this comment.
Strictly, this API design is still strange, with many internal details hidden within the stack data structure. For developers, it presents an opaque data structure.
If any span of code begins with tracer.start but fails to conclude with tracer.finish, the entire tracing process will encounter an error. This unclosed span will encompass all subsequent spans, which is not the intended behavior.
If we examine OTEL implementations in other major programming languages, we find that many of them follow the following pattern:
Use the tracer's method to create a span, then obtain an instance of that span. After internal operations complete, use the span instance's methods to actively terminate the specified span. The ctx will be used to track the current span, ensuring that parent-child relationships are correctly maintained when creating child spans.
-- context-based
local parent_span = tracer.start(ctx, "parent")
local child_span = tracer.start(ctx, "child")
child_span:end()
parent_span:end()ref: https://opentelemetry.io/docs/languages/go/instrumentation/#create-nested-spans
ref: https://opentelemetry.io/docs/languages/php/instrumentation/#create-nested-spans
ref: https://opentelemetry.io/docs/languages/js/instrumentation/#create-nested-spans
ref: https://opentelemetry.io/docs/languages/java/api/#span
ref: https://opentelemetry.io/docs/languages/cpp/instrumentation/#create-nested-spans
Alternatively, explicitly pass the parent span instance to the child span to achieve precise hierarchical control.
-- pass span directly
local parent_span = tracer.start("parent")
local child_span = tracer.start("child", parent_span)
child_span:end()
parent_span:end()Note that the following Rust code typically terminates spans automatically via drop calls when the span instance's variable scopes end.
ref: https://docs.rs/tracing/latest/tracing/span/index.html#span-relationships
None of the OTEL libraries for these languages organize and manage the entire span stack using only global state. Each one terminates spans by returning span instances and requiring developers to explicitly call span.end() when operations conclude.
Our current approach requires any developer writing trace code to be familiar with the stack mechanism we've created and to have complete clarity about where their spans appear in the stack, whether parent spans exist, whether they've closed as expected, or any other state. Otherwise, they cannot write correct code. Therefore, I disagree with the current API design.
It's all too easy to make mistakes like the one below.
tracer.start("parent")
tracer.start("child")
tracer.finish()
-- missing last tracer.finish
The result is an output issue, because regardless of what you do, calling finish_all will terminate the span. However, the parent-child relationships and timestamps for all subsequent spans will be incorrect. This won't trigger any obvious error messages, making it extremely difficult for users to debug.
I have already pointed this out in #12686 (comment), and @membphis has clearly reiterated this point through images.
There was a problem hiding this comment.
In APISIX, explicitly managing span hierarchies is not practical, especially across multiple modules. The management cost is high and error-prone: if a child span misses a finish call, all subsequent spans may be attached to the wrong parent.
To address this, the model is adjusted to be centered around current_span.
tracer.start only updates the current execution point (current_span), while tracer.finish(span) recursively finishes all child spans of the given span. This avoids hierarchy corruption caused by missing finish calls on child spans.
Example behavior:
-- current_span --> parent
local parent = tracer.start("parent")
-- current_span --> child
local child = tracer.start("child")
-- child missing finish
-- finishing parent will automatically finish child
tracer.finish(parent)This approach keeps the span hierarchy consistent without increasing the complexity for callers.
And it is pluggable, so when detailed tracing is not enabled, there is no need to make additional redundant judgments when calling start/minimize.
dba7342 to
d3f9bd4
Compare
9e29643 to
9616ff5
Compare
There was a problem hiding this comment.
Pull request overview
Adds a new “comprehensive request lifecycle tracing” mode for the OpenTelemetry integration, instrumenting more APISIX internals and documenting/testing the resulting spans.
Changes:
- Introduces a core tracing utility (
apisix/tracer.lua) and span model (apisix/utils/span.lua) to capture APISIX-internal spans across phases. - Instruments multiple request lifecycle points (SSL/SNI matching, DNS resolve, plugin/global-rule execution, phases) and injects them into OpenTelemetry export.
- Adds
apisix.tracingconfig flag, updates EN/ZH docs, and adds a new tracing-focused test.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
t/plugin/opentelemetry6.t |
New test covering additional spans being emitted and exported. |
docs/zh/latest/plugins/opentelemetry.md |
Documents enabling comprehensive lifecycle tracing and shows updated collector output. |
docs/en/latest/plugins/opentelemetry.md |
Same as ZH docs, for English readers. |
conf/config.yaml.example |
Adds apisix.tracing example + explanation. |
apisix/cli/config.lua |
Adds default apisix.tracing = false in CLI default config. |
apisix/tracer.lua |
New internal tracer used to create/finish APISIX core spans. |
apisix/utils/span.lua |
New span container type used by the internal tracer. |
apisix/init.lua |
Starts/finishes internal spans across HTTP + SSL phases; finishes all spans in log phase and releases tracing state. |
apisix/plugin.lua |
Wraps plugin execution + global rules with spans. |
apisix/utils/upstream.lua |
Adds a resolve_dns span around domain resolution. |
apisix/ssl/router/radixtree_sni.lua |
Adds a sni_radixtree_match span around SNI routing. |
apisix/secret.lua |
Adds a fetch_secret span around secret retrieval. |
apisix/plugins/opentelemetry.lua |
Adds attributes + injects internal APISIX spans into OpenTelemetry export in log phase. |
apisix/core/response.lua |
Attempts to mark spans as error on >= 400 exits. |
Comments suppressed due to low confidence (1)
apisix/secret.lua:157
- A span is started (
span = tracer.start(...)) but on thepcall(require, ...)failure path the function returns without finishing the span, which will leak an unfinished span intongx.ctx.tracing. Finish the span with an error status before returning.
local span = tracer.start(ngx.ctx, "fetch_secret", tracer.kind.client)
local ok, sm = pcall(require, "apisix.secret." .. opts.manager)
if not ok then
return nil, "no secret manager: " .. opts.manager
end
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| -- with it sometimes | ||
| core.log.error("failed to find any SSL certificate by SNI: ", sni) | ||
| end | ||
| tracer.finish(api_ctx.ngx_ctx, span, tracer.status.ERROR, "failed match SNI") |
There was a problem hiding this comment.
tracer.finish is called with the wrong argument order here. The signature is finish(ctx, sp, code, message), but this passes (ctx, code, message) which will attempt to index a non-span value and can crash on SNI-miss. Pass the started span as the second argument (and the error code/message as 3rd/4th), and ensure the span is finished before returning.
| function _M.start(ctx, name, kind) | ||
| if not enable_tracing then | ||
| return | ||
| end | ||
|
|
||
| local tracing = ctx and ctx.tracing | ||
| if not tracing then | ||
| local root_span = span.new() | ||
| tracing = tablepool.fetch("tracing", 0, 8) | ||
| tracing.spans = tablepool.fetch("tracing_spans", 20, 0) | ||
| tracing.root_span = root_span | ||
| tracing.current_span = root_span | ||
| table.insert(tracing.spans, root_span) | ||
| root_span.id = 1 | ||
| ctx.tracing = tracing | ||
| end |
There was a problem hiding this comment.
start() can be called with ctx == nil (call sites already pass ngx.ctx in some places, but this function itself allows ctx to be nil via ctx and ctx.tracing). However it unconditionally assigns ctx.tracing = tracing, which will throw if ctx is nil. Add an early return when ctx is nil (or require callers to always pass a non-nil ctx).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
apisix/secret.lua:157
- A span is started before
pcall(require, ...), but whenrequirefails this function returns without finishing the span. That leaves an open child span inngx.ctx.tracingand can skew parent/child tracking untilfinish_allruns. Finish the span on this error path as well (and set ERROR status/message).
local span = tracer.start(ngx.ctx, "fetch_secret", tracer.kind.client)
local ok, sm = pcall(require, "apisix.secret." .. opts.manager)
if not ok then
return nil, "no secret manager: " .. opts.manager
end
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Description
Run jaeger in local by https://www.jaegertracing.io/docs/2.11/getting-started/#all-in-one
New Features
tracingconfiguration option to enable detailed observability.Tests
Checklist