Skip to content

feat: add more spans to opentelemetry plugin#12686

Open
nic-6443 wants to merge 54 commits intoapache:masterfrom
nic-6443:nic/opentelemetry
Open

feat: add more spans to opentelemetry plugin#12686
nic-6443 wants to merge 54 commits intoapache:masterfrom
nic-6443:nic/opentelemetry

Conversation

@nic-6443
Copy link
Member

@nic-6443 nic-6443 commented Oct 19, 2025

Description

Run jaeger in local by https://www.jaegertracing.io/docs/2.11/getting-started/#all-in-one

image
  • New Features

    • Added comprehensive distributed tracing across request lifecycle (SSL/SNI, access, header/body filter, upstream, and logging).
    • Enhanced OpenTelemetry integration with improved span propagation, context management, and per-request span lifecycle.
    • New tracing configuration option to enable detailed observability.
  • Tests

    • Added OpenTelemetry plugin tracing test suite validating span generation and propagation.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

Signed-off-by: Nic <qianyong@api7.ai>
Signed-off-by: Nic <qianyong@api7.ai>
f
Signed-off-by: Nic <qianyong@api7.ai>
@Revolyssup Revolyssup marked this pull request as ready for review October 23, 2025 14:21
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. enhancement New feature or request labels Oct 23, 2025
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Oct 23, 2025
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Oct 23, 2025
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Oct 24, 2025
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jan 29, 2026
membphis
membphis previously approved these changes Jan 29, 2026
@github-actions github-actions bot removed the stale label Jan 29, 2026
end

inject_core_spans(ctx, api_ctx, conf)
span:set_attributes(attr.int("http.status_code", upstream_status))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use http.response.status_code to comply with the OTEL semantic conventions.

https://opentelemetry.io/docs/specs/semconv/registry/attributes/http/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines 305 to 313
-> net.host.name: Str(127.0.0.1)
-> http.method: Str(GET)
-> http.scheme: Str(http)
-> http.target: Str(/anything)
-> http.user_agent: Str(curl/7.64.1)
-> http.target: Str(/headers)
-> http.user_agent: Str(curl/8.16.0)
-> apisix.route_id: Str(otel-tracing-route)
-> apisix.route_name: Empty()
-> http.route: Str(/anything)
-> http.route: Str(/headers)
-> http.status_code: Int(200)
Copy link
Contributor

@bzp2010 bzp2010 Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. check them again against the conventions https://opentelemetry.io/docs/specs/semconv/registry/attributes/http/

For previously added attributes that do not comply with the convention, you may retain them, but please add a new attribute that complies with the convention. Because those old attributes may have been marked as deprecated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines 304 to 312
-> net.host.name: Str(127.0.0.1)
-> http.method: Str(GET)
-> http.scheme: Str(http)
-> http.target: Str(/headers)
-> http.user_agent: Str(curl/8.16.0)
-> apisix.route_id: Str(otel-tracing-route)
-> apisix.route_name: Empty()
-> http.route: Str(/headers)
-> http.status_code: Int(200)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines 173 to 185
@@ -177,9 +179,10 @@ function _M.match_and_set(api_ctx, match_only, alt_sni)
-- with it sometimes
core.log.error("failed to find any SSL certificate by SNI: ", sni)
end
tracer.finish(api_ctx.ngx_ctx, tracer.status.ERROR, "failed match SNI")
return false
end

tracer.finish(api_ctx.ngx_ctx)
Copy link
Contributor

@bzp2010 bzp2010 Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly, this API design is still strange, with many internal details hidden within the stack data structure. For developers, it presents an opaque data structure.
If any span of code begins with tracer.start but fails to conclude with tracer.finish, the entire tracing process will encounter an error. This unclosed span will encompass all subsequent spans, which is not the intended behavior.

If we examine OTEL implementations in other major programming languages, we find that many of them follow the following pattern:

Use the tracer's method to create a span, then obtain an instance of that span. After internal operations complete, use the span instance's methods to actively terminate the specified span. The ctx will be used to track the current span, ensuring that parent-child relationships are correctly maintained when creating child spans.

-- context-based
local parent_span = tracer.start(ctx, "parent")

local child_span = tracer.start(ctx, "child")

child_span:end()

parent_span:end()

ref: https://opentelemetry.io/docs/languages/go/instrumentation/#create-nested-spans
ref: https://opentelemetry.io/docs/languages/php/instrumentation/#create-nested-spans
ref: https://opentelemetry.io/docs/languages/js/instrumentation/#create-nested-spans
ref: https://opentelemetry.io/docs/languages/java/api/#span
ref: https://opentelemetry.io/docs/languages/cpp/instrumentation/#create-nested-spans

Alternatively, explicitly pass the parent span instance to the child span to achieve precise hierarchical control.

-- pass span directly
local parent_span = tracer.start("parent")

local child_span = tracer.start("child", parent_span)

child_span:end()

parent_span:end()

Note that the following Rust code typically terminates spans automatically via drop calls when the span instance's variable scopes end.

ref: https://docs.rs/tracing/latest/tracing/span/index.html#span-relationships

None of the OTEL libraries for these languages organize and manage the entire span stack using only global state. Each one terminates spans by returning span instances and requiring developers to explicitly call span.end() when operations conclude.

Our current approach requires any developer writing trace code to be familiar with the stack mechanism we've created and to have complete clarity about where their spans appear in the stack, whether parent spans exist, whether they've closed as expected, or any other state. Otherwise, they cannot write correct code. Therefore, I disagree with the current API design.

It's all too easy to make mistakes like the one below.

tracer.start("parent")

tracer.start("child")

tracer.finish()

-- missing last tracer.finish

The result is an output issue, because regardless of what you do, calling finish_all will terminate the span. However, the parent-child relationships and timestamps for all subsequent spans will be incorrect. This won't trigger any obvious error messages, making it extremely difficult for users to debug.


I have already pointed this out in #12686 (comment), and @membphis has clearly reiterated this point through images.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In APISIX, explicitly managing span hierarchies is not practical, especially across multiple modules. The management cost is high and error-prone: if a child span misses a finish call, all subsequent spans may be attached to the wrong parent.

To address this, the model is adjusted to be centered around current_span.
tracer.start only updates the current execution point (current_span), while tracer.finish(span) recursively finishes all child spans of the given span. This avoids hierarchy corruption caused by missing finish calls on child spans.

Example behavior:

-- current_span --> parent
local parent = tracer.start("parent")
-- current_span --> child
local child = tracer.start("child")

-- child missing finish

-- finishing parent will automatically finish child
tracer.finish(parent)

This approach keeps the span hierarchy consistent without increasing the complexity for callers.

And it is pluggable, so when detailed tracing is not enabled, there is no need to make additional redundant judgments when calling start/minimize.

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Feb 3, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new “comprehensive request lifecycle tracing” mode for the OpenTelemetry integration, instrumenting more APISIX internals and documenting/testing the resulting spans.

Changes:

  • Introduces a core tracing utility (apisix/tracer.lua) and span model (apisix/utils/span.lua) to capture APISIX-internal spans across phases.
  • Instruments multiple request lifecycle points (SSL/SNI matching, DNS resolve, plugin/global-rule execution, phases) and injects them into OpenTelemetry export.
  • Adds apisix.tracing config flag, updates EN/ZH docs, and adds a new tracing-focused test.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
t/plugin/opentelemetry6.t New test covering additional spans being emitted and exported.
docs/zh/latest/plugins/opentelemetry.md Documents enabling comprehensive lifecycle tracing and shows updated collector output.
docs/en/latest/plugins/opentelemetry.md Same as ZH docs, for English readers.
conf/config.yaml.example Adds apisix.tracing example + explanation.
apisix/cli/config.lua Adds default apisix.tracing = false in CLI default config.
apisix/tracer.lua New internal tracer used to create/finish APISIX core spans.
apisix/utils/span.lua New span container type used by the internal tracer.
apisix/init.lua Starts/finishes internal spans across HTTP + SSL phases; finishes all spans in log phase and releases tracing state.
apisix/plugin.lua Wraps plugin execution + global rules with spans.
apisix/utils/upstream.lua Adds a resolve_dns span around domain resolution.
apisix/ssl/router/radixtree_sni.lua Adds a sni_radixtree_match span around SNI routing.
apisix/secret.lua Adds a fetch_secret span around secret retrieval.
apisix/plugins/opentelemetry.lua Adds attributes + injects internal APISIX spans into OpenTelemetry export in log phase.
apisix/core/response.lua Attempts to mark spans as error on >= 400 exits.
Comments suppressed due to low confidence (1)

apisix/secret.lua:157

  • A span is started (span = tracer.start(...)) but on the pcall(require, ...) failure path the function returns without finishing the span, which will leak an unfinished span into ngx.ctx.tracing. Finish the span with an error status before returning.
    local span = tracer.start(ngx.ctx, "fetch_secret", tracer.kind.client)
    local ok, sm = pcall(require, "apisix.secret." .. opts.manager)
    if not ok then
        return nil, "no secret manager: " .. opts.manager
    end

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

-- with it sometimes
core.log.error("failed to find any SSL certificate by SNI: ", sni)
end
tracer.finish(api_ctx.ngx_ctx, span, tracer.status.ERROR, "failed match SNI")
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracer.finish is called with the wrong argument order here. The signature is finish(ctx, sp, code, message), but this passes (ctx, code, message) which will attempt to index a non-span value and can crash on SNI-miss. Pass the started span as the second argument (and the error code/message as 3rd/4th), and ensure the span is finished before returning.

Copilot uses AI. Check for mistakes.
Comment on lines +37 to +52
function _M.start(ctx, name, kind)
if not enable_tracing then
return
end

local tracing = ctx and ctx.tracing
if not tracing then
local root_span = span.new()
tracing = tablepool.fetch("tracing", 0, 8)
tracing.spans = tablepool.fetch("tracing_spans", 20, 0)
tracing.root_span = root_span
tracing.current_span = root_span
table.insert(tracing.spans, root_span)
root_span.id = 1
ctx.tracing = tracing
end
Copy link

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start() can be called with ctx == nil (call sites already pass ngx.ctx in some places, but this function itself allows ctx to be nil via ctx and ctx.tracing). However it unconditionally assigns ctx.tracing = tracing, which will throw if ctx is nil. Add an early return when ctx is nil (or require callers to always pass a non-nil ctx).

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

apisix/secret.lua:157

  • A span is started before pcall(require, ...), but when require fails this function returns without finishing the span. That leaves an open child span in ngx.ctx.tracing and can skew parent/child tracking until finish_all runs. Finish the span on this error path as well (and set ERROR status/message).
    local span = tracer.start(ngx.ctx, "fetch_secret", tracer.kind.client)
    local ok, sm = pcall(require, "apisix.secret." .. opts.manager)
    if not ok then
        return nil, "no secret manager: " .. opts.manager
    end

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants