Skip to content

Add Tempo toolset#29

Open
andreasgerstmayr wants to merge 3 commits intorhobs:mainfrom
andreasgerstmayr:tracing
Open

Add Tempo toolset#29
andreasgerstmayr wants to merge 3 commits intorhobs:mainfrom
andreasgerstmayr:tracing

Conversation

@andreasgerstmayr
Copy link
Contributor

Add Tempo tools:

  • tempo_list_instances
  • tempo_search_traces
  • tempo_get_trace_by_id
  • tempo_search_tags
  • tempo_search_tag_values

@andreasgerstmayr andreasgerstmayr requested a review from a team February 6, 2026 13:37
Copy link
Member

@saswatamcode saswatamcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! Would love to see some evals for this :)

Some comments/questions in the meantime!

turn_metrics:
- "custom:tool_eval"

# TODO: gpt-4o-mini has issues creating valid TraceQL queries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the purpose of this MCP server would be to make constructing o11y queries more accurate!

Would love to see evals of this enabled alongside Prometheus as well!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included the TraceQL parser and markdown docs from Tempo (as a Go package dependency), but then I realized it's AGPL and embedding it in a statically linked program with ASL 2.0 license violates the AGPL license, so I removed it again before opening the PR.

Do you have suggestions on this? I guess I need to write my own version of the TraceQL docs? :/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can include some common examples of TraceQL and that might work well

@@ -0,0 +1,185 @@
package tempo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can follow the Loader interface pattern that we already do for Prometheus/AM?

func SearchTracesTool() mcp.Tool {
return mcp.NewTool(
"tempo_search_traces",
mcp.WithDescription("Search for traces in Tempo"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ideally we have more descriptive tool descriptions?

"github.com/mark3labs/mcp-go/mcp"
)

func ListInstancesTool() mcp.Tool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something. How is the LLM going to decide which instance to pick?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LLM should ask the user.

Example from Claude Code:

❯ show me all traces with errors

● I'll help you find traces with errors. First, let me list the available Tempo instances.

● obs-mcp - tempo_list_instances (MCP)
  ⎿  {
       "instances": [
         {
     … +21 lines (ctrl+o to expand)

● I found 2 Tempo instances available:
  ┌──────────┬─────────────────┬────────────────────┬────────┐
  │ Instance │    Namespace    │      Tenants       │ Status │
  ├──────────┼─────────────────┼────────────────────┼────────┤
  │ tempo1   │ obs-mcp-tracing │ project1, project2 │ Ready  │
  ├──────────┼─────────────────┼────────────────────┼────────┤
  │ tempo2   │ obs-mcp-tracing │ stage, prod        │ Ready  │
  └──────────┴─────────────────┴────────────────────┴────────┘
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 ☐ Instance

Which Tempo instance and tenant would you like to search for error traces?

❯ 1. tempo1 / project1
     Query tempo1 instance with project1 tenant
  2. tempo1 / project2
     Query tempo1 instance with project2 tenant
  3. tempo2 / stage
     Query tempo2 instance with stage tenant
  4. tempo2 / prod
     Query tempo2 instance with prod tenant
  5. Type something.
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  6. Chat about this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Well I'm not sure we have a selector UI for the Aladdin/OLS that this will eventually plug into

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That selector just popped up now (maybe it was a Claude Code update), previously it was a regular text question where I could reply with a text response.


---

## `get_current_time`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be its own tool?

I think it might just be easier to tell the model to use a "NOW" keyword for time params, which you can parse into a proper timestamp in the tool itself.

(I think we already do this for the Prometheus tool)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, I saw the NOW keyword and prometheus.ParseTimestamp() only after implementing this tool.

In my experience it worked fine once the LLM knows the current time (by using this tool). What's the advantage of telling the LLM to use NOW keyword and parsing NOW+1h etc.? Afaics it could calculate the relative time fine, and passed it as RFC3339 to the tool.

Or am I spoiled with Claude Code and Opus 4.5, and other models don't work well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more wasted tokens IMO if the model has to call one more tool to get time, on every query, when it could just use Now.

Also, might leave room for some hallucinations.

Claude code by itself is really good, I'd try with some neutral interface probably

"github.com/mark3labs/mcp-go/mcp"
)

func GetTraceByIdTool() mcp.Tool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the model getting traceID? Search traces tool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, either the user knows the trace id and wants to query it directly, or the LLM should pick it up from the output of the search traces tool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack! I think this will trip the model up a bit given amount of traces. (Need to figure out context mgmt for this data, pending problems on metrics side too)

"github.com/mark3labs/mcp-go/mcp"
)

func SearchTracesTool() mcp.Tool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe we have some actual guardrails in place?

The goal of guardrails for MCPs like this would be to not only not make super wide queries on the o11y backend, but also to force it to make queries more accurate to lead the user to a correct answer


func NewMCPServer(opts ObsMCPOptions) (*server.MCPServer, error) {
hooks := &server.Hooks{}
hooks.AddBeforeCallTool(func(ctx context.Context, id any, message *mcp.CallToolRequest) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh TIL! That's actually really neat for some instrumentation!

Comment on lines +157 to +161
mcpServer.AddTool(tempo.ListInstancesTool(), tempoToolset.ListInstancesHandler)
mcpServer.AddTool(tempo.GetTraceByIdTool(), tempoToolset.GetTraceByIdHandler)
mcpServer.AddTool(tempo.SearchTracesTool(), tempoToolset.SearchTracesHandler)
mcpServer.AddTool(tempo.SearchTagsTool(), tempoToolset.SearchTagsHandler)
mcpServer.AddTool(tempo.SearchTagValuesTool(), tempoToolset.SearchTagValuesHandler)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm wondering if these tools can/should be behind a flag?

Ideally user has all o11y backends set up and receiving data, but Prometheus is still the default, always available thing at the moment. Even if a user has tempo, not all of their applications might be sending traces.

So it would be cool to give the user an option to pick the toolset to enable? Wdyt @andreasgerstmayr @iNecas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. We can add a CLI arg to enable/disable toolsets like kubernetes-mcp-server does.

@@ -0,0 +1,34 @@
// source: https://github.com/grafana/tempo-operator/blob/4ec7e8b4cb102f52de1e101174f4490b1505eaa1/internal/manifests/naming/sanitize.go
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be cautious of referencing AGPL code this way 😅
There are alternatives to this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I missed this. I'll rewrite this code snippet.

Copy link
Member

@slashpai slashpai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits, also unit tests and e2e tests for this would be great :)
I think you should be able to use manifests added in hack/tempo to setup, for prometheus and alertmanager we are using kube-prometheus not sure if something like that exists for tempo

Co-authored-by: Jayapriya Pai <slashpai9@gmail.com>
@openshift-ci
Copy link

openshift-ci bot commented Feb 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andreasgerstmayr
Once this PR has been reviewed and has the lgtm label, please assign saswatamcode for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreasgerstmayr
Copy link
Contributor Author

Thank you both for the reviews!
I'll work on the evals (somehow all (existing) evals are currently failing for me, need to fix something in my local env I guess), unit/e2e tests and the other improvements next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants