Conversation
saswatamcode left a comment
Thanks a lot! Would love to see some evals for this :)
Some comments/questions in the meantime!
turn_metrics:
  - "custom:tool_eval"

# TODO: gpt-4o-mini has issues creating valid TraceQL queries.
I think the purpose of this MCP server would be to make constructing o11y queries more accurate!
Would love to see evals of this enabled alongside Prometheus as well!
I included the TraceQL parser and markdown docs from Tempo (as a Go package dependency), but then I realized they are AGPL-licensed, and embedding them in a statically linked Apache-2.0 program would violate the AGPL, so I removed them again before opening the PR.
Do you have suggestions on this? I guess I need to write my own version of the TraceQL docs? :/
I think you can include some common examples of TraceQL and that might work well
@@ -0,0 +1,185 @@
package tempo
Maybe we can follow the Loader interface pattern that we already do for Prometheus/AM?
func SearchTracesTool() mcp.Tool {
	return mcp.NewTool(
		"tempo_search_traces",
		mcp.WithDescription("Search for traces in Tempo"),
Maybe ideally we have more descriptive tool descriptions?
	"github.com/mark3labs/mcp-go/mcp"
)

func ListInstancesTool() mcp.Tool {
I'm probably missing something. How is the LLM going to decide which instance to pick?
The LLM should ask the user.
Example from Claude Code:
❯ show me all traces with errors
● I'll help you find traces with errors. First, let me list the available Tempo instances.
● obs-mcp - tempo_list_instances (MCP)
⎿ {
"instances": [
{
… +21 lines (ctrl+o to expand)
● I found 2 Tempo instances available:
┌──────────┬─────────────────┬────────────────────┬────────┐
│ Instance │ Namespace │ Tenants │ Status │
├──────────┼─────────────────┼────────────────────┼────────┤
│ tempo1 │ obs-mcp-tracing │ project1, project2 │ Ready │
├──────────┼─────────────────┼────────────────────┼────────┤
│ tempo2 │ obs-mcp-tracing │ stage, prod │ Ready │
└──────────┴─────────────────┴────────────────────┴────────┘
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
☐ Instance
Which Tempo instance and tenant would you like to search for error traces?
❯ 1. tempo1 / project1
Query tempo1 instance with project1 tenant
2. tempo1 / project2
Query tempo1 instance with project2 tenant
3. tempo2 / stage
Query tempo2 instance with stage tenant
4. tempo2 / prod
Query tempo2 instance with prod tenant
5. Type something.
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
6. Chat about this
Oh! Well I'm not sure we have a selector UI for the Aladdin/OLS that this will eventually plug into
That selector just popped up now (maybe it was a Claude Code update), previously it was a regular text question where I could reply with a text response.
---

## `get_current_time`
Does this need to be its own tool?
I think it might just be easier to tell the model to use a "NOW" keyword for time params, which you can parse into a proper timestamp in the tool itself.
(I think we already do this for the Prometheus tool)
Ah, yeah, I saw the NOW keyword and prometheus.ParseTimestamp() only after implementing this tool.
In my experience it worked fine once the LLM knew the current time (from this tool). What's the advantage of telling the LLM to use the NOW keyword and parsing NOW+1h etc.? Afaics it can calculate the relative time fine and pass it as RFC3339 to the tool.
Or am I spoiled with Claude Code and Opus 4.5, and other models don't work well?
Just a few more wasted tokens IMO if the model has to call one more tool to get the time on every query, when it could just use NOW.
Also, it might leave room for some hallucinations.
Claude Code by itself is really good; I'd try with some more neutral interface probably.
	"github.com/mark3labs/mcp-go/mcp"
)

func GetTraceByIdTool() mcp.Tool {
How is the model getting traceID? Search traces tool?
Yes, either the user knows the trace id and wants to query it directly, or the LLM should pick it up from the output of the search traces tool.
Ack! I think this will trip the model up a bit given the amount of traces. (Need to figure out context mgmt for this data; pending problems on the metrics side too.)
	"github.com/mark3labs/mcp-go/mcp"
)

func SearchTracesTool() mcp.Tool {
Also, maybe we should have some actual guardrails in place?
The goal of guardrails for MCPs like this would be not only to avoid super wide queries on the o11y backend, but also to force more accurate queries that lead the user to a correct answer.
func NewMCPServer(opts ObsMCPOptions) (*server.MCPServer, error) {
	hooks := &server.Hooks{}
	hooks.AddBeforeCallTool(func(ctx context.Context, id any, message *mcp.CallToolRequest) {
Oh TIL! That's actually really neat for some instrumentation!
mcpServer.AddTool(tempo.ListInstancesTool(), tempoToolset.ListInstancesHandler)
mcpServer.AddTool(tempo.GetTraceByIdTool(), tempoToolset.GetTraceByIdHandler)
mcpServer.AddTool(tempo.SearchTracesTool(), tempoToolset.SearchTracesHandler)
mcpServer.AddTool(tempo.SearchTagsTool(), tempoToolset.SearchTagsHandler)
mcpServer.AddTool(tempo.SearchTagValuesTool(), tempoToolset.SearchTagValuesHandler)
So I'm wondering if these tools can/should be behind a flag?
Ideally the user has all o11y backends set up and receiving data, but Prometheus is still the default, always-available thing at the moment. Even if a user has Tempo, not all of their applications might be sending traces.
So it would be cool to give the user an option to pick which toolsets to enable. Wdyt @andreasgerstmayr @iNecas?
Yes, good point. We can add a CLI arg to enable/disable toolsets like kubernetes-mcp-server does.
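A sketch of what the CLI side could look like (the `--toolsets` flag name is borrowed from kubernetes-mcp-server; everything else is an assumption):

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// parseToolsets turns a comma-separated --toolsets value into a set of
// enabled toolset names.
func parseToolsets(v string) map[string]bool {
	enabled := map[string]bool{}
	for _, t := range strings.Split(v, ",") {
		if t = strings.TrimSpace(t); t != "" {
			enabled[t] = true
		}
	}
	return enabled
}

func main() {
	toolsets := flag.String("toolsets", "prometheus",
		"comma-separated toolsets to enable (e.g. prometheus,tempo)")
	flag.Parse()

	enabled := parseToolsets(*toolsets)
	if enabled["tempo"] {
		fmt.Println("registering Tempo tools")
		// mcpServer.AddTool(tempo.SearchTracesTool(), ...) etc. would go here
	}
}
```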
pkg/tempo/discovery/sanitize.go (Outdated)
@@ -0,0 +1,34 @@
// source: https://github.com/grafana/tempo-operator/blob/4ec7e8b4cb102f52de1e101174f4490b1505eaa1/internal/manifests/naming/sanitize.go
I'd be cautious of referencing AGPL code this way 😅
There are alternatives to this.
Oh, sorry, I missed this. I'll rewrite this code snippet.
slashpai left a comment
A few nits, also unit tests and e2e tests for this would be great :)
I think you should be able to use the manifests added in hack/tempo for the setup. For Prometheus and Alertmanager we are using kube-prometheus; not sure if something like that exists for Tempo.
Co-authored-by: Jayapriya Pai <slashpai9@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andreasgerstmayr. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.

Thank you both for the reviews!
Add Tempo tools:
- tempo_list_instances
- tempo_search_traces
- tempo_get_trace_by_id
- tempo_search_tags
- tempo_search_tag_values