feat: improve mcp debugging by adding a steering prompt and api hooks#995
leongdl wants to merge 2 commits into aws-deadline:mainline
Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>
    @@ -0,0 +1,277 @@
    # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

    """
I don't really like the file name - open to change.
It's a _ private file, so maybe more of a two-way door.
If it's not supposed to be used outside this folder, keeping it _ is probably fine.
    --log-group-name "/aws/deadline/{farm_id}/{queue_id}" \
    --log-stream-name "{session_id}" \
    --start-from-head \
    --limit 100 \
Nit: Any reason to stop at 100 logs?
Mostly to save context. The MCP prompt tells the assistant to keep checking until it finds the problem.
Otherwise a long log would overflow the context.
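The trade-off above (read 100 events at a time, continue only until the problem is found) can be sketched with the CloudWatch Logs `get_log_events` call. This is a minimal sketch, not the code in this PR: `find_first_match`, the `needle` parameter, and the `max_pages` cap are hypothetical names, and `logs_client` is assumed to be a boto3 `logs` client.

```python
def find_first_match(logs_client, log_group, log_stream, needle,
                     page_size=100, max_pages=50):
    """Scan a log stream from the head, one page at a time.

    Returns the first message containing `needle`, or None if the stream
    ends (or `max_pages` pages were read) without a match.
    """
    kwargs = {
        "logGroupName": log_group,
        "logStreamName": log_stream,
        "startFromHead": True,   # same intent as --start-from-head
        "limit": page_size,      # same intent as --limit 100
    }
    for _ in range(max_pages):
        response = logs_client.get_log_events(**kwargs)
        for event in response["events"]:
            if needle in event["message"]:
                return event["message"]
        next_token = response.get("nextForwardToken")
        # CloudWatch Logs signals the end of the stream by returning the
        # same forward token that was passed in.
        if next_token is None or next_token == kwargs.get("nextToken"):
            return None
        kwargs["nextToken"] = next_token
    return None
```

Stopping at the first match keeps only one page of events in the assistant's context at a time, which is the point of the 100-event limit.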
    ## Key Concepts

    - **Farm**: Top-level resource containing queues and fleets
    - **Queue**: Where jobs are submitted and scheduled
Nit: What about Fleet and Worker? I can imagine someone submitting Windows jobs to a Linux fleet by accident, and having a bit of context on those could help.
Good idea, and something we can add to extend this PR, but at the moment I have not added worker APIs here.
We'll need to add more use cases to this tool, such as render-wrangler-style usage.
    while "nextToken" in response:
        if max_results is not None and len(result[list_property_name]) >= max_results:
            result[list_property_name] = result[list_property_name][:max_results]
            break
        response = list_api(nextToken=response["nextToken"], **kwargs)
        result[list_property_name].extend(response[list_property_name])

    if max_results is not None:
We should already support max results as a parameter passed into the list APIs, for example: https://docs.aws.amazon.com/deadline-cloud/latest/APIReference/API_ListFarms.html
So do we need this extra handling? We can just keep iterating whenever a nextToken is returned.
Good point, let me fix that.
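A minimal sketch of the simplification suggested above, not the code that landed in this PR: `list_all_pages` is a hypothetical name, and it assumes `list_api` accepts a `nextToken` keyword and returns one only while more pages exist. Any `maxResults` in `kwargs` is passed straight through to the API.

```python
def list_all_pages(list_api, list_property_name, **kwargs):
    """Call a paginated list API until no nextToken is returned.

    Returns a dict with the merged items under `list_property_name`.
    """
    response = list_api(**kwargs)
    items = list(response[list_property_name])
    # Keep following the token; no client-side truncation is needed.
    while response.get("nextToken"):
        response = list_api(nextToken=response["nextToken"], **kwargs)
        items.extend(response[list_property_name])
    return {list_property_name: items}
```

One design point worth checking against the API reference: if `maxResults` bounds each page rather than the overall total, this loop still returns every item, so a caller that wants a hard cap on the total would need to slice the result.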
    if not farm_id:
        farm_id = config_file.get_setting("defaults.farm_id", config=config)
        if not farm_id:
            raise ValueError("farm_id is required (not found in config defaults)")
Micro-nit: can this be refactored to

    farm_id = farm_id or config_file.get_setting("defaults.farm_id", config=config)
    if not farm_id:
        raise ValueError("farm_id is required (not found in config defaults)")

Similar simplifications apply to the rest of the code.
Sounds good - refactoring.
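Since the same pattern would repeat for other IDs, one way to apply the refactor everywhere is a small helper. This is a sketch under assumptions: `resolve_id` and its `get_setting` parameter are hypothetical, with `get_setting` standing in for a callable that maps a setting name to a value or an empty string.

```python
def resolve_id(value, setting_name, get_setting):
    """Return `value` if set, else the configured default for `setting_name`.

    Raises ValueError when neither an explicit value nor a default exists.
    """
    value = value or get_setting(f"defaults.{setting_name}")
    if not value:
        raise ValueError(f"{setting_name} is required (not found in config defaults)")
    return value


# Hypothetical wiring, assuming the config_file.get_setting call shown in the diff:
#   from functools import partial
#   getter = partial(config_file.get_setting, config=config)
#   farm_id = resolve_id(farm_id, "farm_id", getter)
```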
Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>



Fixes: N/A - Enhancement
What was the problem/requirement? (What/Why)
The MCP server needed tools to help AI assistants debug failed Deadline Cloud jobs. Users need to:
Previously, the MCP server only had basic listing tools (farms, queues, jobs, fleets) but lacked the diagnostic primitives needed for effective troubleshooting.
What was the solution? (How)
Added six new primitive diagnostic tools to the MCP server:
- search_jobs - Find jobs by status (FAILED, SUCCEEDED, etc.) and name
- get_job - Get detailed job information including task counts
- list_steps - List all steps in a job with their status
- list_tasks - List all tasks in a step with their status
- list_sessions - List all sessions for a job
- get_session - Get session details including log configuration
Enhanced the MCP server's INSTRUCTIONS with a complete debugging workflow that guides AI assistants through:
Updated the design document to reflect the primitive-only approach (removed the composite diagnose_failed_job tool that was initially planned but not implemented).
What is the impact of this change?
AI assistants using the MCP server can now:
This enables hands-free job debugging through conversational AI interfaces like Kiro.
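To illustrate how an assistant might chain the primitives named in the solution section, here is a rough sketch. Everything about the call shapes is an assumption: the `tools` object, the keyword arguments, and the response keys (`jobs`, `steps`, `tasks`, `sessions`, `log`, `runStatus`) are hypothetical stand-ins, not the signatures defined in this PR. `get_job` is omitted for brevity.

```python
def collect_failed_job_logs(tools, farm_id, queue_id):
    """Walk failed jobs -> steps -> tasks -> sessions and gather log configs.

    `tools` is any object exposing the primitives as methods (hypothetical
    signatures; adapt to the real tool definitions).
    """
    report = []
    jobs = tools.search_jobs(farm_id=farm_id, queue_id=queue_id, status="FAILED")["jobs"]
    for job in jobs:
        ids = {"farm_id": farm_id, "queue_id": queue_id, "job_id": job["jobId"]}

        # Find which tasks failed, step by step.
        failed_tasks = []
        for step in tools.list_steps(**ids)["steps"]:
            tasks = tools.list_tasks(step_id=step["stepId"], **ids)["tasks"]
            failed_tasks += [t["taskId"] for t in tasks if t["runStatus"] == "FAILED"]

        # Collect the log configuration of every session for the job.
        log_configs = []
        for session in tools.list_sessions(**ids)["sessions"]:
            detail = tools.get_session(session_id=session["sessionId"], **ids)
            log_configs.append(detail["log"])

        report.append({"jobId": job["jobId"], "failedTasks": failed_tasks, "logs": log_configs})
    return report
```

Keeping each step a separate primitive matches the primitive-only approach described above: the assistant decides how far to drill down, rather than a composite tool doing it in one call.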
How was this change tested?
Manual Testing via MCP:
--start-from-head
nuke=14.* not available in channel
Example interaction:
Validation:
- hatch run lint - all checks passed
- hatch run fmt - code properly formatted
Was this change documented?
- _diagnostics.py
- INSTRUCTIONS updated with complete debugging workflow
- docs/design/mcp-job-diagnostics.md updated to reflect primitive-only approach
- docs/mcp_guide.md)
Does this PR introduce new dependencies?
Is this a breaking change?
No. This PR only adds new functionality:
- _diagnostics.py (not part of public API)
Does this change impact security?
No security impact:
- get_boto3_client
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.