
feat: improve mcp debugging by adding a steering prompt and api hooks#995

Open
leongdl wants to merge 2 commits into aws-deadline:mainline from leongdl:mcp-debugging

Conversation

leongdl (Contributor) commented Feb 5, 2026

Fixes: N/A - Enhancement

What was the problem/requirement? (What/Why)

The MCP server needed tools to help AI assistants debug failed Deadline Cloud jobs. Users need to:

  1. Find failed jobs in their queues
  2. Identify which steps and tasks failed
  3. Retrieve session logs to diagnose root causes

Previously, the MCP server only had basic listing tools (farms, queues, jobs, fleets) but lacked the diagnostic primitives needed for effective troubleshooting.

What was the solution? (How)

Added six new primitive diagnostic tools to the MCP server:

  • search_jobs - Find jobs by status (FAILED, SUCCEEDED, etc.) and name
  • get_job - Get detailed job information including task counts
  • list_steps - List all steps in a job with their status
  • list_tasks - List all tasks in a step with their status
  • list_sessions - List all sessions for a job
  • get_session - Get session details including log configuration
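To make the shape of these primitives concrete, here is a minimal sketch of a search_jobs-style handler. This is illustrative only: the stub client and field names stand in for the real Deadline Cloud SearchJobs API, and the signature is an assumption, not the tool's actual interface.

```python
# Hypothetical sketch of a search_jobs-style primitive: filter jobs by
# task-run status and an optional name substring. The stub client stands
# in for the real Deadline Cloud API so the filtering logic is visible.

def search_jobs(client, farm_id, queue_id, task_run_status=None, name_contains=None):
    """Return jobs in a queue, optionally filtered by status and name."""
    jobs = client.list_jobs(farmId=farm_id, queueId=queue_id)["jobs"]
    if task_run_status is not None:
        jobs = [j for j in jobs if j["taskRunStatus"] == task_run_status]
    if name_contains is not None:
        jobs = [j for j in jobs if name_contains.lower() in j["name"].lower()]
    return jobs


class StubClient:
    """Fake client returning two jobs, one failed and one succeeded."""

    def list_jobs(self, **kwargs):
        return {"jobs": [
            {"jobId": "job-1", "name": "Nuke render", "taskRunStatus": "FAILED"},
            {"jobId": "job-2", "name": "Comp pass", "taskRunStatus": "SUCCEEDED"},
        ]}


failed = search_jobs(StubClient(), "farm-1", "queue-1", task_run_status="FAILED")
print([j["jobId"] for j in failed])  # ['job-1']
```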

Enhanced the MCP server's INSTRUCTIONS with a complete debugging workflow that guides AI assistants through:

  1. Searching for failed jobs
  2. Getting job details
  3. Listing steps to find failures
  4. Listing tasks in failed steps
  5. Getting sessions
  6. Retrieving logs (with AWS CLI fallback when MCP returns empty)
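The chained workflow above can be sketched as a single function walking from a failed job down to its failed step and task. The client below is a stub, and the method names and return shapes are assumptions modeled on the tool list, not the real API.

```python
# Illustrative sketch of the guided debugging workflow, chaining the
# primitive tools in order: search for failed jobs, list steps, then
# list tasks in the failed step. All names here are hypothetical.

def find_first_failure(client, farm_id, queue_id):
    """Walk from a failed job down to its first failed step and task."""
    jobs = client.search_jobs(farm_id, queue_id, status="FAILED")
    if not jobs:
        return None
    job = jobs[0]
    failed_steps = [s for s in client.list_steps(job["jobId"])
                    if s["taskRunStatus"] == "FAILED"]
    failed_tasks = [t for t in client.list_tasks(job["jobId"], failed_steps[0]["stepId"])
                    if t["runStatus"] == "FAILED"]
    return {"job": job["jobId"],
            "step": failed_steps[0]["stepId"],
            "task": failed_tasks[0]["taskId"]}


class StubClient:
    """Fake client with one failed job containing one failed step/task."""

    def search_jobs(self, farm_id, queue_id, status):
        return [{"jobId": "job-1"}]

    def list_steps(self, job_id):
        return [{"stepId": "step-ok", "taskRunStatus": "SUCCEEDED"},
                {"stepId": "step-bad", "taskRunStatus": "FAILED"}]

    def list_tasks(self, job_id, step_id):
        return [{"taskId": "task-7", "runStatus": "FAILED"}]


print(find_first_failure(StubClient(), "farm-1", "queue-1"))
# {'job': 'job-1', 'step': 'step-bad', 'task': 'task-7'}
```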

Updated the design document to reflect the primitive-only approach (removed the composite diagnose_failed_job tool that was initially planned but not implemented).

What is the impact of this change?

AI assistants using the MCP server can now:

  • Systematically debug failed Deadline Cloud jobs
  • Follow a guided workflow from job discovery to log analysis
  • Fall back to AWS CLI when CloudWatch logs aren't immediately available via MCP

This enables hands-free job debugging through conversational AI interfaces like Kiro.

How was this change tested?

Manual Testing via MCP:

  • Listed farms and queues successfully
  • Searched for failed jobs and found multiple results
  • Retrieved job details showing task failure counts
  • Listed 130+ sessions for a failed job
  • Attempted log retrieval (found logs were empty via MCP)
  • Successfully retrieved logs using AWS CLI fallback with --start-from-head
  • Diagnosed root cause: Conda package nuke=14.* not available in channel

Example interaction:

MCP Interaction Summary
Goal: Debug a failed Deadline Cloud job using the MCP server's diagnostic tools.

Workflow:

1. Listed farms and queues
   • Used deadline_list_farms to find the farm ID
   • Used deadline_list_queues to find queue IDs
2. Searched for failed jobs
   • Used deadline_search_jobs with task_run_status="FAILED"
   • Found job job-2542af1188b0443881b87f02ab494eaf (Nuke render job with 21 failed tasks)
3. Listed sessions
   • Used deadline_list_sessions to get all sessions for the failed job
   • Found 130+ sessions (many short-lived failures)
4. Attempted to get logs via MCP
   • Used deadline_get_session_logs on multiple sessions
   • All returned empty results (logs may have been at different stream positions or expired)
5. Fell back to the AWS CLI
   • Used aws logs get-log-events with the --start-from-head flag
   • Successfully retrieved logs from session session-74bedd83d99b4998bac49b6754ed8cdf
6. Found the root cause
   • Error: PackagesNotFoundError: The following packages are not available from current channels: - nuke=14*

Explanation: The job tried to create a Conda environment with nuke=14.* from the deadline-cloud channel, but that package doesn't exist in the channel. The job parameters specified both Conda packages and Docker images, suggesting a configuration mismatch.

Key Learnings:

  • MCP tools work well for discovery - finding farms, queues, jobs, and sessions was straightforward
  • Log retrieval needs a fallback - when deadline_get_session_logs returns empty, use the AWS CLI with --start-from-head
  • Updated documentation - added AWS CLI fallback instructions to the server.py INSTRUCTIONS

The MCP server now teaches AI assistants the complete debugging workflow using primitive tools, with clear guidance on using AWS CLI when logs aren't immediately available.
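For reference, the AWS CLI fallback can also be expressed with the CloudWatch Logs GetLogEvents semantics: read from the head of the stream and page forward until the nextForwardToken stops changing (CloudWatch returns the same token once the stream is exhausted). The stub below stands in for boto3's logs client; the log group/stream naming follows the pattern quoted in this PR, but the helper itself is a sketch, not code from the change.

```python
# Sketch of log retrieval with GetLogEvents paging. startFromHead=True
# mirrors the --start-from-head CLI flag; paging stops when CloudWatch
# returns the same nextForwardToken twice in a row.

def read_session_log(logs_client, farm_id, queue_id, session_id, limit=100):
    group = f"/aws/deadline/{farm_id}/{queue_id}"
    events, token = [], None
    while True:
        kwargs = {"logGroupName": group, "logStreamName": session_id,
                  "startFromHead": True, "limit": limit}
        if token is not None:
            kwargs["nextToken"] = token
        resp = logs_client.get_log_events(**kwargs)
        events.extend(e["message"] for e in resp["events"])
        if resp["nextForwardToken"] == token:
            break  # token repeated: end of stream
        token = resp["nextForwardToken"]
    return events


class StubLogs:
    """Two pages of events, then a repeated token signalling the end."""

    def __init__(self):
        self._pages = {None: (["line 1", "line 2"], "t1"),
                       "t1": (["line 3"], "t2"),
                       "t2": ([], "t2")}

    def get_log_events(self, **kwargs):
        msgs, nxt = self._pages[kwargs.get("nextToken")]
        return {"events": [{"message": m} for m in msgs],
                "nextForwardToken": nxt}


print(read_session_log(StubLogs(), "farm-1", "queue-1", "session-x"))
# ['line 1', 'line 2', 'line 3']
```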

Validation:

  • Ran hatch run lint - all checks passed
  • Ran hatch run fmt - code properly formatted
  • Verified MCP server restarts and loads new instructions

Was this change documented?

  • Relevant docstrings added to all new functions in _diagnostics.py
  • MCP server INSTRUCTIONS updated with complete debugging workflow
  • Design document docs/design/mcp-job-diagnostics.md updated to reflect primitive-only approach
  • README.md - No changes needed (MCP server usage is documented in docs/mcp_guide.md)

Does this PR introduce new dependencies?

  • This PR adds one or more new dependency Python packages. I acknowledge I have reviewed the considerations for adding dependencies in DEVELOPMENT.md.
  • This PR does not add any new dependencies.

Is this a breaking change?

No. This PR only adds new functionality:

  • New functions in private module _diagnostics.py (not part of public API)
  • New MCP tools registered in the server
  • No changes to existing public APIs or CLI commands

Does this change impact security?

No security impact:

  • Uses existing authentication via get_boto3_client
  • No new file system operations
  • No new network endpoints
  • Only reads data from AWS APIs (no writes)
  • CloudWatch log access uses existing queue credentials

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions bot added the waiting-on-maintainers (Waiting on the maintainers to review.) label on Feb 5, 2026
@leongdl force-pushed the mcp-debugging branch 3 times, most recently from caa8f00 to 3eea5ea on February 6, 2026 03:02
Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>
@leongdl leongdl marked this pull request as ready for review February 6, 2026 03:40
@leongdl leongdl requested a review from a team as a code owner February 6, 2026 03:40
@leongdl leongdl changed the title feat: improve mcp debugging feat: improve mcp debugging by adding a steering prompt and api hooks Feb 6, 2026
@@ -0,0 +1,277 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

"""
Contributor Author:

I don't really like the file name - open to change.

Contributor Author:

Its an _ private file, so maybe more of a two way door.

Contributor:

If its not supposed to be used outside this folder, keeping it _ is probably fine

Contributor Author:

ok :)

--log-group-name "/aws/deadline/{farm_id}/{queue_id}" \
--log-stream-name "{session_id}" \
--start-from-head \
--limit 100 \
Contributor:

Nit: Any reason to stop at 100 logs?

Contributor Author:

Mostly context saving. The MCP prompt is set to keep checking until we find the problem.

Otherwise a LONG log will overflow the context.

## Key Concepts

- **Farm**: Top-level resource containing queues and fleets
- **Queue**: Where jobs are submitted and scheduled
Contributor:

Nit: What about Fleet and Worker? I can imagine someone accidentally submitting Windows jobs to a Linux fleet, and having just a bit of context on those could help.

Contributor Author:

Good idea; that's something we can add to extend this PR, but at this moment I did not add worker APIs here.

We'll need to add more use cases, like the render-wrangler type of usage, to this tool.

Comment on lines 31 to 38
while "nextToken" in response:
if max_results is not None and len(result[list_property_name]) >= max_results:
result[list_property_name] = result[list_property_name][:max_results]
break
response = list_api(nextToken=response["nextToken"], **kwargs)
result[list_property_name].extend(response[list_property_name])

if max_results is not None:
@viknith (Contributor) commented Feb 6, 2026:

We already should support Max results as a parameter passed into list APIs, for example: https://docs.aws.amazon.com/deadline-cloud/latest/APIReference/API_ListFarms.html

So do we need this extra handling? We can just iterate whenever a nextToken has been returned.

Contributor Author:

Good point, let me fix that.
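Following the reviewer's suggestion, the pagination helper could pass max_results through to the API's maxResults parameter and iterate only while a nextToken is returned. This is a sketch under assumptions (collect_all and the stub are illustrative, not the PR's code); note that AWS list APIs treat maxResults as a per-page cap, so a hard total cap may still need a client-side check.

```python
# Sketch of the suggested refactor: let the service cap page size via
# maxResults and simply follow nextToken until it disappears.

def collect_all(list_api, list_property_name, max_results=None, **kwargs):
    """Paginate a list API, delegating page-size limits to the service."""
    if max_results is not None:
        kwargs["maxResults"] = max_results
    response = list_api(**kwargs)
    items = list(response[list_property_name])
    while "nextToken" in response:
        response = list_api(nextToken=response["nextToken"], **kwargs)
        items.extend(response[list_property_name])
    return items


def make_stub(pages):
    """Fake list API that serves pre-built response pages in order."""
    it = iter(pages)

    def list_api(**kwargs):
        return next(it)

    return list_api


stub = make_stub([
    {"farms": [{"farmId": "farm-1"}], "nextToken": "n1"},
    {"farms": [{"farmId": "farm-2"}]},
])
print([f["farmId"] for f in collect_all(stub, "farms")])  # ['farm-1', 'farm-2']
```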

Comment on lines 220 to 223
if not farm_id:
farm_id = config_file.get_setting("defaults.farm_id", config=config)
if not farm_id:
raise ValueError("farm_id is required (not found in config defaults)")
@viknith (Contributor) commented Feb 6, 2026:

Micro-nit: can this be refactored to

farm_id = farm_id or config_file.get_setting("defaults.farm_id", config=config)
if not farm_id:
    raise ValueError("farm_id is required (not found in config defaults)")

Similar optimizations for the rest of the code

Contributor Author:

Sounds good - refactoring.

Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>
from . import record_function_latency_telemetry_event

if TYPE_CHECKING:
from mypy_boto3_deadline import DeadlineClient

Code scanning / CodeQL check notice: Unused import. Import of 'DeadlineClient' is not used.
sonarqubecloud bot commented Feb 7, 2026

@leongdl leongdl enabled auto-merge (squash) February 7, 2026 03:24
Labels

waiting-on-maintainers Waiting on the maintainers to review.
