
feat: improve mcp debugging by adding a steering prompt and api hooks#995

Open
leongdl wants to merge 2 commits into aws-deadline:mainline from leongdl:mcp-debugging

Conversation

leongdl (Contributor) commented Feb 5, 2026

Fixes: N/A - Enhancement

What was the problem/requirement? (What/Why)

The MCP server needed tools to help AI assistants debug failed Deadline Cloud jobs. Users need to:

  1. Find failed jobs in their queues
  2. Identify which steps and tasks failed
  3. Retrieve session logs to diagnose root causes

Previously, the MCP server only had basic listing tools (farms, queues, jobs, fleets) but lacked the diagnostic primitives needed for effective troubleshooting.

What was the solution? (How)

Added six new primitive diagnostic tools to the MCP server:

  • search_jobs - Find jobs by status (FAILED, SUCCEEDED, etc.) and name
  • get_job - Get detailed job information including task counts
  • list_steps - List all steps in a job with their status
  • list_tasks - List all tasks in a step with their status
  • list_sessions - List all sessions for a job
  • get_session - Get session details including log configuration
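To make the shape of these primitives concrete, here is a minimal sketch of a search_jobs-style handler. This is illustrative only: the stub client and field names stand in for the real Deadline Cloud SearchJobs API, and the signature is an assumption, not the tool's actual interface.

```python
# Hypothetical sketch of a search_jobs-style primitive: filter jobs by
# task-run status and an optional name substring. The stub client stands
# in for the real Deadline Cloud API so the filtering logic is visible.

def search_jobs(client, farm_id, queue_id, task_run_status=None, name_contains=None):
    """Return jobs in a queue, optionally filtered by status and name."""
    jobs = client.list_jobs(farmId=farm_id, queueId=queue_id)["jobs"]
    if task_run_status is not None:
        jobs = [j for j in jobs if j["taskRunStatus"] == task_run_status]
    if name_contains is not None:
        jobs = [j for j in jobs if name_contains.lower() in j["name"].lower()]
    return jobs


class StubClient:
    """Fake client returning two jobs, one failed and one succeeded."""

    def list_jobs(self, **kwargs):
        return {"jobs": [
            {"jobId": "job-1", "name": "Nuke render", "taskRunStatus": "FAILED"},
            {"jobId": "job-2", "name": "Comp pass", "taskRunStatus": "SUCCEEDED"},
        ]}


failed = search_jobs(StubClient(), "farm-1", "queue-1", task_run_status="FAILED")
print([j["jobId"] for j in failed])  # ['job-1']
```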

Enhanced the MCP server's INSTRUCTIONS with a complete debugging workflow that guides AI assistants through:

  1. Searching for failed jobs
  2. Getting job details
  3. Listing steps to find failures
  4. Listing tasks in failed steps
  5. Getting sessions
  6. Retrieving logs (with AWS CLI fallback when MCP returns empty)
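The chained workflow above can be sketched as a single function walking from a failed job down to its failed step and task. The client below is a stub, and the method names and return shapes are assumptions modeled on the tool list, not the real API.

```python
# Illustrative sketch of the guided debugging workflow, chaining the
# primitive tools in order: search for failed jobs, list steps, then
# list tasks in the failed step. All names here are hypothetical.

def find_first_failure(client, farm_id, queue_id):
    """Walk from a failed job down to its first failed step and task."""
    jobs = client.search_jobs(farm_id, queue_id, status="FAILED")
    if not jobs:
        return None
    job = jobs[0]
    failed_steps = [s for s in client.list_steps(job["jobId"])
                    if s["taskRunStatus"] == "FAILED"]
    failed_tasks = [t for t in client.list_tasks(job["jobId"], failed_steps[0]["stepId"])
                    if t["runStatus"] == "FAILED"]
    return {"job": job["jobId"],
            "step": failed_steps[0]["stepId"],
            "task": failed_tasks[0]["taskId"]}


class StubClient:
    """Fake client with one failed job containing one failed step/task."""

    def search_jobs(self, farm_id, queue_id, status):
        return [{"jobId": "job-1"}]

    def list_steps(self, job_id):
        return [{"stepId": "step-ok", "taskRunStatus": "SUCCEEDED"},
                {"stepId": "step-bad", "taskRunStatus": "FAILED"}]

    def list_tasks(self, job_id, step_id):
        return [{"taskId": "task-7", "runStatus": "FAILED"}]


print(find_first_failure(StubClient(), "farm-1", "queue-1"))
# {'job': 'job-1', 'step': 'step-bad', 'task': 'task-7'}
```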

Updated the design document to reflect the primitive-only approach (removed the composite diagnose_failed_job tool that was initially planned but not implemented).

What is the impact of this change?

AI assistants using the MCP server can now:

  • Systematically debug failed Deadline Cloud jobs
  • Follow a guided workflow from job discovery to log analysis
  • Fall back to AWS CLI when CloudWatch logs aren't immediately available via MCP

This enables hands-free job debugging through conversational AI interfaces like Kiro.

How was this change tested?

Manual Testing via MCP:

  • Listed farms and queues successfully
  • Searched for failed jobs and found multiple results
  • Retrieved job details showing task failure counts
  • Listed 130+ sessions for a failed job
  • Attempted log retrieval (found logs were empty via MCP)
  • Successfully retrieved logs using AWS CLI fallback with --start-from-head
  • Diagnosed root cause: Conda package nuke=14.* not available in channel

Example interaction:

MCP Interaction Summary
Goal: Debug a failed Deadline Cloud job using the MCP server's diagnostic tools.

Workflow:

1. Listed farms and queues
   • Used deadline_list_farms to find the farm ID
   • Used deadline_list_queues to find queue IDs
2. Searched for failed jobs
   • Used deadline_search_jobs with task_run_status="FAILED"
   • Found job job-2542af1188b0443881b87f02ab494eaf (Nuke render job with 21 failed tasks)
3. Listed sessions
   • Used deadline_list_sessions to get all sessions for the failed job
   • Found 130+ sessions (many short-lived failures)
4. Attempted to get logs via MCP
   • Used deadline_get_session_logs on multiple sessions
   • All returned empty results (logs may have been at different stream positions or expired)
5. Fell back to the AWS CLI
   • Used aws logs get-log-events with the --start-from-head flag
   • Successfully retrieved logs from session session-74bedd83d99b4998bac49b6754ed8cdf
6. Found the root cause
   • Error: PackagesNotFoundError: The following packages are not available from current channels: - nuke=14*

Explanation: The job tried to create a Conda environment with nuke=14.* from the deadline-cloud channel, but that package doesn't exist in the channel. The job parameters specified both Conda packages and Docker images, suggesting a configuration mismatch.

Key Learnings:

  • MCP tools work well for discovery - finding farms, queues, jobs, and sessions was straightforward
  • Log retrieval needs a fallback - when deadline_get_session_logs returns empty, use the AWS CLI with --start-from-head
  • Updated documentation - added AWS CLI fallback instructions to the server.py INSTRUCTIONS

The MCP server now teaches AI assistants the complete debugging workflow using primitive tools, with clear guidance on using AWS CLI when logs aren't immediately available.
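For reference, the AWS CLI fallback can also be expressed with the CloudWatch Logs GetLogEvents semantics: read from the head of the stream and page forward until the nextForwardToken stops changing (CloudWatch returns the same token once the stream is exhausted). The stub below stands in for boto3's logs client; the log group/stream naming follows the pattern quoted in this PR, but the helper itself is a sketch, not code from the change.

```python
# Sketch of log retrieval with GetLogEvents paging. startFromHead=True
# mirrors the --start-from-head CLI flag; paging stops when CloudWatch
# returns the same nextForwardToken twice in a row.

def read_session_log(logs_client, farm_id, queue_id, session_id, limit=100):
    group = f"/aws/deadline/{farm_id}/{queue_id}"
    events, token = [], None
    while True:
        kwargs = {"logGroupName": group, "logStreamName": session_id,
                  "startFromHead": True, "limit": limit}
        if token is not None:
            kwargs["nextToken"] = token
        resp = logs_client.get_log_events(**kwargs)
        events.extend(e["message"] for e in resp["events"])
        if resp["nextForwardToken"] == token:
            break  # token repeated: end of stream
        token = resp["nextForwardToken"]
    return events


class StubLogs:
    """Two pages of events, then a repeated token signalling the end."""

    def __init__(self):
        self._pages = {None: (["line 1", "line 2"], "t1"),
                       "t1": (["line 3"], "t2"),
                       "t2": ([], "t2")}

    def get_log_events(self, **kwargs):
        msgs, nxt = self._pages[kwargs.get("nextToken")]
        return {"events": [{"message": m} for m in msgs],
                "nextForwardToken": nxt}


print(read_session_log(StubLogs(), "farm-1", "queue-1", "session-x"))
# ['line 1', 'line 2', 'line 3']
```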

Validation:

  • Ran hatch run lint - all checks passed
  • Ran hatch run fmt - code properly formatted
  • Verified MCP server restarts and loads new instructions

Was this change documented?

  • Relevant docstrings added to all new functions in _diagnostics.py
  • MCP server INSTRUCTIONS updated with complete debugging workflow
  • Design document docs/design/mcp-job-diagnostics.md updated to reflect primitive-only approach
  • README.md - No changes needed (MCP server usage is documented in docs/mcp_guide.md)

Does this PR introduce new dependencies?

  • This PR adds one or more new dependency Python packages. I acknowledge I have reviewed the considerations for adding dependencies in DEVELOPMENT.md.
  • This PR does not add any new dependencies.

Is this a breaking change?

No. This PR only adds new functionality:

  • New functions in private module _diagnostics.py (not part of public API)
  • New MCP tools registered in the server
  • No changes to existing public APIs or CLI commands

Does this change impact security?

No security impact:

  • Uses existing authentication via get_boto3_client
  • No new file system operations
  • No new network endpoints
  • Only reads data from AWS APIs (no writes)
  • CloudWatch log access uses existing queue credentials

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions bot added the waiting-on-maintainers (Waiting on the maintainers to review.) label on Feb 5, 2026
@leongdl force-pushed the mcp-debugging branch 3 times, most recently from caa8f00 to 3eea5ea on February 6, 2026 03:02
Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>
@leongdl leongdl marked this pull request as ready for review February 6, 2026 03:40
@leongdl leongdl requested a review from a team as a code owner February 6, 2026 03:40
@leongdl leongdl changed the title feat: improve mcp debugging feat: improve mcp debugging by adding a steering prompt and api hooks Feb 6, 2026
@@ -0,0 +1,277 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

"""
Contributor Author:

I don't really like the file name - open to change.

Contributor Author:

Its an _ private file, so maybe more of a two way door.

Contributor:

If its not supposed to be used outside this folder, keeping it _ is probably fine

Contributor Author:

ok :)

--log-group-name "/aws/deadline/{farm_id}/{queue_id}" \
--log-stream-name "{session_id}" \
--start-from-head \
--limit 100 \
Contributor:

Nit: Any reason to stop at 100 logs?

Contributor Author:

Mostly context saving. The MCP prompt is set to keep checking until we find the problem.

Otherwise a LONG log will overflow the context.

## Key Concepts

- **Farm**: Top-level resource containing queues and fleets
- **Queue**: Where jobs are submitted and scheduled
Contributor:

Nit: What about Fleet and Worker? I can imagine someone accidentally submitting Windows jobs to a Linux fleet, and having just a bit of context on those could help.

Contributor Author:

Good idea; that's something we can add to extend this PR, but at this moment I did not add worker APIs here.

We'll need to add more use cases, like the render-wrangler type of usage, to this tool.

Comment on lines 31 to 38
while "nextToken" in response:
if max_results is not None and len(result[list_property_name]) >= max_results:
result[list_property_name] = result[list_property_name][:max_results]
break
response = list_api(nextToken=response["nextToken"], **kwargs)
result[list_property_name].extend(response[list_property_name])

if max_results is not None:
@viknith (Contributor) commented Feb 6, 2026:

We already should support Max results as a parameter passed into list APIs, for example: https://docs.aws.amazon.com/deadline-cloud/latest/APIReference/API_ListFarms.html

So do we need this extra handling? We can just iterate whenever a nextToken has been returned.

Contributor Author:

Good point, let me fix that.
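Following the reviewer's suggestion, the pagination helper could pass max_results through to the API's maxResults parameter and iterate only while a nextToken is returned. This is a sketch under assumptions (collect_all and the stub are illustrative, not the PR's code); note that AWS list APIs treat maxResults as a per-page cap, so a hard total cap may still need a client-side check.

```python
# Sketch of the suggested refactor: let the service cap page size via
# maxResults and simply follow nextToken until it disappears.

def collect_all(list_api, list_property_name, max_results=None, **kwargs):
    """Paginate a list API, delegating page-size limits to the service."""
    if max_results is not None:
        kwargs["maxResults"] = max_results
    response = list_api(**kwargs)
    items = list(response[list_property_name])
    while "nextToken" in response:
        response = list_api(nextToken=response["nextToken"], **kwargs)
        items.extend(response[list_property_name])
    return items


def make_stub(pages):
    """Fake list API that serves pre-built response pages in order."""
    it = iter(pages)

    def list_api(**kwargs):
        return next(it)

    return list_api


stub = make_stub([
    {"farms": [{"farmId": "farm-1"}], "nextToken": "n1"},
    {"farms": [{"farmId": "farm-2"}]},
])
print([f["farmId"] for f in collect_all(stub, "farms")])  # ['farm-1', 'farm-2']
```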

Comment on lines 220 to 223
if not farm_id:
farm_id = config_file.get_setting("defaults.farm_id", config=config)
if not farm_id:
raise ValueError("farm_id is required (not found in config defaults)")
@viknith (Contributor) commented Feb 6, 2026:

Micro-nit: can this be refactored to

farm_id = farm_id or config_file.get_setting("defaults.farm_id", config=config)
if not farm_id:
    raise ValueError("farm_id is required (not found in config defaults)")

Similar optimizations for the rest of the code

Contributor Author:

Sounds good - refactoring.

Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>
from . import record_function_latency_telemetry_event

if TYPE_CHECKING:
from mypy_boto3_deadline import DeadlineClient

Code scanning / CodeQL check notice: Unused import. Import of 'DeadlineClient' is not used.
sonarqubecloud bot commented Feb 7, 2026

@leongdl leongdl enabled auto-merge (squash) February 7, 2026 03:24
Labels

waiting-on-maintainers Waiting on the maintainers to review.
