Commit caa8f00

feat: Improve MCP server tool set
Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>
1 parent 5bba469 commit caa8f00

5 files changed: +1078 -0 lines changed

docs/design/mcp-job-diagnostics.md

Lines changed: 272 additions & 0 deletions
@@ -0,0 +1,272 @@
# MCP Job Diagnostics Tools Design

## Overview

Extend the Deadline Cloud MCP server with tools to diagnose failed jobs: find failures, retrieve sessions, and fetch CloudWatch logs.

## Problem Statement

Users need to diagnose failed jobs, but the current MCP tools cannot:
1. Find failed jobs in a queue
2. Get detailed job/session/step/task information
3. Retrieve session logs from CloudWatch

## Debugging Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                       Job Failure Debugging Flow                        │
└─────────────────────────────────────────────────────────────────────────┘

User: "Why did my job fail?"


┌─────────────────────┐
│   search_jobs       │  Find failed jobs in queue
│   (status=FAILED)   │
└─────────────────────┘


┌─────────────────────┐
│  diagnose_failed_job│  One-shot comprehensive diagnosis
└─────────────────────┘

           ├──────────────────────────────────────────┐
           ▼                                          ▼
┌─────────────────────┐                    ┌─────────────────────┐
│  Returns:           │                    │  OR Manual Flow:    │
│  - Job status       │                    │  get_job            │
│  - Failed steps     │                    │          ▼          │
│  - Failed tasks     │                    │  list_steps         │
│  - Session logs     │                    │          ▼          │
│  - Error summary    │                    │  list_tasks         │
└─────────────────────┘                    │          ▼          │
                                           │  list_sessions      │
                                           │          ▼          │
                                           │  get_session        │
                                           └─────────────────────┘
```

## Tools Summary

| Tool | Purpose | Key Inputs | Key Outputs |
|------|---------|------------|-------------|
| `search_jobs` | Find jobs by status/name | queueIds, taskRunStatus | Job list with status |
| `get_job` | Get job details | jobId | Status, task counts |
| `list_steps` | List job steps | jobId | Steps with status |
| `list_tasks` | List step tasks | jobId, stepId | Tasks with runStatus |
| `list_sessions` | List job sessions | jobId | Session summaries |
| `get_session` | Get session details | sessionId | Log config, worker info |
| `diagnose_failed_job` | Full diagnosis | jobId | Job + steps + tasks + logs |

---

## Tier 1: Primitive APIs

### search_jobs

Find jobs in one or more queues, optionally filtered by task run status or name.

```python
from typing import Any, Dict, List, Optional  # names used by the signatures in this document

def search_jobs(
    farm_id: str,
    queue_ids: List[str],
    task_run_status: Optional[str] = None,  # e.g. PENDING|READY|RUNNING|FAILED|SUCCEEDED
    name_contains: Optional[str] = None,
    page_size: Optional[int] = 25,  # 1-100
    item_offset: Optional[int] = 0,  # 0-10000
) -> Dict[str, Any]: ...
```

**Output:**
```json
{"jobs": [...], "totalResults": 10, "nextItemOffset": 25}
```
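
The `nextItemOffset` field in the output above suggests offset-based pagination. The following is a minimal sketch of walking every page, assuming `nextItemOffset` is absent (or `None`) on the last page. `iter_all_jobs` and `fake_search` are hypothetical names used only for illustration; `fake_search` stands in for a call to `search_jobs` with the farm, queues, and filters already bound.

```python
from typing import Any, Callable, Dict, Iterator, List

def iter_all_jobs(search_page: Callable[[int], Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield jobs from every page, following nextItemOffset until it is missing."""
    offset = 0
    while offset is not None:
        page = search_page(offset)
        yield from page.get("jobs", [])
        # Assumption: the last page carries no nextItemOffset.
        offset = page.get("nextItemOffset")

# Hypothetical in-memory stand-in for search_jobs, two jobs per page.
_JOBS: List[Dict[str, Any]] = [{"jobId": f"job-{i}"} for i in range(5)]

def fake_search(item_offset: int, page_size: int = 2) -> Dict[str, Any]:
    page = _JOBS[item_offset:item_offset + page_size]
    end = item_offset + len(page)
    result: Dict[str, Any] = {"jobs": page, "totalResults": len(_JOBS)}
    if end < len(_JOBS):
        result["nextItemOffset"] = end
    return result

all_job_ids = [job["jobId"] for job in iter_all_jobs(fake_search)]
```

Passing the page fetcher in as a callable keeps the loop independent of how credentials and filters are supplied.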

### get_job

Get detailed job information.

```python
def get_job(farm_id: str, queue_id: str, job_id: str) -> Dict[str, Any]: ...
```

**Output:** Job details including name, status, taskRunStatusCounts, timestamps.

### list_steps

List all steps for a job. The registry entry in `src/deadline/_mcp/registry.py` also exposes a `max_results` parameter for this tool.

```python
def list_steps(farm_id: str, queue_id: str, job_id: str) -> Dict[str, Any]: ...
```

**Output:**
```json
{"steps": [{"stepId": "...", "name": "...", "taskRunStatus": "...", "taskRunStatusCounts": {...}}]}
```

### list_tasks

List all tasks for a step. The registry entry in `src/deadline/_mcp/registry.py` also exposes a `max_results` parameter for this tool.

```python
def list_tasks(farm_id: str, queue_id: str, job_id: str, step_id: str) -> Dict[str, Any]: ...
```

**Output:**
```json
{"tasks": [{"taskId": "...", "runStatus": "...", "parameters": {...}}]}
```
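
As a small illustration of consuming the `list_tasks` output shape shown above, the sketch below picks out failed tasks and tallies tasks by `runStatus`. The helper names and the sample response are hypothetical; only the `tasks`, `taskId`, `runStatus`, and `parameters` keys come from the design.

```python
from collections import Counter
from typing import Any, Dict, List

def failed_tasks(list_tasks_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return only the tasks whose runStatus is FAILED."""
    return [
        task
        for task in list_tasks_response.get("tasks", [])
        if task.get("runStatus") == "FAILED"
    ]

def run_status_counts(list_tasks_response: Dict[str, Any]) -> Counter:
    """Count tasks per runStatus; tasks without one are counted as UNKNOWN."""
    return Counter(
        task.get("runStatus", "UNKNOWN")
        for task in list_tasks_response.get("tasks", [])
    )

# Hypothetical response in the documented shape.
sample_response = {
    "tasks": [
        {"taskId": "task-0", "runStatus": "SUCCEEDED", "parameters": {}},
        {"taskId": "task-1", "runStatus": "FAILED", "parameters": {}},
        {"taskId": "task-2", "runStatus": "FAILED", "parameters": {}},
    ]
}
```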

### list_sessions

List all sessions for a job. The registry entry in `src/deadline/_mcp/registry.py` also exposes a `max_results` parameter for this tool.

```python
def list_sessions(farm_id: str, queue_id: str, job_id: str) -> Dict[str, Any]: ...
```

**Output:**
```json
{"sessions": [{"sessionId": "...", "lifecycleStatus": "...", "workerId": "..."}]}
```

### get_session

Get detailed session information.

```python
def get_session(farm_id: str, queue_id: str, job_id: str, session_id: str) -> Dict[str, Any]: ...
```

**Output:** Session details including lifecycleStatus, log configuration, worker info.

---

## Tier 2: Composite Tool

### diagnose_failed_job

One-shot comprehensive failure analysis.

```python
def diagnose_failed_job(
    job_id: str,
    farm_id: Optional[str] = None,  # Uses default if not provided
    queue_id: Optional[str] = None,  # Uses default if not provided
    max_sessions: int = 5,
    max_log_lines: int = 100,
) -> Dict[str, Any]: ...
```

**Workflow:**
1. Get job details and status
2. List steps → identify failed ones
3. List tasks in failed steps → identify failed tasks
4. List sessions → fetch CloudWatch logs

**Output:**
```json
{
  "job": {"jobId": "...", "name": "...", "taskRunStatus": "FAILED", "taskRunStatusCounts": {...}},
  "failed_steps": [{"stepId": "...", "name": "...", "taskRunStatus": "FAILED"}],
  "failed_tasks": [{"stepId": "...", "taskId": "...", "runStatus": "FAILED", "parameters": {...}}],
  "sessions": [{"sessionId": "...", "logs": [{"timestamp": "...", "message": "Error: ..."}]}],
  "summary": {"total_tasks": 100, "failed_tasks": 2, "diagnosis": "Job failed with 2 failed tasks..."}
}
```
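
The four workflow steps can be sketched as plain composition over the Tier 1 primitives. This is an illustration under stated assumptions, not the committed `diagnose_failed_job`: the primitives are passed in as callables with the farm and queue already bound, `fetch_logs` is a hypothetical log reader, and the rule that `max_sessions` truncates the session list is a guess based on the parameter name.

```python
from typing import Any, Callable, Dict, List

def diagnose(
    job_id: str,
    get_job: Callable[[str], Dict[str, Any]],
    list_steps: Callable[[str], Dict[str, Any]],
    list_tasks: Callable[[str, str], Dict[str, Any]],
    list_sessions: Callable[[str], Dict[str, Any]],
    fetch_logs: Callable[[str, int], List[Dict[str, Any]]],
    max_sessions: int = 5,
    max_log_lines: int = 100,
) -> Dict[str, Any]:
    # 1. Job details and status.
    job = get_job(job_id)
    # 2. Steps, keeping only failed ones.
    failed_steps = [
        s for s in list_steps(job_id)["steps"] if s.get("taskRunStatus") == "FAILED"
    ]
    # 3. Tasks in failed steps, keeping only failed ones.
    failed_tasks: List[Dict[str, Any]] = []
    for step in failed_steps:
        for task in list_tasks(job_id, step["stepId"])["tasks"]:
            if task.get("runStatus") == "FAILED":
                failed_tasks.append({"stepId": step["stepId"], **task})
    # 4. Sessions and their logs (assumed: first max_sessions sessions only).
    sessions = [
        {"sessionId": s["sessionId"], "logs": fetch_logs(s["sessionId"], max_log_lines)}
        for s in list_sessions(job_id)["sessions"][:max_sessions]
    ]
    total = sum(job.get("taskRunStatusCounts", {}).values())
    return {
        "job": job,
        "failed_steps": failed_steps,
        "failed_tasks": failed_tasks,
        "sessions": sessions,
        "summary": {
            "total_tasks": total,
            "failed_tasks": len(failed_tasks),
            "diagnosis": f"Job failed with {len(failed_tasks)} failed tasks",
        },
    }
```

Injecting the primitives keeps the composite testable without AWS access, which matches the mock-based unit testing described in the appendix.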

---

## Usage Examples

### Find failed jobs
```
User: "Show me all failed jobs"
Tool: search_jobs(farm_id="farm-xxx", queue_ids=["queue-xxx"], task_run_status="FAILED")
```

### Diagnose a failure
```
User: "Why did job-111 fail?"
Tool: diagnose_failed_job(job_id="job-111")
→ Returns job status, failed tasks, session logs, error summary
```

---

## Appendix

### A. Deadline Cloud APIs Used

| API | Purpose |
|-----|---------|
| `GetJob` | Get job status and task counts |
| `GetSession` | Get log stream configuration |
| `ListSessions` | Find sessions for job |
| `ListSteps` | Find failed steps |
| `ListTasks` | Find failed tasks |
| `SearchJobs` | Filter jobs by status |

### B. CloudWatch Logs

Session logs location:
- Log Group: `/aws/deadline/{farmId}/{queueId}`
- Log Stream: `{sessionId}`
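
A minimal sketch of reading a session's log from that location, assuming a boto3 CloudWatch Logs client is passed in as `logs_client`. `session_log_location` and `fetch_session_logs` are hypothetical helper names; `get_log_events` and its `logGroupName`, `logStreamName`, `limit`, and `startFromHead` arguments are the standard CloudWatch Logs client call.

```python
from typing import Any, Dict, List, Tuple

def session_log_location(farm_id: str, queue_id: str, session_id: str) -> Tuple[str, str]:
    """Build (log group, log stream) following the naming scheme above."""
    return f"/aws/deadline/{farm_id}/{queue_id}", session_id

def fetch_session_logs(
    logs_client: Any,
    farm_id: str,
    queue_id: str,
    session_id: str,
    max_log_lines: int = 100,
) -> List[Dict[str, Any]]:
    """Return up to max_log_lines of the most recent events for one session."""
    group, stream = session_log_location(farm_id, queue_id, session_id)
    response = logs_client.get_log_events(
        logGroupName=group,
        logStreamName=stream,
        limit=max_log_lines,
        startFromHead=False,  # read from the end of the stream
    )
    # Keep only the fields the diagnosis output uses.
    return [
        {"timestamp": event["timestamp"], "message": event["message"]}
        for event in response.get("events", [])
    ]
```

In the real tool the client would presumably come from the queue-credential session mentioned under Security; how that session is obtained is not shown here.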

### C. File Structure

```
src/deadline/_mcp/tools/
├── diagnostics.py      # diagnose_failed_job

src/deadline/client/api/
└── _diagnostics.py     # get_job, get_session, list_sessions, list_steps, list_tasks, search_jobs
```

### D. Registry Configuration

```python
TOOL_REGISTRY = {
    "get_job": {"func": api.get_job, "param_names": ["farm_id", "queue_id", "job_id"]},
    "get_session": {"func": api.get_session, "param_names": ["farm_id", "queue_id", "job_id", "session_id"]},
    "list_sessions": {"func": api.list_sessions, "param_names": ["farm_id", "queue_id", "job_id", "max_results"]},
    "list_steps": {"func": api.list_steps, "param_names": ["farm_id", "queue_id", "job_id", "max_results"]},
    "list_tasks": {"func": api.list_tasks, "param_names": ["farm_id", "queue_id", "job_id", "step_id", "max_results"]},
    "search_jobs": {"func": api.search_jobs, "param_names": ["farm_id", "queue_ids", "task_run_status", "name_contains", "page_size", "item_offset"]},
    "diagnose_failed_job": {"func": diagnostics.diagnose_failed_job, "param_names": None},
}
```
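
One way a registry of this shape could be used for dispatch is sketched below. How the MCP server actually interprets `param_names` is not shown in this commit, so the rule here (a list acts as an allow-list of keyword arguments, `None` forwards everything) is an assumption, and `call_tool`, `_fake_get_job`, and `DEMO_REGISTRY` are hypothetical names.

```python
from typing import Any, Dict

def call_tool(registry: Dict[str, Dict[str, Any]], name: str, arguments: Dict[str, Any]) -> Any:
    """Look up a tool by name and call it with the arguments it declares."""
    entry = registry[name]
    param_names = entry["param_names"]
    if param_names is None:
        # Assumption: None means the function accepts the arguments unchanged.
        kwargs = dict(arguments)
    else:
        # Assumption: the list filters out anything the function does not declare.
        kwargs = {key: arguments[key] for key in param_names if key in arguments}
    return entry["func"](**kwargs)

# Hypothetical stand-in for api.get_job.
def _fake_get_job(farm_id: str, queue_id: str, job_id: str) -> Dict[str, Any]:
    return {"farmId": farm_id, "queueId": queue_id, "jobId": job_id}

DEMO_REGISTRY = {
    "get_job": {"func": _fake_get_job, "param_names": ["farm_id", "queue_id", "job_id"]},
}

result = call_tool(
    DEMO_REGISTRY,
    "get_job",
    {"farm_id": "farm-1", "queue_id": "queue-1", "job_id": "job-111", "unrelated": True},
)
```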

### E. Security

- Uses existing authentication via `get_boto3_client`
- Queue credentials via `get_queue_user_boto3_session` for CloudWatch
- No new IAM permissions required

### F. Testing Strategy

**Unit Tests:**
- Mock boto3 responses for each API
- Test pagination, error handling
- Test `diagnose_failed_job` with various failure scenarios

**Integration Tests:**
- Submit intentionally failing job, then diagnose
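
A unit test in the style described above might look like the following sketch. The `get_job` wrapper here is a hypothetical stand-in that takes the client as an argument (the committed function has a different signature and obtains its client itself); the mock replaces the boto3 Deadline Cloud client, whose `get_job` call takes `farmId`, `queueId`, and `jobId`.

```python
from typing import Any, Dict
from unittest.mock import MagicMock

def get_job(farm_id: str, queue_id: str, job_id: str, client: Any) -> Dict[str, Any]:
    """Hypothetical thin wrapper: forward the IDs to the service client."""
    return client.get_job(farmId=farm_id, queueId=queue_id, jobId=job_id)

def test_get_job_passes_ids_through() -> None:
    client = MagicMock()
    client.get_job.return_value = {"jobId": "job-111", "taskRunStatus": "FAILED"}

    result = get_job("farm-1", "queue-1", "job-111", client)

    # The wrapper should call the API exactly once with camelCase keys.
    client.get_job.assert_called_once_with(
        farmId="farm-1", queueId="queue-1", jobId="job-111"
    )
    assert result["taskRunStatus"] == "FAILED"
```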

### G. Future Enhancements

- Worker diagnostics (`get_worker`, `list_workers`)
- Log filtering/grep capability
- Time-based job search
- Batch diagnostics
- Export to file

### H. References

- [AWS Deadline Cloud API Reference](https://docs.aws.amazon.com/deadline-cloud/latest/APIReference/Welcome.html)
- [SearchJobs API](https://docs.aws.amazon.com/deadline-cloud/latest/APIReference/API_SearchJobs.html)
- [GetSession API](https://docs.aws.amazon.com/deadline-cloud/latest/APIReference/API_GetSession.html)

src/deadline/_mcp/registry.py

Lines changed: 32 additions & 0 deletions
@@ -79,4 +79,36 @@ def get_all_tool_names() -> List[str]:
         "func": job.download_job_output,
         "param_names": None,
     },
+    # Diagnostics - Primitive APIs
+    "get_job": {
+        "func": api.get_job,
+        "param_names": ["farm_id", "queue_id", "job_id"],
+    },
+    "get_session": {
+        "func": api.get_session,
+        "param_names": ["farm_id", "queue_id", "job_id", "session_id"],
+    },
+    "list_sessions": {
+        "func": api.list_sessions,
+        "param_names": ["farm_id", "queue_id", "job_id", "max_results"],
+    },
+    "list_steps": {
+        "func": api.list_steps,
+        "param_names": ["farm_id", "queue_id", "job_id", "max_results"],
+    },
+    "list_tasks": {
+        "func": api.list_tasks,
+        "param_names": ["farm_id", "queue_id", "job_id", "step_id", "max_results"],
+    },
+    "search_jobs": {
+        "func": api.search_jobs,
+        "param_names": [
+            "farm_id",
+            "queue_ids",
+            "task_run_status",
+            "name_contains",
+            "page_size",
+            "item_offset",
+        ],
+    },
 }

src/deadline/client/api/__init__.py

Lines changed: 15 additions & 0 deletions
@@ -53,6 +53,13 @@
     "get_session_logs",
     "SessionLogResult",
     "LogEvent",
+    # Diagnostics APIs
+    "get_job",
+    "get_session",
+    "list_sessions",
+    "list_steps",
+    "list_tasks",
+    "search_jobs",
 ]
 
 # The following import is needed to prevent the following sporadic failure:
@@ -110,6 +117,14 @@
     SessionLogResult,
     LogEvent,
 )
+from ._diagnostics import (
+    get_job,
+    get_session,
+    list_sessions,
+    list_steps,
+    list_tasks,
+    search_jobs,
+)
 
 logger = getLogger(__name__)
 