| 1 | +# MCP Job Diagnostics Tools Design |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Extend the Deadline Cloud MCP server with tools to diagnose failed jobs: find failures, retrieve sessions, and fetch CloudWatch logs. |
| 6 | + |
| 7 | +## Problem Statement |
| 8 | + |
| 9 | +Users need to diagnose failed jobs, but the current MCP tools cannot: |
| 10 | +1. Find failed jobs in a queue |
| 11 | +2. Get detailed job, session, step, and task information |
| 12 | +3. Retrieve session logs from CloudWatch |
| 13 | + |
| 14 | +## Debugging Flow |
| 15 | + |
| 16 | +``` |
| 17 | +┌──────────────────────────────────────────────────────────────┐ |
| 18 | +│                  Job Failure Debugging Flow                  │ |
| 19 | +└──────────────────────────────────────────────────────────────┘ |
| 20 | + |
| 21 | +User: "Why did my job fail?" |
| 22 | +           │ |
| 23 | +           ▼ |
| 24 | +┌─────────────────────┐ |
| 25 | +│ search_jobs         │  Find failed jobs in queue |
| 26 | +│ (status=FAILED)     │ |
| 27 | +└─────────────────────┘ |
| 28 | +           │ |
| 29 | +           ▼ |
| 30 | +┌─────────────────────┐ |
| 31 | +│ diagnose_failed_job │  One-shot comprehensive diagnosis |
| 32 | +└─────────────────────┘ |
| 33 | +           │ |
| 34 | +           ├────────────────────────────────────────┐ |
| 35 | +           ▼                                        ▼ |
| 36 | +┌─────────────────────┐                  ┌─────────────────────┐ |
| 37 | +│ Returns:            │                  │ OR Manual Flow:     │ |
| 38 | +│ - Job status        │                  │ get_job             │ |
| 39 | +│ - Failed steps      │                  │          ▼          │ |
| 40 | +│ - Failed tasks      │                  │ list_steps          │ |
| 41 | +│ - Session logs      │                  │          ▼          │ |
| 42 | +│ - Error summary     │                  │ list_tasks          │ |
| 43 | +└─────────────────────┘                  │          ▼          │ |
| 44 | +                                         │ list_sessions       │ |
| 45 | +                                         │          ▼          │ |
| 46 | +                                         │ get_session         │ |
| 47 | +                                         └─────────────────────┘ |
| 48 | +``` |
| 49 | + |
| 50 | +## Tools Summary |
| 51 | + |
| 52 | +| Tool | Purpose | Key Inputs | Key Outputs | |
| 53 | +|------|---------|------------|-------------| |
| 54 | +| `search_jobs` | Find jobs by status/name | queueIds, taskRunStatus | Job list with status | |
| 55 | +| `get_job` | Get job details | jobId | Status, task counts | |
| 56 | +| `list_steps` | List job steps | jobId | Steps with status | |
| 57 | +| `list_tasks` | List step tasks | jobId, stepId | Tasks with runStatus | |
| 58 | +| `list_sessions` | List job sessions | jobId | Session summaries | |
| 59 | +| `get_session` | Get session details | sessionId | Log config, worker info | |
| 60 | +| `diagnose_failed_job` | Full diagnosis | jobId | Job + steps + tasks + logs | |
| 61 | + |
| 62 | +--- |
| 63 | + |
| 64 | +## Tier 1: Primitive APIs |
| 65 | + |
| 66 | +### search_jobs |
| 67 | + |
| 68 | +Find jobs with optional filters. |
| 69 | + |
| 70 | +```python |
| 71 | +def search_jobs( |
| 72 | +    farm_id: str, |
| 73 | +    queue_ids: List[str], |
| 74 | +    task_run_status: Optional[str] = None,  # e.g. PENDING|READY|RUNNING|FAILED|SUCCEEDED (not exhaustive) |
| 75 | +    name_contains: Optional[str] = None, |
| 76 | +    page_size: Optional[int] = 25,  # 1-100 |
| 77 | +    item_offset: Optional[int] = 0,  # 0-10000 |
| 78 | +) -> Dict[str, Any] |
| 79 | +``` |
| 80 | + |
| 81 | +**Output:** |
| 82 | +```json |
| 83 | +{"jobs": [...], "totalResults": 40, "nextItemOffset": 25} |
| 84 | +``` |
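
Because results are paged through `item_offset` and `nextItemOffset`, a caller that wants every match has to loop. The sketch below shows one way to do that. It takes the search function as a parameter so it stays self-contained; `iter_all_jobs` is a hypothetical helper, not part of the design, and it assumes the last page is marked by a missing or null `nextItemOffset`.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def iter_all_jobs(
    search: Callable[..., Dict[str, Any]],
    farm_id: str,
    queue_ids: List[str],
    task_run_status: Optional[str] = None,
    page_size: int = 25,
) -> Iterator[Dict[str, Any]]:
    """Yield every matching job by following nextItemOffset page by page."""
    offset: Optional[int] = 0
    while offset is not None:
        page = search(
            farm_id=farm_id,
            queue_ids=queue_ids,
            task_run_status=task_run_status,
            page_size=page_size,
            item_offset=offset,
        )
        yield from page.get("jobs", [])
        # Assumption: a missing or null nextItemOffset marks the last page.
        offset = page.get("nextItemOffset")
```

Passing `search_jobs` as `search` would walk all failed jobs in the given queues without the caller tracking offsets.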
| 85 | + |
| 86 | +### get_job |
| 87 | + |
| 88 | +Get detailed job information. |
| 89 | + |
| 90 | +```python |
| 91 | +def get_job(farm_id: str, queue_id: str, job_id: str) -> Dict[str, Any] |
| 92 | +``` |
| 93 | + |
| 94 | +**Output:** Job details including name, status, taskRunStatusCounts, timestamps. |
| 95 | + |
| 96 | +### list_steps |
| 97 | + |
| 98 | +List all steps for a job. |
| 99 | + |
| 100 | +```python |
| 101 | +def list_steps(farm_id: str, queue_id: str, job_id: str) -> Dict[str, Any] |
| 102 | +``` |
| 103 | + |
| 104 | +**Output:** |
| 105 | +```json |
| 106 | +{"steps": [{"stepId": "...", "name": "...", "taskRunStatus": "...", "taskRunStatusCounts": {...}}]} |
| 107 | +``` |
| 108 | + |
| 109 | +### list_tasks |
| 110 | + |
| 111 | +List all tasks for a step. |
| 112 | + |
| 113 | +```python |
| 114 | +def list_tasks(farm_id: str, queue_id: str, job_id: str, step_id: str) -> Dict[str, Any] |
| 115 | +``` |
| 116 | + |
| 117 | +**Output:** |
| 118 | +```json |
| 119 | +{"tasks": [{"taskId": "...", "runStatus": "...", "parameters": {...}}]} |
| 120 | +``` |
| 121 | + |
| 122 | +### list_sessions |
| 123 | + |
| 124 | +List all sessions for a job. |
| 125 | + |
| 126 | +```python |
| 127 | +def list_sessions(farm_id: str, queue_id: str, job_id: str) -> Dict[str, Any] |
| 128 | +``` |
| 129 | + |
| 130 | +**Output:** |
| 131 | +```json |
| 132 | +{"sessions": [{"sessionId": "...", "lifecycleStatus": "...", "workerId": "..."}]} |
| 133 | +``` |
| 134 | + |
| 135 | +### get_session |
| 136 | + |
| 137 | +Get detailed session information. |
| 138 | + |
| 139 | +```python |
| 140 | +def get_session(farm_id: str, queue_id: str, job_id: str, session_id: str) -> Dict[str, Any] |
| 141 | +``` |
| 142 | + |
| 143 | +**Output:** Session details including lifecycleStatus, log configuration, worker info. |
| 144 | + |
| 145 | +--- |
| 146 | + |
| 147 | +## Tier 2: Composite Tool |
| 148 | + |
| 149 | +### diagnose_failed_job |
| 150 | + |
| 151 | +One-shot comprehensive failure analysis. |
| 152 | + |
| 153 | +```python |
| 154 | +def diagnose_failed_job( |
| 155 | +    job_id: str, |
| 156 | +    farm_id: Optional[str] = None,  # Uses default if not provided |
| 157 | +    queue_id: Optional[str] = None,  # Uses default if not provided |
| 158 | +    max_sessions: int = 5, |
| 159 | +    max_log_lines: int = 100, |
| 160 | +) -> Dict[str, Any] |
| 161 | +``` |
| 162 | + |
| 163 | +**Workflow:** |
| 164 | +1. Get job details and status |
| 165 | +2. List steps → identify failed ones |
| 166 | +3. List tasks in failed steps → identify failed tasks |
| 167 | +4. List sessions (up to `max_sessions`) → fetch CloudWatch logs, capped by `max_log_lines` |
| 168 | + |
| 169 | +**Output:** |
| 170 | +```json |
| 171 | +{ |
| 172 | +  "job": {"jobId": "...", "name": "...", "taskRunStatus": "FAILED", "taskRunStatusCounts": {...}}, |
| 173 | +  "failed_steps": [{"stepId": "...", "name": "...", "taskRunStatus": "FAILED"}], |
| 174 | +  "failed_tasks": [{"stepId": "...", "taskId": "...", "runStatus": "FAILED", "parameters": {...}}], |
| 175 | +  "sessions": [{"sessionId": "...", "logs": [{"timestamp": "...", "message": "Error: ..."}]}], |
| 176 | +  "summary": {"total_tasks": 100, "failed_tasks": 2, "diagnosis": "Job failed with 2 failed tasks..."} |
| 177 | +} |
| 178 | +``` |
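
The four workflow steps can be sketched as a composition of the Tier 1 primitives. This is a minimal illustration under stated assumptions, not the proposed implementation: `diagnose_sketch` is a hypothetical name, `api` is any object exposing the Tier 1 functions with the signatures above, and `fetch_logs` is an injected callable that returns log events for a session ID.

```python
from typing import Any, Callable, Dict, List


def diagnose_sketch(
    api: Any,
    fetch_logs: Callable[[str], List[Dict[str, Any]]],
    farm_id: str,
    queue_id: str,
    job_id: str,
    max_sessions: int = 5,
) -> Dict[str, Any]:
    ids = {"farm_id": farm_id, "queue_id": queue_id, "job_id": job_id}

    # 1. Job details and status
    job = api.get_job(**ids)

    # 2. Steps, keeping only the failed ones
    steps = api.list_steps(**ids)["steps"]
    failed_steps = [s for s in steps if s.get("taskRunStatus") == "FAILED"]

    # 3. Tasks in failed steps, keeping only the failed ones
    failed_tasks: List[Dict[str, Any]] = []
    for step in failed_steps:
        tasks = api.list_tasks(step_id=step["stepId"], **ids)["tasks"]
        failed_tasks += [
            {"stepId": step["stepId"], **t}
            for t in tasks
            if t.get("runStatus") == "FAILED"
        ]

    # 4. Sessions (capped at max_sessions) with their logs attached
    sessions = api.list_sessions(**ids)["sessions"][:max_sessions]
    session_logs = [
        {"sessionId": s["sessionId"], "logs": fetch_logs(s["sessionId"])}
        for s in sessions
    ]

    counts = job.get("taskRunStatusCounts", {})
    return {
        "job": job,
        "failed_steps": failed_steps,
        "failed_tasks": failed_tasks,
        "sessions": session_logs,
        "summary": {
            "total_tasks": sum(counts.values()),
            "failed_tasks": len(failed_tasks),
        },
    }
```

Injecting `api` and `fetch_logs` keeps the orchestration testable without AWS access. The sketch omits the free-text `diagnosis` field shown in the output above.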
| 179 | + |
| 180 | +--- |
| 181 | + |
| 182 | +## Usage Examples |
| 183 | + |
| 184 | +### Find failed jobs |
| 185 | +``` |
| 186 | +User: "Show me all failed jobs" |
| 187 | +Tool: search_jobs(farmId="farm-xxx", queueIds=["queue-xxx"], taskRunStatus="FAILED") |
| 188 | +``` |
| 189 | + |
| 190 | +### Diagnose a failure |
| 191 | +``` |
| 192 | +User: "Why did job-111 fail?" |
| 193 | +Tool: diagnose_failed_job(jobId="job-111") |
| 194 | +→ Returns job status, failed tasks, session logs, error summary |
| 195 | +``` |
| 196 | + |
| 197 | +--- |
| 198 | + |
| 199 | +## Appendix |
| 200 | + |
| 201 | +### A. Deadline Cloud APIs Used |
| 202 | + |
| 203 | +| API | Purpose | |
| 204 | +|-----|---------| |
| 205 | +| `GetJob` | Get job status and task counts | |
| 206 | +| `GetSession` | Get log stream configuration | |
| 207 | +| `ListSessions` | Find sessions for job | |
| 208 | +| `ListSteps` | Find failed steps | |
| 209 | +| `ListTasks` | Find failed tasks | |
| 210 | +| `SearchJobs` | Filter jobs by status | |
| 211 | + |
| 212 | +### B. CloudWatch Logs |
| 213 | + |
| 214 | +Session logs location: |
| 215 | +- Log Group: `/aws/deadline/{farmId}/{queueId}` |
| 216 | +- Log Stream: `{sessionId}` |
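
Given that naming convention, fetching the tail of a session log could look roughly like the sketch below. It uses the CloudWatch Logs `get_log_events` call from boto3. The helper names `session_log_location` and `tail_session_log` are hypothetical, and the client is passed in (for example one created from the queue-user session) so the example does not depend on credentials.

```python
from typing import Any, Dict, List


def session_log_location(farm_id: str, queue_id: str, session_id: str) -> Dict[str, str]:
    """Build the log group and stream names from the convention above."""
    return {
        "logGroupName": f"/aws/deadline/{farm_id}/{queue_id}",
        "logStreamName": session_id,
    }


def tail_session_log(
    logs_client: Any,
    farm_id: str,
    queue_id: str,
    session_id: str,
    max_log_lines: int = 100,
) -> List[Dict[str, Any]]:
    """Return up to max_log_lines of the most recent events for a session."""
    response = logs_client.get_log_events(
        **session_log_location(farm_id, queue_id, session_id),
        limit=max_log_lines,
        startFromHead=False,  # read from the end of the stream, not the start
    )
    # Keep only the fields used in the diagnose_failed_job output.
    return [
        {"timestamp": e["timestamp"], "message": e["message"]}
        for e in response["events"]
    ]
```

Reading from the end of the stream is a design choice in this sketch: failure messages usually appear near the end of a session log, so a small `max_log_lines` is still likely to capture them.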
| 217 | + |
| 218 | +### C. File Structure |
| 219 | + |
| 220 | +``` |
| 221 | +src/deadline/_mcp/tools/ |
| 222 | +└── diagnostics.py  # diagnose_failed_job |
| 223 | + |
| 224 | +src/deadline/client/api/ |
| 225 | +├── _list_apis.py # list_sessions, list_steps, list_tasks |
| 226 | +├── _get_apis.py # get_job, get_session |
| 227 | +└── _search_apis.py # search_jobs |
| 228 | +``` |
| 229 | + |
| 230 | +### D. Registry Configuration |
| 231 | + |
| 232 | +```python |
| 233 | +TOOL_REGISTRY = { |
| 234 | +    "get_job": {"func": api.get_job, "param_names": ["farmId", "queueId", "jobId"]}, |
| 235 | +    "get_session": {"func": api.get_session, "param_names": ["farmId", "queueId", "jobId", "sessionId"]}, |
| 236 | +    "list_sessions": {"func": api.list_sessions, "param_names": ["farmId", "queueId", "jobId"]}, |
| 237 | +    "list_steps": {"func": api.list_steps, "param_names": ["farmId", "queueId", "jobId"]}, |
| 238 | +    "list_tasks": {"func": api.list_tasks, "param_names": ["farmId", "queueId", "jobId", "stepId"]}, |
| 239 | +    "search_jobs": {"func": api.search_jobs, "param_names": ["farmId", "queueIds", "taskRunStatus", "nameContains", "pageSize", "itemOffset"]}, |
| 240 | +    "diagnose_failed_job": {"func": diagnostics.diagnose_failed_job, "param_names": None}, |
| 241 | +} |
| 242 | +``` |
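
The design does not say how the server consumes this registry. One plausible reading is that `param_names` lists the camelCase arguments a tool accepts, which are then translated to the snake_case parameters of the Python function. The sketch below illustrates that reading only. `camel_to_snake` and `call_tool` are hypothetical helpers, and treating `param_names: None` as "skip validation" is an assumption.

```python
import re
from typing import Any, Dict


def camel_to_snake(name: str) -> str:
    """Convert e.g. 'taskRunStatus' to 'task_run_status'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def call_tool(registry: Dict[str, Dict[str, Any]], tool_name: str, arguments: Dict[str, Any]) -> Any:
    """Validate arguments against the registry entry, then call the function."""
    entry = registry[tool_name]
    allowed = entry["param_names"]
    # Assumption: None means the tool does not restrict its argument names here.
    if allowed is not None:
        unknown = sorted(set(arguments) - set(allowed))
        if unknown:
            raise ValueError(f"Unknown arguments for {tool_name}: {unknown}")
    kwargs = {camel_to_snake(key): value for key, value in arguments.items()}
    return entry["func"](**kwargs)
```

Under this reading, `call_tool(TOOL_REGISTRY, "get_job", {"farmId": ..., "queueId": ..., "jobId": ...})` would invoke `api.get_job(farm_id=..., queue_id=..., job_id=...)`.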
| 243 | + |
| 244 | +### E. Security |
| 245 | + |
| 246 | +- Uses existing authentication via `get_boto3_client` |
| 247 | +- Queue credentials via `get_queue_user_boto3_session` for CloudWatch |
| 248 | +- No new IAM permissions required |
| 249 | + |
| 250 | +### F. Testing Strategy |
| 251 | + |
| 252 | +**Unit Tests:** |
| 253 | +- Mock boto3 responses for each API |
| 254 | +- Test pagination, error handling |
| 255 | +- Test `diagnose_failed_job` with various failure scenarios |
| 256 | + |
| 257 | +**Integration Tests:** |
| 258 | +- Submit intentionally failing job, then diagnose |
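
A unit test along these lines might look like the following sketch, which mocks the boto3 client with `unittest.mock.MagicMock` and checks that pagination is followed. `list_all_steps` is a hypothetical stand-in for the real wrapper; it assumes the underlying `list_steps` client call takes `farmId`, `queueId`, `jobId` and pages with `nextToken`.

```python
from typing import Any, Dict, List
from unittest.mock import MagicMock


def list_all_steps(client: Any, farm_id: str, queue_id: str, job_id: str) -> List[Dict[str, Any]]:
    """Collect steps across pages by following nextToken."""
    kwargs = {"farmId": farm_id, "queueId": queue_id, "jobId": job_id}
    steps: List[Dict[str, Any]] = []
    while True:
        page = client.list_steps(**kwargs)
        steps.extend(page.get("steps", []))
        token = page.get("nextToken")
        if not token:
            return steps
        kwargs["nextToken"] = token


def test_list_all_steps_follows_next_token() -> None:
    client = MagicMock()
    # Two mocked pages: the first carries a nextToken, the second does not.
    client.list_steps.side_effect = [
        {"steps": [{"stepId": "step-1"}], "nextToken": "t1"},
        {"steps": [{"stepId": "step-2"}]},
    ]
    steps = list_all_steps(client, "farm-1", "queue-1", "job-1")
    assert [s["stepId"] for s in steps] == ["step-1", "step-2"]
    assert client.list_steps.call_count == 2
```

The same pattern extends to error handling by setting `side_effect` to an exception instance instead of a list of pages.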
| 259 | + |
| 260 | +### G. Future Enhancements |
| 261 | + |
| 262 | +- Worker diagnostics (`get_worker`, `list_workers`) |
| 263 | +- Log filtering/grep capability |
| 264 | +- Time-based job search |
| 265 | +- Batch diagnostics |
| 266 | +- Export to file |
| 267 | + |
| 268 | +### H. References |
| 269 | + |
| 270 | +- [AWS Deadline Cloud API Reference](https://docs.aws.amazon.com/deadline-cloud/latest/APIReference/Welcome.html) |
| 271 | +- [SearchJobs API](https://docs.aws.amazon.com/deadline-cloud/latest/APIReference/API_SearchJobs.html) |
| 272 | +- [GetSession API](https://docs.aws.amazon.com/deadline-cloud/latest/APIReference/API_GetSession.html) |