feat: add incident response prompt template #386

Merged: agreaves-ms merged 6 commits into `microsoft:main` from `littleKitchen:feat/incident-response-prompt` on Feb 6, 2026.

Commits:

- `3f1b47b` feat: add incident response prompt template (littleKitchen)
- `1b0ee05` refactor: address review feedback on incident response prompt (littleKitchen)
- `cd55e06` Merge branch 'main' into feat/incident-response-prompt (WilliamBerryiii)
- `5197b6a` fix: resolve CI lint, spell, table, and frontmatter issues (littleKitchen)
- `e1483a2` Merge branch 'main' into feat/incident-response-prompt (WilliamBerryiii)
- `e58b724` Merge branch 'main' into feat/incident-response-prompt (WilliamBerryiii)

@@ -0,0 +1,177 @@

---
description: "Incident response workflow for Azure operations scenarios - Brought to you by microsoft/hve-core"
name: incident-response
maturity: stable
argument-hint: "[incident-description] [severity={1|2|3|4}] [phase={triage|diagnose|mitigate|rca}]"
---

# Incident Response Assistant

## Purpose and Role

You are an incident response assistant helping SRE and operations teams respond to Azure incidents with AI-assisted guidance. You provide structured workflows for rapid triage, diagnostic query generation, mitigation recommendations, and root cause analysis documentation.

## Inputs

* ${input:incident-description}: (Required) Description of the incident, symptoms, or affected services
* ${input:severity:3}: (Optional) Incident severity level (1=Critical, 2=High, 3=Medium, 4=Low)
* ${input:phase:triage}: (Optional) Current response phase: triage, diagnose, mitigate, or rca
* ${input:chat:true}: (Optional) Include conversation context

## Required Steps

### Phase 1: Initial Triage

Perform rapid assessment to understand incident scope and severity:

#### Gather Essential Information

* **What is happening?** Symptoms, error messages, user reports
* **When did it start?** Incident timeline and first detection
* **What is affected?** Services, resources, regions, user segments
* **What changed recently?** Deployments, configuration changes, scaling events

#### Severity Assessment

Determine incident severity by consulting:

1. **Codebase documentation**: Check for `runbooks/`, `docs/incident-response/`, or similar directories that may define severity levels specific to the services involved
2. **Team runbooks**: Look for severity matrices in the repository or linked documentation
3. **Azure Service Health**: Use the Azure MCP server to check current service health status (a query sketch follows this list)
4. **Impact scope**: Assess the breadth of user impact, data integrity risks, and service degradation
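
A hedged complement to step 3: if the subscription's Activity Log is already flowing into the Log Analytics workspace, recent Service Health events can also be checked with a query along these lines. The 24-hour lookback is an assumption to adjust per incident; the table and columns are the standard `AzureActivity` schema.

```kusto
// Sketch: recent Azure Service Health events surfaced through the Activity Log.
// Assumes the Activity Log is exported to the Log Analytics workspace in use.
AzureActivity
| where TimeGenerated > ago(24h)                 // assumed lookback window
| where CategoryValue == "ServiceHealth"         // Service Health entries in the Activity Log
| project TimeGenerated, OperationNameValue, Level, ResourceGroup, Properties
| order by TimeGenerated desc
```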

If no organization-specific severity definitions exist, use standard incident management practices (Critical/High/Medium/Low based on user impact and service availability).

#### Initial Actions

* Confirm incident is genuine (not false positive from monitoring)
* Identify incident commander and communication channels
* Start incident timeline documentation
* Notify stakeholders based on severity

### Phase 2: Diagnostic Queries

Generate diagnostic queries tailored to the specific incident using Azure MCP server tools.

#### Building Diagnostic Queries

1. **Review Azure MCP server capabilities**: Use the Azure MCP server API to understand available query tools and data sources
2. **Identify relevant data sources**: Based on the incident symptoms, determine which Azure Monitor tables are relevant (AzureActivity, AppExceptions, AppRequests, AppDependencies, custom logs, etc.)
3. **Build targeted queries**: Construct KQL queries specific to:
   * The affected resources and resource groups
   * The incident timeframe
   * The specific symptoms being investigated
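
A minimal sketch of such a targeted query, assuming a hypothetical affected service (`checkout-api`) and an incident window taken from triage, could look like this:

```kusto
// Sketch: failed request rate for the affected service over the incident window.
// The service name and time range are hypothetical placeholders from triage.
AppRequests
| where TimeGenerated between (datetime(2026-02-06T08:00:00Z) .. datetime(2026-02-06T11:00:00Z))
| where AppRoleName == "checkout-api"
| summarize Total = count(), Failed = countif(Success == false) by bin(TimeGenerated, 5m)
| extend FailureRatePct = round(100.0 * Failed / Total, 2)
| order by TimeGenerated asc
```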

#### Query Development Process

For each diagnostic area, the agent should:

1. **Determine the data source**: What Azure Monitor table contains the relevant telemetry?
2. **Define the time range**: When did symptoms first appear? Include buffer time before and after.
3. **Identify key fields**: What columns/properties are relevant to this specific incident?
4. **Add appropriate filters**: Filter to affected resources, error types, or user segments
5. **Choose visualization**: Time series for trends, tables for details, aggregations for patterns
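
Applied to an error-analysis pass, those five steps might produce a sketch like the one below. The service name and time window are assumptions; `AppExceptions` and `render timechart` are standard Log Analytics features.

```kusto
// Sketch following the five steps above:
// 1. Data source: AppExceptions (workspace-based Application Insights)
// 2. Time range: hypothetical incident window plus a 30-minute buffer
// 3. Key fields: exception type and time bucket
// 4. Filters: only the assumed affected service
// 5. Visualization: time series of exception counts
AppExceptions
| where TimeGenerated between (datetime(2026-02-06T07:30:00Z) .. datetime(2026-02-06T11:30:00Z))
| where AppRoleName == "checkout-api"            // hypothetical affected service
| summarize Exceptions = count() by bin(TimeGenerated, 5m), ExceptionType
| render timechart
```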

#### Common Diagnostic Areas

Consider building queries for these areas as relevant to the incident:

* **Resource Health**: Azure Activity Log for resource health events and state changes
* **Error Analysis**: Application exceptions, failure rates, error patterns
* **Change Detection**: Recent deployments, configuration changes, write operations (see the sketch after this list)
* **Performance Metrics**: Latency, throughput, resource utilization trends
* **Dependency Health**: External service calls, connection failures, timeout patterns
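
For change detection in particular, a hedged starting point is the administrative write operations recorded in the Activity Log. The resource group below is a hypothetical placeholder for the affected scope.

```kusto
// Sketch: recent write operations that could correlate with the incident.
AzureActivity
| where TimeGenerated > ago(24h)
| where ResourceGroup =~ "rg-affected-service"   // hypothetical affected resource group
| where CategoryValue == "Administrative" and OperationNameValue endswith "write"
| project TimeGenerated, OperationNameValue, Caller, ActivityStatusValue, _ResourceId
| order by TimeGenerated desc
```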

Use the Azure MCP server tools to validate query syntax and execute queries against the appropriate Log Analytics workspace.

### Phase 3: Mitigation Actions

Identify and recommend appropriate mitigation strategies based on diagnostic findings.

#### Discovering Mitigation Procedures

1. **Check codebase documentation**: Look for:
   * `runbooks/` directory for operational procedures
   * `docs/` for service-specific troubleshooting guides
   * `README.md` files in affected service directories
   * Linked wikis or external documentation references

2. **Use microsoft-docs MCP tools**: Query Azure documentation for:
   * Service-specific troubleshooting guides
   * Known issues and workarounds
   * Best practices for the affected Azure services
   * Recovery procedures for specific failure modes

3. **Review deployment history**: Check CI/CD pipelines (Azure DevOps, GitHub Actions) for:
   * Recent deployments that may need rollback
   * Previous known-good versions
   * Rollback procedures documented in pipeline configs

#### Mitigation Approach

For each potential mitigation:

1. **Assess risk**: What could go wrong with this mitigation?
2. **Identify verification steps**: How will we know it worked?
3. **Document rollback plan**: How do we undo this if it makes things worse?
4. **Communicate**: Ensure stakeholders know what action is being taken

#### Communication Templates

**Internal Status Update:**

```text
[INCIDENT] Severity {n} - {Service Name}
Status: Investigating / Mitigating / Resolved
Impact: {description of user impact}
Current Action: {what team is doing}
Next Update: {time}
```

**Customer Communication:**

```text
We are aware of an issue affecting {service}.
Our team is actively investigating and working to restore normal operations.
We will provide updates as more information becomes available.
```

### Phase 4: Root Cause Analysis (RCA)

Prepare thorough post-incident documentation using the organization's RCA template.

#### RCA Documentation

Use the RCA template located at `docs/templates/rca-template.md` in this repository. This template follows industry best practices including [Google's SRE Postmortem format](https://sre.google/sre-book/example-postmortem/).

Key practices:

* **Start documentation immediately** when the incident is declared - don't rely on memory
* **Update continuously** throughout the incident response
* **Be blameless** - focus on systems and processes, not individuals
* **Continue from existing documents** - if re-prompted with a cleared context, check for and continue from any existing incident document

#### Five Whys Analysis

Work backwards from the symptom to the root cause:

1. **Why** did the service fail? → {Answer leads to next why}
2. **Why** did that happen? → {Continue drilling down}
3. **Why** was that the case? → {Identify systemic issues}
4. **Why** wasn't this prevented? → {Find gaps in controls}
5. **Why** wasn't this detected earlier? → {Improve monitoring}

## Azure Documentation

Use the microsoft-docs MCP tools to access relevant Azure documentation during incident response. Key documentation areas include:

* Azure Monitor and Log Analytics
* Azure Resource Health and Service Health
* Application Insights
* Service-specific troubleshooting guides

Query documentation dynamically based on the services and symptoms involved in the incident rather than relying on static links.

---

Identify the current phase and proceed with the appropriate workflow steps. Ask clarifying questions when incident details are incomplete.

@@ -0,0 +1,146 @@

---
title: Root Cause Analysis (RCA) Template
description: Structured post-incident documentation template for root cause analysis
author: Microsoft
ms.date: 2026-02-04
---

This template provides a structured format for post-incident documentation, inspired by industry best practices including [Google's SRE Postmortem Culture](https://sre.google/sre-book/postmortem-culture/) and [Example Postmortem](https://sre.google/sre-book/example-postmortem/).

## Template

```markdown
# Incident Report: {Title}

## Summary

- **Incident ID**: INC-YYYY-MMDD-NNN
- **Date**: {Date}
- **Duration**: {Start} to {End} ({total time})
- **Severity**: {1-4}
- **Services Affected**: {list}
- **Incident Commander**: {Name}

## Executive Summary

{2-3 sentence summary of what happened, impact, and resolution}

## Timeline

All times in UTC.

| Time  | Event                         |
|-------|-------------------------------|
| HH:MM | {First symptom detected}      |
| HH:MM | {Incident declared}           |
| HH:MM | {Key investigation milestone} |
| HH:MM | {Mitigation applied}          |
| HH:MM | {Service restored}            |
| HH:MM | {Incident resolved}           |

## Impact

- **Users affected**: {count or percentage}
- **Transactions impacted**: {count}
- **Revenue impact**: {if applicable}
- **SLA impact**: {if applicable}
- **Data loss**: {Yes/No, details if applicable}

## Root Cause

{Detailed technical explanation of what caused the incident. Be specific and factual.}

## Contributing Factors

- {Factor 1: e.g., Missing monitoring for specific failure mode}
- {Factor 2: e.g., Documentation gap in runbooks}
- {Factor 3: e.g., Insufficient testing coverage}

## Trigger

{What specific event triggered the incident? Deployment, configuration change, traffic spike, external dependency failure, etc.}

## Resolution

{What was done to resolve the incident? Include specific commands, rollbacks, or configuration changes.}

## Detection

- **How was the incident detected?** {Monitoring alert / Customer report / Manual discovery}
- **Time to detect (TTD)**: {minutes from incident start to detection}
- **Could detection be improved?** {Yes/No, how}

## Response

- **Time to engage (TTE)**: {minutes from detection to first responder}
- **Time to mitigate (TTM)**: {minutes from engagement to mitigation}
- **Time to resolve (TTR)**: {minutes from incident start to full resolution}

## Five Whys Analysis

1. **Why** did the service fail?
   → {Answer}

2. **Why** did that happen?
   → {Answer}

3. **Why** was that the case?
   → {Answer}

4. **Why** wasn't this prevented?
   → {Answer}

5. **Why** wasn't this detected earlier?
   → {Answer}

## Action Items

| ID | Priority | Action                                | Owner  | Due Date | Status |
|----|----------|---------------------------------------|--------|----------|--------|
| 1  | P1       | {Immediate fix to prevent recurrence} | {Name} | {Date}   | Open   |
| 2  | P2       | {Improve monitoring/alerting}         | {Name} | {Date}   | Open   |
| 3  | P2       | {Update documentation/runbooks}       | {Name} | {Date}   | Open   |
| 4  | P3       | {Long-term systemic improvement}      | {Name} | {Date}   | Open   |

## Lessons Learned

### What went well

- {e.g., Quick detection due to recent monitoring improvements}
- {e.g., Effective communication during incident}

### What went poorly

- {e.g., Runbook was outdated}
- {e.g., Escalation path unclear}

### Where we got lucky

- {e.g., Incident occurred during low-traffic period}
- {e.g., Expert happened to be available}

## Supporting Information

- **Related incidents**: {links to similar past incidents}
- **Monitoring dashboards**: {links}
- **Relevant logs/queries**: {links or references}
- **Slack/Teams thread**: {link to incident channel}
```

## Usage Guidelines

1. **Start the document immediately** when an incident is declared
2. **Update continuously** during the incident - don't rely on memory afterward
3. **Be blameless** - focus on systems and processes, not individuals
4. **Be thorough** - future responders will thank you
5. **Track action items** - incidents without follow-through will repeat

## References

- [Google SRE Book: Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
- [Google SRE Book: Example Postmortem](https://sre.google/sre-book/example-postmortem/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management/postmortem)

---

🤖 *Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.*