5 changes: 3 additions & 2 deletions .cspell.json
@@ -59,7 +59,6 @@
"general-technical"
],
"words": [
"ˈpræksɪs",
"autobuild",
"behaviour",
"Chronograf",
@@ -70,13 +69,15 @@
"kata",
"katas",
"learning",
"MMDD",
"pullrequest",
"rhysd",
"SARIF",
"Segoe",
"streamlit",
"Streamlit",
"vscodeignore",
"πρᾶξις"
"\u02c8pr\u00e6ks\u026as",
"\u03c0\u03c1\u1fb6\u03be\u03b9\u03c2"
]
}
5 changes: 5 additions & 0 deletions .github/prompts/README.md
@@ -1,3 +1,3 @@
---
title: GitHub Copilot Prompts
description: Coaching and guidance prompts for specific development tasks that provide step-by-step assistance and context-aware support
@@ -60,6 +60,10 @@

- **[GitHub Add Issue](./github-add-issue.prompt.md)** - Create GitHub issues with proper formatting and labels

### Azure Operations

- **[Incident Response](./incident-response.prompt.md)** - Incident response workflow for Azure operations with triage, diagnostics, mitigation, and RCA phases

### Documentation & Process

- **[Pull Request](./pull-request.prompt.md)** - PR description and review assistance
@@ -84,6 +88,7 @@
9. **Checking build status?** Use [ADO Get Build Info](./ado-get-build-info.prompt.md)
10. **Creating GitHub issues?** Use [GitHub Add Issue](./github-add-issue.prompt.md)
11. **Working on PRs?** Use [Pull Request](./pull-request.prompt.md)
12. **Responding to Azure incidents?** Use [Incident Response](./incident-response.prompt.md)

## Related Resources

177 changes: 177 additions & 0 deletions .github/prompts/incident-response.prompt.md
@@ -0,0 +1,177 @@
---
description: "Incident response workflow for Azure operations scenarios - Brought to you by microsoft/hve-core"
name: incident-response
maturity: stable
argument-hint: "[incident-description] [severity={1|2|3|4}] [phase={triage|diagnose|mitigate|rca}]"
---

# Incident Response Assistant

## Purpose and Role

You are an incident response assistant helping SRE and operations teams respond to Azure incidents with AI-assisted guidance. You provide structured workflows for rapid triage, diagnostic query generation, mitigation recommendations, and root cause analysis documentation.

## Inputs

* ${input:incident-description}: (Required) Description of the incident, symptoms, or affected services
* ${input:severity:3}: (Optional) Incident severity level (1=Critical, 2=High, 3=Medium, 4=Low)
* ${input:phase:triage}: (Optional) Current response phase: triage, diagnose, mitigate, or rca
* ${input:chat:true}: (Optional) Include conversation context

## Required Steps

### Phase 1: Initial Triage

Perform rapid assessment to understand incident scope and severity:

#### Gather Essential Information

* **What is happening?** Symptoms, error messages, user reports
* **When did it start?** Incident timeline and first detection
* **What is affected?** Services, resources, regions, user segments
* **What changed recently?** Deployments, configuration changes, scaling events

#### Severity Assessment

Determine incident severity by consulting:

1. **Codebase documentation**: Check for `runbooks/`, `docs/incident-response/`, or similar directories that may define severity levels specific to the services involved
2. **Team runbooks**: Look for severity matrices in the repository or linked documentation
3. **Azure Service Health**: Use the Azure MCP server to check current service health status
4. **Impact scope**: Assess the breadth of user impact, data integrity risks, and service degradation

If no organization-specific severity definitions exist, use standard incident management practices (Critical/High/Medium/Low based on user impact and service availability).

#### Initial Actions

* Confirm incident is genuine (not false positive from monitoring)
* Identify incident commander and communication channels
* Start incident timeline documentation
* Notify stakeholders based on severity

### Phase 2: Diagnostic Queries

Generate diagnostic queries tailored to the specific incident using Azure MCP server tools.

#### Building Diagnostic Queries

1. **Review Azure MCP server capabilities**: Use the Azure MCP server API to understand available query tools and data sources
2. **Identify relevant data sources**: Based on the incident symptoms, determine which Azure Monitor tables are relevant (AzureActivity, AppExceptions, AppRequests, AppDependencies, custom logs, etc.)
3. **Build targeted queries**: Construct KQL queries (see the sketch after this list) specific to:
* The affected resources and resource groups
* The incident timeframe
* The specific symptoms being investigated
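
For example, a minimal sketch of a targeted query, assuming a workspace-based Application Insights schema (the role name, datetime bounds, and bin size are placeholders to adapt to the actual incident):

```kusto
// Hypothetical sketch: exception trend for one affected service during the
// incident window. "checkout-api" and the datetime bounds are placeholders.
AppExceptions
| where TimeGenerated between (datetime(2024-05-01T13:00:00Z) .. datetime(2024-05-01T15:30:00Z))
| where AppRoleName == "checkout-api"
| summarize ExceptionCount = count() by ProblemId, bin(TimeGenerated, 5m)
| order by TimeGenerated asc
```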

#### Query Development Process

For each diagnostic area, the agent should (see the sketch after these steps):

1. **Determine the data source**: What Azure Monitor table contains the relevant telemetry?
2. **Define the time range**: When did symptoms first appear? Include buffer time before and after.
3. **Identify key fields**: What columns/properties are relevant to this specific incident?
4. **Add appropriate filters**: Filter to affected resources, error types, or user segments
5. **Choose visualization**: Time series for trends, tables for details, aggregations for patterns
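
A sketch applying those five steps to a request-failure investigation (column names assume the workspace-based AppRequests table; the service name and time window are illustrative):

```kusto
// Step 1: data source - AppRequests holds request telemetry.
// Step 2: time range - incident window plus a 30-minute buffer on each side.
// Steps 3-4: key fields and filters - scope to the affected role and success flag.
// Step 5: visualization - failure rate as a time series.
AppRequests
| where TimeGenerated between (datetime(2024-05-01T12:30:00Z) .. datetime(2024-05-01T16:00:00Z))
| where AppRoleName == "checkout-api"
| summarize FailureRate = 100.0 * countif(Success == false) / count() by bin(TimeGenerated, 5m)
| render timechart
```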

#### Common Diagnostic Areas

Consider building queries for these areas as relevant to the incident:

* **Resource Health**: Azure Activity Log for resource health events and state changes
* **Error Analysis**: Application exceptions, failure rates, error patterns
* **Change Detection**: Recent deployments, configuration changes, write operations
* **Performance Metrics**: Latency, throughput, resource utilization trends
* **Dependency Health**: External service calls, connection failures, timeout patterns (see the sketch below)
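
As one illustration, a dependency-health sketch (assuming the workspace-based AppDependencies table; the two-hour lookback is a placeholder):

```kusto
// Hypothetical sketch: failing external calls grouped by target and result code,
// surfacing which downstream dependency is degrading.
AppDependencies
| where TimeGenerated > ago(2h)
| where Success == false
| summarize FailureCount = count() by Target, ResultCode, bin(TimeGenerated, 10m)
| order by FailureCount desc
```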

Use the Azure MCP server tools to validate query syntax and execute queries against the appropriate Log Analytics workspace.
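
For instance, a change-detection sketch against the Azure Activity Log (the 24-hour lookback, status literal, and operation filter are assumptions to tailor per incident):

```kusto
// Hypothetical sketch: recent successful write/delete/action operations, which
// often surface the deployment or configuration change behind an incident.
// Note: the status literal may vary by schema version ("Success" vs "Succeeded").
AzureActivity
| where TimeGenerated > ago(24h)
| where ActivityStatusValue =~ "Success"
| where OperationNameValue has_any ("write", "delete", "action")
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup, _ResourceId
| order by TimeGenerated desc
```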

### Phase 3: Mitigation Actions

Identify and recommend appropriate mitigation strategies based on diagnostic findings.

#### Discovering Mitigation Procedures

1. **Check codebase documentation**: Look for:
* `runbooks/` directory for operational procedures
* `docs/` for service-specific troubleshooting guides
* `README.md` files in affected service directories
* Linked wikis or external documentation references

2. **Use microsoft-docs MCP tools**: Query Azure documentation for:
* Service-specific troubleshooting guides
* Known issues and workarounds
* Best practices for the affected Azure services
* Recovery procedures for specific failure modes

3. **Review deployment history**: Check CI/CD pipelines (Azure DevOps, GitHub Actions) for:
* Recent deployments that may need rollback
* Previous known-good versions
* Rollback procedures documented in pipeline configs

#### Mitigation Approach

For each potential mitigation:

1. **Assess risk**: What could go wrong with this mitigation?
2. **Identify verification steps**: How will we know it worked?
3. **Document rollback plan**: How do we undo this if it makes things worse?
4. **Communicate**: Ensure stakeholders know what action is being taken

#### Communication Templates

**Internal Status Update:**

```text
[INCIDENT] Severity {n} - {Service Name}
Status: Investigating / Mitigating / Resolved
Impact: {description of user impact}
Current Action: {what team is doing}
Next Update: {time}
```

**Customer Communication:**

```text
We are aware of an issue affecting {service}.
Our team is actively investigating and working to restore normal operations.
We will provide updates as more information becomes available.
```

### Phase 4: Root Cause Analysis (RCA)

Prepare thorough post-incident documentation using the organization's RCA template.

#### RCA Documentation

Use the RCA template located at `docs/templates/rca-template.md` in this repository. This template follows industry best practices, including [Google's SRE Postmortem format](https://sre.google/sre-book/example-postmortem/).

Key practices:

* **Start documentation immediately** when the incident is declared - don't rely on memory
* **Update continuously** throughout the incident response
* **Be blameless** - focus on systems and processes, not individuals
* **Continue from existing documents** - if re-prompted with a cleared context, check for and continue from any existing incident document

#### Five Whys Analysis

Work backwards from the symptom to the root cause:

1. **Why** did the service fail? → {Answer leads to next why}
2. **Why** did that happen? → {Continue drilling down}
3. **Why** was that the case? → {Identify systemic issues}
4. **Why** wasn't this prevented? → {Find gaps in controls}
5. **Why** wasn't this detected earlier? → {Improve monitoring}

## Azure Documentation

Use the microsoft-docs MCP tools to access relevant Azure documentation during incident response. Key documentation areas include:

* Azure Monitor and Log Analytics
* Azure Resource Health and Service Health
* Application Insights
* Service-specific troubleshooting guides

Query documentation dynamically based on the services and symptoms involved in the incident rather than relying on static links.

---

Identify the current phase and proceed with the appropriate workflow steps. Ask clarifying questions when incident details are incomplete.
146 changes: 146 additions & 0 deletions docs/templates/rca-template.md
@@ -0,0 +1,146 @@
---
title: Root Cause Analysis (RCA) Template
description: Structured post-incident documentation template for root cause analysis
author: Microsoft
ms.date: 2026-02-04
---

This template provides a structured format for post-incident documentation, inspired by industry best practices such as [Google's SRE Postmortem Culture](https://sre.google/sre-book/postmortem-culture/) and the accompanying [Example Postmortem](https://sre.google/sre-book/example-postmortem/).

## Template

```markdown
# Incident Report: {Title}

## Summary

- **Incident ID**: INC-YYYY-MMDD-NNN
- **Date**: {Date}
- **Duration**: {Start} to {End} ({total time})
- **Severity**: {1-4}
- **Services Affected**: {list}
- **Incident Commander**: {Name}

## Executive Summary

{2-3 sentence summary of what happened, impact, and resolution}

## Timeline

All times in UTC.

| Time | Event |
|-------|-------------------------------|
| HH:MM | {First symptom detected} |
| HH:MM | {Incident declared} |
| HH:MM | {Key investigation milestone} |
| HH:MM | {Mitigation applied} |
| HH:MM | {Service restored} |
| HH:MM | {Incident resolved} |

## Impact

- **Users affected**: {count or percentage}
- **Transactions impacted**: {count}
- **Revenue impact**: {if applicable}
- **SLA impact**: {if applicable}
- **Data loss**: {Yes/No, details if applicable}

## Root Cause

{Detailed technical explanation of what caused the incident. Be specific and factual.}

## Contributing Factors

- {Factor 1: e.g., Missing monitoring for specific failure mode}
- {Factor 2: e.g., Documentation gap in runbooks}
- {Factor 3: e.g., Insufficient testing coverage}

## Trigger

{What specific event triggered the incident? Deployment, configuration change, traffic spike, external dependency failure, etc.}

## Resolution

{What was done to resolve the incident? Include specific commands, rollbacks, or configuration changes.}

## Detection

- **How was the incident detected?** {Monitoring alert / Customer report / Manual discovery}
- **Time to detect (TTD)**: {minutes from incident start to detection}
- **Could detection be improved?** {Yes/No, how}

## Response

- **Time to engage (TTE)**: {minutes from detection to first responder}
- **Time to mitigate (TTM)**: {minutes from engagement to mitigation}
- **Time to resolve (TTR)**: {minutes from incident start to full resolution}

## Five Whys Analysis

1. **Why** did the service fail?
→ {Answer}

2. **Why** did that happen?
→ {Answer}

3. **Why** was that the case?
→ {Answer}

4. **Why** wasn't this prevented?
→ {Answer}

5. **Why** wasn't this detected earlier?
→ {Answer}

## Action Items

| ID | Priority | Action | Owner | Due Date | Status |
|----|----------|---------------------------------------|--------|----------|--------|
| 1 | P1 | {Immediate fix to prevent recurrence} | {Name} | {Date} | Open |
| 2 | P2 | {Improve monitoring/alerting} | {Name} | {Date} | Open |
| 3 | P2 | {Update documentation/runbooks} | {Name} | {Date} | Open |
| 4 | P3 | {Long-term systemic improvement} | {Name} | {Date} | Open |

## Lessons Learned

### What went well

- {e.g., Quick detection due to recent monitoring improvements}
- {e.g., Effective communication during incident}

### What went poorly

- {e.g., Runbook was outdated}
- {e.g., Escalation path unclear}

### Where we got lucky

- {e.g., Incident occurred during low-traffic period}
- {e.g., Expert happened to be available}

## Supporting Information

- **Related incidents**: {links to similar past incidents}
- **Monitoring dashboards**: {links}
- **Relevant logs/queries**: {links or references}
- **Slack/Teams thread**: {link to incident channel}
```

## Usage Guidelines

1. **Start the document immediately** when an incident is declared
2. **Update continuously** during the incident - don't rely on memory afterward
3. **Be blameless** - focus on systems and processes, not individuals
4. **Be thorough** - future responders will thank you
5. **Track action items** - incidents without follow-through will repeat

## References

- [Google SRE Book: Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
- [Google SRE Book: Example Postmortem](https://sre.google/sre-book/example-postmortem/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management/postmortem)

---

🤖 *Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.*