skills/debugging/incident-response-stabilization.md

# Incident Response and Stabilization

## Purpose

Guide high-pressure response to live or high-impact issues by separating immediate stabilization from deeper root-cause correction.

## When to use

- A production issue is actively impacting users or operators
- A regression needs containment before a complete fix is ready
- The team needs a calm sequence for triage, mitigation, and follow-up
- Communication and operational clarity matter as much as code changes

## Inputs to gather

- Current symptoms, severity, affected users, and timing
- Available logs, metrics, alerts, dashboards, and recent changes
- Safe rollback, feature-flag, graceful-degradation, or traffic-shaping options
- Stakeholders who need updates and what they need to know

## How to work

- Stabilize user impact first if a safe containment path exists.
- Keep mitigation, diagnosis, and communication distinct but coordinated.
- Prefer reversible steps under uncertainty.
- Record what is confirmed versus assumed while the incident is active.
- After stabilization, convert the incident into structured debugging and prevention work.
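
The working rules above can be sketched in code. This is a minimal illustration, not a prescribed tool: the names `IncidentLog`, `Finding`, `Mitigation`, and `pick_mitigation`, and the `safe` flag, are all hypothetical. It shows one way to keep confirmed facts apart from assumptions and to prefer reversible containment steps when more than one safe option exists.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Certainty(Enum):
    CONFIRMED = "confirmed"
    ASSUMED = "assumed"


@dataclass
class Finding:
    text: str
    certainty: Certainty


@dataclass
class Mitigation:
    name: str
    reversible: bool
    safe: bool  # hypothetical flag: the team judged this step safe to apply now


@dataclass
class IncidentLog:
    findings: List[Finding] = field(default_factory=list)

    def note(self, text: str, certainty: Certainty) -> None:
        """Record a finding together with how certain it is."""
        self.findings.append(Finding(text, certainty))

    def by_certainty(self, certainty: Certainty) -> List[str]:
        """Return the text of all findings with the given certainty."""
        return [f.text for f in self.findings if f.certainty is certainty]


def pick_mitigation(options: List[Mitigation]) -> Optional[Mitigation]:
    """Pick a safe option, preferring reversible ones; None if nothing is safe."""
    safe_options = [m for m in options if m.safe]
    # sorted() is stable: reversible options move to the front and the
    # original order is otherwise preserved.
    ranked = sorted(safe_options, key=lambda m: not m.reversible)
    return ranked[0] if ranked else None
```

In use, a responder would call `log.note(...)` as observations come in and pass the candidate steps to `pick_mitigation`. Returning `None` when nothing is safe mirrors the first rule: stabilize first only if a safe containment path exists.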

## Output expectations

- Stabilization plan or incident response summary
- Clear mitigation status and next actions
- Follow-up work for root cause, observability, and prevention
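
As one possible shape for these outputs, the sketch below renders a short status update. The `StatusUpdate` class and its field names are hypothetical and chosen only for illustration. It stores mitigation status and root-cause status as separate fields so that a mitigated incident is not reported as fixed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StatusUpdate:
    summary: str
    mitigation_status: str   # free text, e.g. "mitigated" or "in progress"
    root_cause_fixed: bool   # kept separate from mitigation status
    next_actions: List[str] = field(default_factory=list)
    follow_ups: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return a plain-text update suitable for a chat channel or ticket."""
        lines = [
            f"Summary: {self.summary}",
            f"Mitigation: {self.mitigation_status}",
            f"Root cause fixed: {'yes' if self.root_cause_fixed else 'no'}",
        ]
        # Each section is always printed, so a missing list is visible
        # as "none" instead of being silently omitted.
        for title, items in (("Next actions", self.next_actions),
                             ("Follow-up work", self.follow_ups)):
            lines.append(f"{title}:")
            if items:
                lines.extend(f"  - {item}" for item in items)
            else:
                lines.append("  - none")
        return "\n".join(lines)
```

Printing empty sections as "none" is a deliberate choice here: it makes missing follow-up work stand out after the immediate recovery.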

## Quality checklist

- Reducing user impact took priority over diagnosis wherever a safe containment path existed.
- Risky irreversible changes are avoided under pressure.
- Communication is clear enough for collaborators to act.
- Post-incident follow-up is not lost after immediate recovery.

## Handoff notes

- Note what was mitigated versus actually fixed.
- Pair with the debugging workflow and observability work once the system is stable enough for deeper investigation.