memer/skills/debugging/incident-response-stabilization.md
2026-03-28 00:43:27 -05:00

Incident Response and Stabilization

Purpose

Guide high-pressure response to live or high-impact issues by separating immediate stabilization from deeper root-cause correction.

When to use

  • A production issue is actively impacting users or operators
  • A regression needs containment before a complete fix is ready
  • The team needs a calm sequence for triage, mitigation, and follow-up
  • Communication and operational clarity matter as much as code changes

Inputs to gather

  • Current symptoms, severity, affected users, and timing
  • Available logs, metrics, alerts, dashboards, and recent changes
  • Safe rollback, feature flag, degrade, or traffic-shaping options
  • Stakeholders who need updates and what they need to know

How to work

  • Stabilize user impact first if a safe containment path exists.
  • Keep mitigation, diagnosis, and communication distinct but coordinated.
  • Prefer reversible steps under uncertainty.
  • Record what is confirmed versus assumed while the incident is active.
  • After stabilization, convert the incident into structured debugging and prevention work.
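The points about preferring reversible steps and separating confirmed facts from assumptions can be sketched in code. This is a minimal illustration only: `Mitigation`, `choose_mitigation`, and `IncidentLog` are hypothetical names invented for this example, and the `safe` and `reversible` flags are assumed to come from human judgment during the incident, not from any tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Mitigation:
    """A candidate containment step (hypothetical structure)."""
    name: str
    reversible: bool  # can the step be undone quickly?
    safe: bool        # judged safe to apply before root cause is known


def choose_mitigation(options: list[Mitigation]) -> Optional[Mitigation]:
    """Return the first option that is both safe and reversible.

    Returns None when no such option exists, which signals that a
    person should decide rather than the sketch picking a risky step.
    """
    for option in options:
        if option.safe and option.reversible:
            return option
    return None


@dataclass
class IncidentLog:
    """Keeps confirmed observations apart from working assumptions."""
    confirmed: list[str] = field(default_factory=list)
    assumed: list[str] = field(default_factory=list)

    def note(self, statement: str, *, confirmed: bool) -> None:
        """Record a statement in the matching list."""
        target = self.confirmed if confirmed else self.assumed
        target.append(statement)

    def confirm(self, statement: str) -> None:
        """Promote an assumption once evidence supports it."""
        self.assumed.remove(statement)
        self.confirmed.append(statement)


if __name__ == "__main__":
    options = [
        Mitigation("drop and rebuild index", reversible=False, safe=True),
        Mitigation("disable feature flag", reversible=True, safe=True),
    ]
    chosen = choose_mitigation(options)
    print(chosen.name if chosen else "no safe reversible option; escalate")

    log = IncidentLog()
    log.note("error rate rose after the latest deploy", confirmed=True)
    log.note("the new cache layer is the cause", confirmed=False)
    print("confirmed:", log.confirmed)
    print("assumed:", log.assumed)
```

Returning `None` instead of falling back to an irreversible option is a deliberate choice in this sketch: it keeps risky steps a human decision while the cause is still uncertain.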

Output expectations

  • Stabilization plan or incident response summary
  • Clear mitigation status and next actions
  • Follow-up work for root cause, observability, and prevention
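One possible shape for these outputs is a small structured summary that renders to plain text for a chat channel or ticket. The `IncidentSummary` class, its field names, and the sample values are assumptions made for illustration and do not describe a required format.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentSummary:
    """Hypothetical container for the three expected outputs."""
    title: str
    mitigation_status: str
    next_actions: list[str] = field(default_factory=list)
    follow_ups: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the summary as plain text, one item per line."""
        def section(heading: str, items: list[str]) -> list[str]:
            # Show an explicit placeholder so empty sections are visible.
            body = [f"  - {item}" for item in items] or ["  - (none recorded)"]
            return [heading] + body

        lines = [
            f"Incident: {self.title}",
            f"Mitigation status: {self.mitigation_status}",
        ]
        lines += section("Next actions:", self.next_actions)
        lines += section("Follow-up work:", self.follow_ups)
        return "\n".join(lines)


if __name__ == "__main__":
    summary = IncidentSummary(
        title="Checkout errors after deploy",  # sample value only
        mitigation_status="mitigated via rollback, root cause not fixed",
        next_actions=["Monitor error rate for one hour"],
        follow_ups=["Root-cause analysis", "Add alert on checkout error rate"],
    )
    print(summary.render())
```

Printing a placeholder for empty sections makes it obvious when follow-up work has not yet been captured.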

Quality checklist

  • Reducing user impact takes priority over diagnosis whenever a safe containment path exists.
  • Risky irreversible changes are avoided under pressure.
  • Communication is clear enough for collaborators to act.
  • Post-incident follow-up is not lost after immediate recovery.

Handoff notes

  • Note what was mitigated versus actually fixed.
  • Pair with the debugging workflow and observability work once the system is stable enough for deeper investigation.