jason/memer

jason 796c374d38 agent init

2026-03-28 00:43:27 -05:00

Incident Response and Stabilization

Purpose

Guide high-pressure response to live or high-impact issues by separating immediate stabilization from deeper root-cause correction.

When to use

A production issue is actively impacting users or operators
A regression needs containment before a complete fix is ready
The team needs a calm sequence for triage, mitigation, and follow-up
Communication and operational clarity matter as much as code changes

Inputs to gather

Current symptoms, severity, affected users, and timing
Available logs, metrics, alerts, dashboards, and recent changes
Safe rollback, feature flag, degrade, or traffic-shaping options
Stakeholders who need updates and what they need to know
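
One containment option from the list above, a feature flag with a degraded fallback, might look roughly like the sketch below. It is only an illustration under stated assumptions: `FLAGS`, `recommendations_enabled`, `DEFAULT_ITEMS`, and `get_items` are hypothetical names, and a real system would read flags from its own configuration or flag service.

```python
from typing import Callable, List

# Hypothetical in-memory flag store (assumption for this sketch).
FLAGS = {"recommendations_enabled": True}

# Hypothetical static fallback served while the feature is degraded.
DEFAULT_ITEMS = ["popular-1", "popular-2"]


def get_items(fetch: Callable[[], List[str]]) -> List[str]:
    """Return live items, or the static fallback when the flag is off or the call fails."""
    if not FLAGS.get("recommendations_enabled", False):
        # Kill switch is off: skip the risky dependency entirely.
        return list(DEFAULT_ITEMS)
    try:
        return fetch()
    except Exception:
        # Degrade instead of propagating the failure to users.
        return list(DEFAULT_ITEMS)
```

Setting `FLAGS["recommendations_enabled"] = False` contains the issue without a deploy, and setting it back to `True` undoes the step, which is what makes this kind of option worth knowing about before it is needed.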

How to work

Stabilize user impact first if a safe containment path exists.
Keep mitigation, diagnosis, and communication distinct but coordinated.
Prefer reversible steps under uncertainty.
Record what is confirmed versus assumed while the incident is active.
After stabilization, convert the incident into structured debugging and prevention work.
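
The practices above (recording confirmed versus assumed facts, and preferring reversible steps) can be sketched as a small incident log. This is a minimal illustration, not a prescribed tool: `IncidentLog`, `Entry`, `Mitigation`, and their fields are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Entry:
    text: str
    confirmed: bool  # True only when backed by evidence (log line, metric, repro)
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Mitigation:
    action: str
    reversible: bool
    undo: str = ""  # how to back the step out, when reversible


@dataclass
class IncidentLog:
    entries: List[Entry] = field(default_factory=list)
    mitigations: List[Mitigation] = field(default_factory=list)

    def note(self, text: str, confirmed: bool = False) -> None:
        """Record an observation; it counts as an assumption unless marked confirmed."""
        self.entries.append(Entry(text, confirmed))

    def confirmed(self) -> List[str]:
        return [e.text for e in self.entries if e.confirmed]

    def assumed(self) -> List[str]:
        return [e.text for e in self.entries if not e.confirmed]

    def add_mitigation(self, action: str, reversible: bool, undo: str = "") -> None:
        """Record a mitigation; reversible steps must say how to undo them."""
        if reversible and not undo:
            raise ValueError("reversible mitigation needs an undo description")
        self.mitigations.append(Mitigation(action, reversible, undo))

    def irreversible(self) -> List[Mitigation]:
        """Steps that cannot be backed out and deserve extra review under pressure."""
        return [m for m in self.mitigations if not m.reversible]
```

Defaulting `confirmed` to `False` is a deliberate choice in this sketch: a claim has to be promoted to a fact explicitly, so guesses made mid-incident do not quietly harden into the record.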

Output expectations

Stabilization plan or incident response summary
Clear mitigation status and next actions
Follow-up work for root cause, observability, and prevention
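
As a rough illustration of the outputs listed above, the sketch below renders a plain-text incident summary. The function name `render_summary`, its parameters, and the layout are assumptions made for this example; only the three follow-up categories come from the list above.

```python
from typing import Dict, List

# Follow-up categories taken from the output expectations above.
FOLLOW_UP_CATEGORIES = ["root cause", "observability", "prevention"]


def render_summary(
    mitigation_status: str,
    next_actions: List[str],
    follow_ups: Dict[str, List[str]],
) -> str:
    """Build a plain-text summary; empty sections are shown so gaps stay visible."""
    lines = [f"Mitigation status: {mitigation_status}", "", "Next actions:"]
    if next_actions:
        lines.extend(f"  - {action}" for action in next_actions)
    else:
        lines.append("  - (none recorded)")

    lines.extend(["", "Follow-up work:"])
    for category in FOLLOW_UP_CATEGORIES:
        items = follow_ups.get(category, [])
        lines.append(f"  {category}:")
        if items:
            lines.extend(f"    - {item}" for item in items)
        else:
            lines.append("    - (none recorded)")
    return "\n".join(lines)
```

Printing every category even when it is empty keeps missing follow-up work obvious to whoever reads the summary next.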

Quality checklist

Reducing user impact takes priority over root-cause work while the incident is active.
Risky irreversible changes are avoided under pressure.
Communication is clear enough for collaborators to act.
Post-incident follow-up is not lost after immediate recovery.

Handoff notes

Note what was mitigated versus actually fixed.
Pair with the debugging and observability workflows once the system is stable enough for deeper work.