Incident response and on-call
Incidents are normal in distributed systems. What matters is predictable response: clear roles, calm communication, tooling that supports triage, and learning that sticks. This page stays generic—adapt names, channels, and severities to your organization.
Related: Alerting, QA and reliability guide §4, SLOs and error budgets.
Roles (typical)
| Role | Focus |
|---|---|
| Incident commander / lead | Drives timeline, decisions, and comms; may delegate investigation. |
| Communications | Status updates for stakeholders and users (internal or external). |
| Technical investigators | Dig into metrics, logs, traces, recent changes. |
Small teams merge roles; even then, name who is doing what so work is not duplicated or stalled.
First minutes
- Acknowledge the page or ticket so others know it is owned.
- Triage severity using your org’s definitions (user impact, scope, data risk).
- Stabilize — stop bleeding (rollback, scale, feature flag, traffic shift) before chasing root cause when users are still impacted.
- Open a war room (chat or call) with a single source of truth for status.
Use observability in order: metrics for “what changed,” logs for detail, traces for cross-service latency.
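The triage step can be sketched as a small helper. The severity names, thresholds, and the `scope` values below are illustrative assumptions, not a standard; substitute your org's definitions.

```python
# Illustrative severity triage: labels and thresholds are assumptions --
# replace them with your organization's severity matrix.
def triage_severity(user_impact_pct: float, data_at_risk: bool, scope: str) -> str:
    """Map rough impact signals to a severity label."""
    if data_at_risk or user_impact_pct >= 50:
        return "SEV1"  # page broadly, open a war room immediately
    if user_impact_pct >= 5 or scope == "multi-service":
        return "SEV2"  # page the service on-call
    return "SEV3"  # handle during business hours

print(triage_severity(user_impact_pct=60, data_at_risk=False, scope="single-service"))
```

Encoding the rules, even roughly, keeps triage consistent across responders at 3 a.m.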
Communication
- Short, timestamped updates on what is known, what is being tried, and next update time.
- Avoid blame in public channels; focus on systems and facts.
- Coordinate before restarting or redeploying so you do not fight your teammates.
Escalation
Escalation paths depend on org structure. Common patterns:
- L1 — on-call for the service or platform.
- L2 — domain expert or dependency owner.
- L3 — vendor or infrastructure team.
Document how to page the next level (channel, tool, phone tree) before you need it.
On-call sustainability
| Practice | Why |
|---|---|
| Rotation | Spread load; avoid single points of failure for people. |
| Handoff | Outgoing shift summarizes open incidents and noisy alerts. |
| Alert budget | Too many pages → fatigue and missed real fires; tune alerts (see Alerting). |
| Follow-the-sun | For global teams, align rotations with time zones fairly. |
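A fair rotation is easy to generate mechanically. This is a minimal round-robin sketch; the names and one-week shift length are illustrative assumptions.

```python
# Minimal round-robin on-call rotation; engineers and shift length
# are illustrative assumptions.
from datetime import date, timedelta
from itertools import cycle

def build_rotation(engineers: list[str], start: date, weeks: int) -> list[dict]:
    """Assign one-week shifts in round-robin order."""
    schedule = []
    people = cycle(engineers)
    for week in range(weeks):
        schedule.append({
            "starts": start + timedelta(weeks=week),
            "on_call": next(people),
        })
    return schedule

rota = build_rotation(["ana", "bo", "chen"], date(2024, 1, 1), weeks=4)
# The fourth week wraps back to the first engineer.
```

A generated schedule also makes handoffs predictable: each outgoing engineer knows exactly who to brief.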
Blameless postmortem
After the service is stable, run a blameless review:
- Timeline — what happened, when, who did what.
- Impact — duration, users, SLO or budget consumed (see SLOs).
- Root causes — often multiple contributing factors; resist a single “human error” story.
- What went well — reinforce good tooling and decisions.
- Action items — each with an owner and due date; track to completion.
The goal is learning and system change, not assigning fault.
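Action items only stick if they are tracked as structured data. A minimal sketch of an incident record with an overdue check follows; the field names are assumptions, not a standard schema.

```python
# Sketch of a structured incident record with tracked action items;
# field names are illustrative, not a standard schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False

@dataclass
class Incident:
    title: str
    severity: str
    action_items: list[ActionItem] = field(default_factory=list)

    def overdue_items(self, today: date) -> list[ActionItem]:
        """Action items past their due date and not yet completed."""
        return [a for a in self.action_items if not a.done and a.due < today]
```

Querying for overdue items across incidents is what turns "track to completion" from an aspiration into a weekly report.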
Checklist
- Incident channel or ticket exists; severity agreed.
- Mitigate before deep root-cause analysis (RCA) when users are still impacted.
- Postmortem scheduled within a few business days for meaningful incidents.
- Action items tracked in your work system (not only in meeting notes).
Related
- Service readiness checklist — reduce surprise incidents
- Python incident records example — practice with structured data