
Incident response and on-call

By Atif Alam

Incidents are normal in distributed systems. What matters is predictable response: clear roles, calm communication, tooling that supports triage, and learning that sticks. This page stays generic—adapt names, channels, and severities to your organization.

Related: Alerting, QA and reliability guide §4, SLOs and error budgets.

Key roles:

  • Incident commander / lead: drives the timeline, decisions, and comms; may delegate investigation.
  • Communications: status updates for stakeholders and users (internal or external).
  • Technical investigators: dig into metrics, logs, traces, and recent changes.

Small teams merge roles—still name who is doing what so work does not duplicate or stall.

When a page fires:

  1. Acknowledge the page or ticket so others know it is owned.
  2. Triage severity using your org’s definitions (user impact, scope, data risk).
  3. Stabilize — stop bleeding (rollback, scale, feature flag, traffic shift) before chasing root cause when users are still impacted.
  4. Open a war room (chat or call) with a single source of truth for status.
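The triage step can be sketched as a small helper. The severity names and thresholds below are hypothetical examples, not a standard; substitute your org's definitions.

```python
# Hypothetical severity triage helper. The labels and thresholds are
# illustrative only -- adapt them to your org's severity definitions.

def triage_severity(users_affected_pct: float, data_at_risk: bool) -> str:
    """Map rough impact signals to a severity label."""
    if data_at_risk or users_affected_pct >= 50:
        return "SEV1"   # stabilize first, page broadly
    if users_affected_pct >= 5:
        return "SEV2"   # page on-call, open a war room
    return "SEV3"       # ticket, handle in business hours

print(triage_severity(60, False))  # SEV1
```

Encoding the definitions in one place keeps triage consistent across responders who join mid-incident.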

Use observability in order: metrics for “what changed,” logs for detail, traces for cross-service latency.

Communication during the incident:

  • Short, timestamped updates on what is known, what is being tried, and next update time.
  • Avoid blame in public channels; focus on systems and facts.
  • Coordinate before restarting or redeploying so you do not fight your teammates.
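Timestamped updates are easier to keep consistent with a small template. The format below is an assumption for illustration, not a standard:

```python
from datetime import datetime, timezone

def status_update(known: str, trying: str, next_update_min: int) -> str:
    """Format a short, timestamped status line (hypothetical template)."""
    ts = datetime.now(timezone.utc).strftime("%H:%M UTC")
    return (f"[{ts}] Known: {known} | Trying: {trying} | "
            f"Next update in {next_update_min} min")

print(status_update("checkout errors up 10x", "rolling back v1.42", 15))
```

A fixed template also makes the post-incident timeline easier to reconstruct from the channel history.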

Escalation paths depend on org structure. Common patterns:

  • L1 — on-call for the service or platform.
  • L2 — domain expert or dependency owner.
  • L3 — vendor or infrastructure team.

Document how to page the next level (channel, tool, phone tree) before you need it.
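Documenting the path can be as simple as a lookup table kept next to the runbook. The levels, names, and channels below are placeholders:

```python
# Hypothetical escalation map; the contacts and channels are placeholders.
ESCALATION = {
    1: {"who": "service on-call", "page_via": "#oncall-service"},
    2: {"who": "domain expert", "page_via": "#oncall-platform"},
    3: {"who": "vendor support", "page_via": "support portal"},
}

def next_level(current: int):
    """Return how to page the next escalation level, or None at the top."""
    return ESCALATION.get(current + 1)

print(next_level(1))  # {'who': 'domain expert', 'page_via': '#oncall-platform'}
```

The point is that escalation data lives in a machine- and human-readable place, not in one person's head.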

On-call practices:

  • Rotation: spread load; avoid single points of failure for people.
  • Handoff: the outgoing shift summarizes open incidents and noisy alerts.
  • Alert budget: too many pages → fatigue and missed real fires; tune alerts (see Alerting).
  • Follow-the-sun: for global teams, align rotations with time zones fairly.
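An alert budget can be checked mechanically at each handoff. The threshold below is a made-up example, not a recommendation:

```python
# Hypothetical alert-budget check: flag a rotation as over budget when
# the average pages per shift exceed a team-chosen threshold.
def over_alert_budget(pages_per_shift: list, max_per_shift: float = 2) -> bool:
    """pages_per_shift: count of pages in each shift of the rotation."""
    avg = sum(pages_per_shift) / len(pages_per_shift)
    return avg > max_per_shift

print(over_alert_budget([1, 4, 3, 5]))  # True: average 3.25 > 2
```

When the check fires, the follow-up is alert tuning (see Alerting), not asking people to absorb more pages.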

After the service is stable, run a blameless review:

  1. Timeline — what happened, when, who did what.
  2. Impact — duration, users, SLO or budget consumed (see SLOs).
  3. Root causes — often multiple contributing factors; resist a single “human error” story.
  4. What went well — reinforce good tooling and decisions.
  5. Action items — each with an owner and due date; track to completion.

The goal is learning and system change, not assigning fault.
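The "budget consumed" figure in the impact step is simple arithmetic. The 99.9% SLO, 30-day window, and outage duration below are illustrative numbers only:

```python
# Illustrative error-budget math (99.9% SLO over a 30-day window).
slo = 0.999
window_min = 30 * 24 * 60              # 43,200 minutes in the window
budget_min = (1 - slo) * window_min    # ~43.2 minutes of allowed downtime

outage_min = 20                        # full-outage duration from the timeline
consumed = outage_min / budget_min
print(f"{consumed:.0%} of the monthly error budget")  # 46% of the monthly error budget
```

Stating impact in budget terms (see SLOs) makes it comparable across incidents of different shapes.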

Checklist:

  • Incident channel or ticket exists; severity agreed.
  • Mitigation before deep RCA when users are hurt.
  • Postmortem scheduled within a few business days for meaningful incidents.
  • Action items tracked in your work system (not only in meeting notes).