
Incident response and on-call

By Atif Alam

Incidents are normal in distributed systems. What matters is predictable response: clear roles, calm communication, tooling that supports triage, and learning that sticks. This page stays generic—adapt names, channels, and severities to your organization.

Related: Alerting, QA and reliability guide §4, SLOs and error budgets.

Key roles:

  • Incident commander / lead: drives the timeline, decisions, and comms; may delegate investigation.
  • Communications: status updates for stakeholders and users (internal or external).
  • Technical investigators: dig into metrics, logs, traces, and recent changes.

Small teams merge roles—still name who is doing what so work does not duplicate or stall.

When a page fires:

  1. Acknowledge the page or ticket so others know it is owned.
  2. Triage severity using your org’s definitions (user impact, scope, data risk).
  3. Stabilize — stop bleeding (rollback, scale, feature flag, traffic shift) before chasing root cause when users are still impacted.
  4. Open a war room (chat or call) with a single source of truth for status.
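The triage step can be sketched as a small helper. The severity names and thresholds below are hypothetical examples, not a standard; substitute your org's definitions.

```python
# Hypothetical severity triage helper. The labels and thresholds are
# illustrative only -- adapt them to your org's severity definitions.

def triage_severity(users_affected_pct: float, data_at_risk: bool) -> str:
    """Map rough impact signals to a severity label."""
    if data_at_risk or users_affected_pct >= 50:
        return "SEV1"   # stabilize first, page broadly
    if users_affected_pct >= 5:
        return "SEV2"   # page on-call, open a war room
    return "SEV3"       # ticket, handle in business hours

print(triage_severity(60, False))  # SEV1
```

Encoding the definitions in one place keeps triage consistent across responders who join mid-incident.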

Use observability in order: metrics for “what changed,” logs for detail, traces for cross-service latency.

Communication during the incident:

  • Short, timestamped updates on what is known, what is being tried, and next update time.
  • Avoid blame in public channels; focus on systems and facts.
  • Coordinate before restarting or redeploying so you do not fight your teammates.
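Timestamped updates are easier to keep consistent with a small template. The format below is an assumption for illustration, not a standard:

```python
from datetime import datetime, timezone

def status_update(known: str, trying: str, next_update_min: int) -> str:
    """Format a short, timestamped status line (hypothetical template)."""
    ts = datetime.now(timezone.utc).strftime("%H:%M UTC")
    return (f"[{ts}] Known: {known} | Trying: {trying} | "
            f"Next update in {next_update_min} min")

print(status_update("checkout errors up 10x", "rolling back v1.42", 15))
```

A fixed template also makes the post-incident timeline easier to reconstruct from the channel history.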

Escalation paths depend on org structure. Common patterns:

  • L1 — on-call for the service or platform.
  • L2 — domain expert or dependency owner.
  • L3 — vendor or infrastructure team.

Document how to page the next level (channel, tool, phone tree) before you need it.
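Documenting the path can be as simple as a lookup table kept next to the runbook. The levels, names, and channels below are placeholders:

```python
# Hypothetical escalation map; the contacts and channels are placeholders.
ESCALATION = {
    1: {"who": "service on-call", "page_via": "#oncall-service"},
    2: {"who": "domain expert", "page_via": "#oncall-platform"},
    3: {"who": "vendor support", "page_via": "support portal"},
}

def next_level(current: int):
    """Return how to page the next escalation level, or None at the top."""
    return ESCALATION.get(current + 1)

print(next_level(1))  # {'who': 'domain expert', 'page_via': '#oncall-platform'}
```

The point is that escalation data lives in a machine- and human-readable place, not in one person's head.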

On-call practices:

  • Rotation: spread load; avoid single points of failure for people.
  • Handoff: the outgoing shift summarizes open incidents and noisy alerts.
  • Alert budget: too many pages → fatigue and missed real fires; tune alerts (see Alerting).
  • Follow-the-sun: for global teams, align rotations with time zones fairly.
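An alert budget can be checked mechanically at each handoff. The threshold below is a made-up example, not a recommendation:

```python
# Hypothetical alert-budget check: flag a rotation as over budget when
# the average pages per shift exceed a team-chosen threshold.
def over_alert_budget(pages_per_shift: list, max_per_shift: float = 2) -> bool:
    """pages_per_shift: count of pages in each shift of the rotation."""
    avg = sum(pages_per_shift) / len(pages_per_shift)
    return avg > max_per_shift

print(over_alert_budget([1, 4, 3, 5]))  # True: average 3.25 > 2
```

When the check fires, the follow-up is alert tuning (see Alerting), not asking people to absorb more pages.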

After the service is stable, run a blameless review:

  1. Timeline — what happened, when, who did what.
  2. Impact — duration, users, SLO or budget consumed (see SLOs).
  3. Root causes — often multiple contributing factors; resist a single “human error” story.
  4. What went well — reinforce good tooling and decisions.
  5. Action items — each with an owner and due date; track to completion.

The goal is learning and system change, not assigning fault.
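The "budget consumed" figure in the impact step is simple arithmetic. The 99.9% SLO, 30-day window, and outage duration below are illustrative numbers only:

```python
# Illustrative error-budget math (99.9% SLO over a 30-day window).
slo = 0.999
window_min = 30 * 24 * 60              # 43,200 minutes in the window
budget_min = (1 - slo) * window_min    # ~43.2 minutes of allowed downtime

outage_min = 20                        # full-outage duration from the timeline
consumed = outage_min / budget_min
print(f"{consumed:.0%} of the monthly error budget")  # 46% of the monthly error budget
```

Stating impact in budget terms (see SLOs) makes it comparable across incidents of different shapes.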

Checklist:

  • Incident channel or ticket exists; severity agreed.
  • Mitigation before deep RCA when users are hurt.
  • Postmortem scheduled within a few business days for meaningful incidents.
  • Action items tracked in your work system (not only in meeting notes).