
SLOs, SLIs, and error budgets

By Atif Alam

Reliability is not “zero incidents”—it is agreed behavior over time. SLIs measure that behavior; SLOs set the target; error budgets turn the gap between perfect and the target into a shared quantity you can spend (releases) or protect (hold the line).

This page is framework-agnostic. You implement SLIs with metrics—often in Prometheus—and policies with people and process. See Alerting for routing and noise control.

  • SLI (service level indicator) — a measurable signal of good service from the user’s perspective (e.g. successful HTTP requests, latency under a threshold).
  • SLO (service level objective) — a target for an SLI over a window (e.g. “99.9% of requests succeed per 30 days”).
  • Error budget — the allowable bad events implied by the SLO: if availability is 99.9%, you have 0.1% “budget” for errors in that window.
  • SLA (service level agreement) — often a contract with customers (refunds, credits). SLOs are usually internal targets that may be stricter than the SLA.

Prefer signals that reflect real user or business impact:

  • Availability — proportion of requests that succeed (or meet latency) over a period.
  • Latency — proportion of requests faster than a threshold (e.g. p99 under 300 ms).

Avoid infrastructure-only SLIs unless they clearly proxy user pain (e.g. “Kafka lag” only if it directly drives user-visible delay).
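A latency SLI of the "proportion under a threshold" kind can be read straight off a Prometheus histogram. A minimal sketch, assuming a histogram named `http_request_duration_seconds` with a bucket boundary at 0.3 s; your metric and bucket names will differ:

```promql
# Proportion of requests completing in under 300 ms over the last 5 minutes.
# Assumes a histogram http_request_duration_seconds with a le="0.3" bucket —
# substitute your own metric names, labels, and threshold.
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))
```

Note this only works if 0.3 is an actual bucket boundary; PromQL cannot interpolate a threshold that falls between buckets.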

  1. Choose the SLI — e.g. good_requests / total_requests for HTTP.
  2. Pick a time window — rolling 30 days is common; calendar months work for reporting.
  3. Set the target — e.g. 99.9% availability. Higher targets cost more engineering and infrastructure; cost vs reliability is a real tradeoff.
  4. Measure with the same math everywhere — dashboard, alert rules, and postmortems should agree on the definition.
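Step 4 ("same math everywhere") is commonly enforced with Prometheus recording rules, so dashboards, alerts, and reports all query one precomputed ratio instead of re-deriving it. A sketch, with illustrative metric and rule names:

```yaml
# Prometheus recording rule: one shared SLI definition.
# http_requests_total and its `code` label are assumptions — adjust
# to your own metrics and your own definition of a "good" request.
groups:
  - name: sli-rules
    rules:
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
```

Dashboards and alert expressions then reference `sli:http_availability:ratio_rate5m` rather than repeating the raw ratio.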

Example (conceptual): if total_requests and failed_requests exist as counters, availability over a window is roughly:

availability ≈ 1 - (failed_requests / total_requests)

In PromQL, you typically use rate() or increase() over a range and compare ratios—see your metrics’ exact names and labels.
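The conceptual formula above might look like this in PromQL, assuming a counter `http_requests_total` whose `code` label marks 5xx responses as failures (both names are assumptions, not a prescribed convention):

```promql
# Availability over a rolling 30-day window; 5xx responses count as failures.
# Replace the metric name, label, and failure definition with your own.
1 - (
    sum(increase(http_requests_total{code=~"5.."}[30d]))
  /
    sum(increase(http_requests_total[30d]))
)
```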

If your SLO is 99.9% good requests in 30 days, the error budget is 0.1% of requests in that window—requests that may fail without breaking the SLO. In time terms, 0.1% of 30 days is about 43 minutes: expressed as a full outage, that is all the downtime the window allows.

  • Budget remaining → you can accept more risk (experiments, faster deploys) if policy allows.
  • Budget exhausted or burning fast → slow releases, freeze non-critical changes, or invest in reliability until the burn slows.

Velocity vs reliability: shipping faster often consumes budget faster. Cost: tighter SLOs usually mean more redundancy, testing, and on-call attention—explicit targets make those tradeoffs discussable.

Alert on symptoms and budget burn, not every blip:

  • Fast burn — error rate is high enough that you will miss the SLO soon unless you act (page).
  • Slow burn — trend will miss the SLO over the window if it continues (ticket or lower urgency).

Google’s multi-window / multi-burn-rate approach is a common pattern; your Alertmanager routes can mirror severity. Keep pages actionable—see Alerting.
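A fast-burn page in that pattern might be sketched as the rule below, assuming a 99.9% SLO over 30 days and the same illustrative `http_requests_total` counter. The 14.4 factor is the commonly cited fast-burn rate for a 30-day window (roughly 2% of the monthly budget consumed per hour); the short 5-minute window makes the alert resolve quickly once the burn stops:

```yaml
# Fast-burn page: multi-window, multi-burn-rate (sketch, names illustrative).
# 0.001 is the budget fraction for a 99.9% SLO; 14.4 is the burn-rate factor.
groups:
  - name: slo-burn-alerts
    rules:
      - alert: HTTPErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
```

A slow-burn companion rule typically uses longer windows (e.g. 6h and 30m) with a lower burn-rate factor and routes to a ticket queue instead of a page.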

  • SLI definitions are documented and match dashboards and alerts.
  • SLO targets are agreed with product and engineering (not only SRE).
  • Error budget policy says what happens when budget is low (e.g. release freeze, reliability sprint).
  • Postmortems reference whether the incident consumed budget and what will change.