
Service readiness checklist

First Published By Atif Alam

Readiness means the team can operate the service: observe it, deploy it safely, and respond when things go wrong. Use this list as a starting point and adjust it for your risk level and compliance needs.

Related: Kubernetes production patterns, SLOs and error budgets, Pipeline fundamentals, QA and reliability guide.

  • Metrics expose the four golden signals (latency, traffic, errors, saturation) for the workload and are scraped or collected reliably.
  • Dashboards exist for normal operation and failure modes; someone owns keeping them accurate.
  • Logs are structured or searchable enough for incident triage; retention meets audit or debug needs.
  • Traces (if applicable) propagate context for critical paths.
  • Alerts fire on user-visible symptoms or SLO burn, not only on CPU graphs; see Alerting.
  • Probes (liveness/readiness/startup) match real dependencies; see Production patterns.
  • Resource requests and limits are set; the HPA or other scaling approach is documented.
  • A PodDisruptionBudget is defined where availability during node drains matters.
  • The rolling update strategy is appropriate for the workload; the rollback path has been tested.
  • Capacity understood for expected load (see capacity section in Production patterns).
  • Pipeline runs tests appropriate to risk (unit, integration, security scans as required).
  • Artifacts are immutable and traceable to a Git revision.
  • Deployment strategy (rolling, canary, blue/green) chosen with tradeoffs in mind.
  • Feature flags or config for safe disable of risky paths when needed.
  • On-call rotation and escalation path defined; see Incident response and on-call.
  • SLOs are agreed where applicable and the error budget policy is understood; see SLOs.
  • Runbooks or playbooks for common failures (even short bullets help).
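
For the logging item above, "structured enough for incident triage" can be as simple as one JSON object per line. The following is a minimal sketch using only the Python standard library. The logger name `checkout` and the fields `request_id`, `route`, and `status` are hypothetical and not taken from this checklist; substitute whatever your team searches on during incidents.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    # Hypothetical triage fields; callers attach them with `extra=`.
    TRIAGE_FIELDS = ("request_id", "route", "status")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.TRIAGE_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

Usage: `build_logger().error("payment failed", extra={"request_id": "abc123", "status": 502})` writes one line that a log backend can filter by `request_id` or `status` without regex parsing.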
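
The alerting and SLO items above mention alerting on SLO burn rather than on CPU graphs. As a rough illustration, burn rate is the observed error ratio divided by the error ratio the SLO allows, and a common pattern is to page only when both a long and a short window exceed a threshold. This sketch assumes request and error counts are already available per window. The `Window` type, the function names, and the example threshold of 14.4 are illustrative choices, not values prescribed by this checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    total: int   # requests observed in the window
    errors: int  # requests that failed from the user's point of view


def burn_rate(window: Window, slo_target: float) -> float:
    """Speed at which the error budget is being spent; 1.0 is exactly on budget."""
    if window.total == 0:
        return 0.0
    budget = 1.0 - slo_target  # allowed error ratio, about 0.001 for a 99.9% SLO
    return (window.errors / window.total) / budget


def should_page(long: Window, short: Window, slo_target: float, threshold: float) -> bool:
    """Page only when both windows burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long, slo_target) >= threshold
        and burn_rate(short, slo_target) >= threshold
    )


# Example: 99.9% SLO, 2% of requests failing in both windows.
one_hour = Window(total=100_000, errors=2_000)
five_minutes = Window(total=5_000, errors=100)
page = should_page(one_hour, five_minutes, slo_target=0.999, threshold=14.4)
```

A CPU spike that does not raise the error ratio leaves the burn rate unchanged, which is the point of the checklist item: the alert tracks what users experience.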
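
For the probes item above, one way to make probes "match real dependencies" is to keep liveness cheap and let readiness run the dependency checks. The sketch below uses only the Python standard library. The paths `/healthz` and `/readyz`, the port, and the check names are assumptions for illustration; your platform conventions may differ.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def probe_status(path: str, readiness_checks: Dict[str, Check]) -> Tuple[int, str]:
    """Map a probe path to an HTTP status code and a short body."""
    if path == "/healthz":
        # Liveness: the process can answer. Dependencies are not called here,
        # so a dependency outage does not restart otherwise healthy pods.
        return 200, "ok"
    if path == "/readyz":
        failed = []
        for name, check in readiness_checks.items():
            try:
                ok = check()
            except Exception:
                ok = False  # a check that raises counts as a failed dependency
            if not ok:
                failed.append(name)
        if failed:
            return 503, "not ready: " + ", ".join(sorted(failed))
        return 200, "ready"
    return 404, "not found"


def make_handler(readiness_checks: Dict[str, Check]):
    """Build a request handler class bound to the given checks."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = probe_status(self.path, readiness_checks)
            data = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def log_message(self, format, *args):
            pass  # keep probe traffic out of the access log

    return ProbeHandler


def serve(readiness_checks: Dict[str, Check], port: int = 8080) -> None:
    """Serve probe endpoints until the process is stopped."""
    HTTPServer(("0.0.0.0", port), make_handler(readiness_checks)).serve_forever()
```

Usage: call `serve({"database": can_reach_database})`, where `can_reach_database` is a hypothetical function returning `True` when a connection succeeds. Returning 503 from readiness removes the pod from load balancing without restarting it.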
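
The feature-flag item above asks for a safe way to disable risky paths. A minimal sketch, assuming flags arrive as environment variables named `FEATURE_<NAME>`: that naming scheme, the `new_pricing` flag, and the pricing functions are hypothetical. A dedicated flag service would replace the environment lookup, but the shape of the guard stays the same.

```python
import os
from typing import Mapping

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def flag_enabled(name: str, default: bool, env: Mapping[str, str] = os.environ) -> bool:
    """Read FEATURE_<NAME> from `env`; fall back to `default` if unset or unrecognized."""
    raw = env.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    return default  # unrecognized value: keep the safe default rather than guess


def new_pricing(amount_cents: int) -> int:
    """Hypothetical risky code path: applies a 10% discount."""
    return amount_cents - amount_cents // 10


def price_quote(amount_cents: int, env: Mapping[str, str] = os.environ) -> int:
    """Use the new path only when the flag is explicitly on."""
    if flag_enabled("new_pricing", default=False, env=env):
        return new_pricing(amount_cents)
    return amount_cents  # established path stays the default
```

Defaulting the flag to off means a missing or mistyped configuration value falls back to the established path, so disabling the risky path during an incident is a config change rather than a deploy.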