Observability Overview

By Atif Alam

Observability is the ability to understand what’s happening inside your systems by examining their outputs. When something breaks at 3 AM, observability is what lets you figure out why without guessing.

Pillar | What It Is | Example Tools
Metrics | Numeric measurements over time (CPU, memory, request rate, error rate, latency) | Prometheus, Datadog, CloudWatch
Logs | Timestamped text records of events (app errors, access logs, audit trails) | Loki, Elasticsearch, CloudWatch Logs
Traces | End-to-end request paths through distributed services | Jaeger, Tempo, Zipkin, OpenTelemetry

Each pillar answers different questions:

  • Metrics → “Is something wrong?” (alerting, dashboards)
  • Logs → “What exactly happened?” (debugging, audit)
  • Traces → “Where is the bottleneck?” (latency analysis across services)
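As a rough illustration of how the three questions chain together during an incident, the sketch below runs them against a few in-memory records. All data, field names (`self_ms`, `trace_id`), and function names are hypothetical; real values would come from a metrics store, a log store, and a trace store.

```python
# Hypothetical in-memory telemetry for one service (illustrative data only).
request_counts = {"total": 200, "errors": 14}

logs = [
    {"ts": 100, "level": "info", "msg": "request ok", "trace_id": "t1"},
    {"ts": 101, "level": "error", "msg": "db timeout", "trace_id": "t2"},
    {"ts": 102, "level": "error", "msg": "db timeout", "trace_id": "t3"},
]

# Spans of trace t2; self_ms is time spent in the span itself,
# excluding time spent waiting on child spans.
spans = [
    {"trace_id": "t2", "service": "gateway", "self_ms": 12},
    {"trace_id": "t2", "service": "orders", "self_ms": 35},
    {"trace_id": "t2", "service": "database", "self_ms": 950},
]


def error_rate(counts):
    """Metrics: is something wrong?"""
    return counts["errors"] / counts["total"]


def error_messages(records):
    """Logs: what exactly happened?"""
    return [(r["trace_id"], r["msg"]) for r in records if r["level"] == "error"]


def slowest_span(trace_spans, trace_id):
    """Traces: where is the bottleneck?"""
    candidates = [s for s in trace_spans if s["trace_id"] == trace_id]
    return max(candidates, key=lambda s: s["self_ms"])


if error_rate(request_counts) > 0.05:            # assumed alert threshold of 5%
    failing = error_messages(logs)               # which requests failed, and how
    first_trace = failing[0][0]                  # follow one failing request
    bottleneck = slowest_span(spans, first_trace)
    print(failing, bottleneck["service"])
```

The trace ID carried in each log record is what makes the jump from a log line to its trace possible.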

Network evidence complements the three pillars for platform and connectivity incidents. VPC flow logs show allow/deny decisions and traffic volume at the cloud boundary (see Flow logs and network RCA); packet captures show TCP/TLS behavior on hosts (see Packet capture). Use them after metrics have narrowed the time window and identified the owning service.
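A minimal sketch of that workflow, assuming the flow logs are AWS VPC Flow Logs in the default (version 2) space-separated format; other clouds use different layouts. The sample lines, the `FlowRecord` type, and the function names are illustrative, not part of any library.

```python
from typing import NamedTuple, Optional

# Field order of the AWS VPC Flow Logs default (version 2) format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


class FlowRecord(NamedTuple):
    srcaddr: str
    dstaddr: str
    dstport: int
    byte_count: int
    start: int      # capture window start, Unix seconds
    action: str     # "ACCEPT" or "REJECT"


def parse_flow_line(line: str) -> Optional[FlowRecord]:
    """Parse one record; return None for NODATA/SKIPDATA records."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    row = dict(zip(FIELDS, parts))
    if row["log_status"] != "OK":
        return None  # these records carry "-" placeholders, not traffic data
    return FlowRecord(
        row["srcaddr"], row["dstaddr"], int(row["dstport"]),
        int(row["bytes"]), int(row["start"]), row["action"],
    )


def rejected_in_window(records, window_start, window_end):
    """Keep REJECT records whose capture window starts inside the incident window."""
    return [
        r for r in records
        if r is not None and r.action == "REJECT"
        and window_start <= r.start <= window_end
    ]


# Illustrative sample records.
SAMPLE_LINES = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
    "2 123456789010 eni-1235b8ca123456789 - - - - - - - 1431280876 1431280934 - NODATA",
]

records = [parse_flow_line(line) for line in SAMPLE_LINES]
for rec in rejected_in_window(records, 1418530000, 1418530100):
    print(rec.srcaddr, "->", rec.dstaddr, "port", rec.dstport, "REJECT")
```

The time window passed to `rejected_in_window` is the one that metrics and alerts have already narrowed down.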

  • Detect problems before users do — Alerts on error rates, latency spikes, resource exhaustion.
  • Reduce mean time to resolution (MTTR) — Dashboards and logs help you find the root cause fast.
  • Capacity planning — Metrics over time show growth trends, so you can scale before hitting limits.
  • SLOs and SLAs: SLOs, SLIs, and error budgets tie metrics to agreed targets and release tradeoffs; SLAs are often contractual.
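The error-budget idea in the last bullet comes down to simple arithmetic. The sketch below assumes a request-based availability SLI; the function names and the numbers in the usage lines are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize error-budget consumption for a request-based SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "consumed_fraction": failed_requests / allowed_failures,
    }


def allowed_downtime_minutes(slo_target, window_days):
    """Time-based view: minutes of full outage the SLO tolerates per window."""
    return (1 - slo_target) * window_days * 24 * 60


# Hypothetical month: 1,000,000 requests, 400 of them failed, 99.9% SLO.
budget = error_budget(0.999, 1_000_000, 400)
print(f"budget consumed: {budget['consumed_fraction']:.0%}")
print(f"downtime allowed in 30 days: {allowed_downtime_minutes(0.999, 30):.1f} min")
```

With these numbers the SLO tolerates about 1,000 failed requests, so 400 failures consume roughly 40% of the budget. A team might slow releases as the consumed fraction approaches 100%.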

This section focuses on the open-source stack most commonly used for Kubernetes and cloud-native monitoring:

                scrape                       query metrics
┌───────────┐ ◄────────── ┌──────────────┐ ◄──────────────── ┌─────────────┐
│ Your App  │             │  Prometheus  │                   │             │
│ (metrics) │             │    (TSDB)    │────► Alert Rules  │   Grafana   │
└───────────┘             └──────────────┘     ──► Alertmgr  │ (dashboards)│
                                                             │             │
               push logs                      query logs     │             │
┌───────────┐ ──────────► ┌──────────────┐ ◄──────────────── │             │
│ Your App  │             │     Loki     │                   │             │
│  (logs)   │             │ (log store)  │                   └─────────────┘
└───────────┘             └──────────────┘

            OpenTelemetry / OTLP                  query traces
┌───────────┐ ────────────────► ┌───────────────┐ ◄──────────── ┌─────────────┐
│ Your App  │                   │ Grafana Tempo │               │   Grafana   │
│ (traces)  │                   │ (trace store) │               │  (Explore)  │
└───────────┘                   └───────────────┘               └─────────────┘

Component | Role
Prometheus | Scrapes and stores metrics, evaluates alert rules
Grafana | Visualizes metrics, logs, and traces; dashboards and Explore
Alertmanager | Routes, groups, and delivers alerts (Slack, PagerDuty, email)
Loki | Stores and queries logs (like Prometheus but for logs)
Grafana Tempo | Stores traces; often fed by OpenTelemetry; query from Grafana
Exporters | Expose metrics from systems that don’t natively support Prometheus (Node exporter, MySQL exporter, etc.)

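To make the scrape relationship concrete, here is a standard-library sketch of what an application or exporter exposes: a `/metrics` endpoint that returns samples in the Prometheus text exposition format. The metric name, label values, counter state, and port are assumptions for illustration; a real service would normally use an official client library rather than hand-rendering the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counter state: a tuple of (label, value) pairs -> sample value.
REQUESTS = {
    (("method", "get"), ("status", "200")): 1027,
    (("method", "post"), ("status", "500")): 3,
}


def escape_label(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{escape_label(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_counter(
            "http_requests_total", "Total HTTP requests.", REQUESTS
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; Prometheus would scrape http://<host>:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus pulls from it on its scrape interval, which is why the arrow in the diagram points from Prometheus to the app.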
  • Monitoring is a subset of observability — it tells you something is wrong (dashboards, alerts).
  • Observability goes further — it lets you ask arbitrary questions about your system to understand why (ad-hoc queries, correlation across metrics/logs/traces).

In practice, the term “observability” is used broadly to cover both.
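One way to picture "arbitrary questions" is slicing the same raw events by whichever dimension looks suspicious, without having defined that breakdown in advance. The sketch below uses hypothetical request events and a hypothetical `breakdown` helper; in the stack described here, the equivalent would be ad-hoc PromQL or LogQL queries in Grafana Explore.

```python
from collections import defaultdict
from statistics import median

# Hypothetical request events carrying several dimensions (illustrative data).
events = [
    {"route": "/checkout", "region": "eu", "tier": "free", "latency_ms": 900, "error": True},
    {"route": "/checkout", "region": "eu", "tier": "paid", "latency_ms": 850, "error": True},
    {"route": "/checkout", "region": "us", "tier": "free", "latency_ms": 110, "error": False},
    {"route": "/search", "region": "eu", "tier": "free", "latency_ms": 95, "error": False},
    {"route": "/search", "region": "us", "tier": "paid", "latency_ms": 105, "error": False},
    {"route": "/checkout", "region": "us", "tier": "paid", "latency_ms": 120, "error": False},
]


def breakdown(records, key):
    """Group events by any field and summarize each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return {
        value: {
            "count": len(group),
            "error_rate": sum(r["error"] for r in group) / len(group),
            "median_latency_ms": median(r["latency_ms"] for r in group),
        }
        for value, group in groups.items()
    }


# Monitoring says "errors are up"; ad-hoc slicing narrows down where.
by_route = breakdown(events, "route")
checkout_only = [e for e in events if e["route"] == "/checkout"]
by_region = breakdown(checkout_only, "region")
print(by_route["/checkout"]["error_rate"], by_region["eu"]["error_rate"])
```

In this made-up data, the first slice points at `/checkout` and the second shows the failures are confined to the `eu` region, a question no predefined dashboard had to anticipate.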

  1. Prometheus — metrics model and PromQL.
  2. Grafana — dashboards and data sources.
  3. SLOs, SLIs, and error budgets — how metrics become targets and budgets.
  4. Alerting — Alertmanager, noise, and severity.
  5. Exporters — node, blackbox, and app metrics.
  6. Loki — logs and LogQL.
  7. Observability Setup — compose or Kubernetes bring-up.
  8. OpenTelemetry — unified instrumentation.
  9. Distributed Tracing — spans, Tempo/Jaeger, propagation.
  10. Scaling Prometheus — Thanos, Mimir, long-term storage.

Skip steps you already know; use the list as a skills path for the Prometheus–Grafana–Loki–Tempo–OTel stack.

Start with Prometheus (metrics collection), then Grafana (visualization), then SLOs, alerting, exporters, Loki (logs), setup, OpenTelemetry, tracing, and scaling.

  • Prometheus — Architecture, scrape config, metric types, and PromQL queries.
  • Grafana — Data sources, dashboards, panels, variables, and visualization types.
  • SLOs, SLIs, and error budgets — SLIs, SLO targets, error budgets, and tradeoffs with velocity and cost.
  • Alerting — Alertmanager routing, Grafana alerts, alert design best practices.
  • Exporters — Node exporter, blackbox exporter, application instrumentation, and custom exporters.
  • Loki — Log aggregation, LogQL, labels, and Grafana integration.
  • Observability Setup — Docker Compose and Kubernetes setup for the full stack from zero to dashboards.
  • OpenTelemetry — Unified instrumentation for metrics, logs, and traces with the OTel SDK, auto-instrumentation, and the OTel Collector.
  • Distributed Tracing — Following requests across services with spans, context propagation, Jaeger, and Grafana Tempo.
  • Scaling Prometheus — Long-term storage and global querying with Thanos, Cortex, and Grafana Mimir.
  • AIOps — AI-assisted anomaly detection, correlation, and diagnostics that build on metrics, logs, and traces.