Observability Overview

First Published: Feb 16, 2026 | Last Updated: Mar 31, 2026 | By: Atif Alam

Observability is the ability to understand what’s happening inside your systems by examining their outputs. When something breaks at 3 AM, observability is what lets you figure out why without guessing.

The Three Pillars

Pillar	What It Is	Example Tools
Metrics	Numeric measurements over time (CPU, memory, request rate, error rate, latency)	Prometheus, Datadog, CloudWatch
Logs	Timestamped text records of events (app errors, access logs, audit trails)	Loki, Elasticsearch, CloudWatch Logs
Traces	End-to-end request paths through distributed services	Jaeger, Tempo, Zipkin, OpenTelemetry

Each pillar answers different questions:

Metrics → “Is something wrong?” (alerting, dashboards)
Logs → “What exactly happened?” (debugging, audit)
Traces → “Where is the bottleneck?” (latency analysis across services)
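To make the split concrete, here is a minimal sketch of one request emitting all three signal types. It uses only the Python standard library. The names `request_counter`, `log_lines`, `spans`, and `handle_request` are hypothetical in-memory stand-ins, not part of any real metrics, logging, or tracing library.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory stand-ins for a metrics backend, a log store,
# and a trace store. A real setup would ship these to separate systems.
request_counter = Counter()   # metrics: aggregated numbers
log_lines = []                # logs: timestamped event records
spans = []                    # traces: timed steps linked by a trace ID


def handle_request(path, fail=False):
    """Handle one fake request and emit a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex
    start = time.monotonic()
    status = 500 if fail else 200

    # Metric: cheap to aggregate, answers "is something wrong?"
    request_counter[(path, status)] += 1

    # Log: detailed record, answers "what exactly happened?"
    log_lines.append(json.dumps({
        "ts": time.time(),
        "level": "error" if fail else "info",
        "path": path,
        "status": status,
        "trace_id": trace_id,
    }))

    # Span: duration of one step, answers "where is the bottleneck?"
    spans.append({
        "trace_id": trace_id,
        "name": f"GET {path}",
        "duration_s": time.monotonic() - start,
    })
    return trace_id


handle_request("/checkout")
failed_trace = handle_request("/checkout", fail=True)
```

Sharing `trace_id` between the log line and the span is what lets you jump from an error log to the matching trace later.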

Network evidence complements the three pillars for platform and connectivity incidents: VPC flow logs show allowed and denied traffic and its volume at the cloud boundary (see Flow logs and network RCA), and packet captures show TCP/TLS behavior on hosts (see Packet capture). Use them once metrics have narrowed down the time window and the owning service.
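As a rough illustration of using flow logs after metrics have given you a time window, the sketch below filters for rejected traffic. It assumes records in the AWS VPC flow log default (version 2) space-separated format; other clouds and custom formats use different fields. The sample records, addresses, and the helper names `parse_record` and `rejected_in_window` are made up for this example.

```python
# Field order of the AWS VPC flow log default (version 2) format.
# This is an assumption: custom formats and other providers differ.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
INT_FIELDS = ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end")


def parse_record(line):
    """Parse one record; return None for NODATA/SKIPDATA rows."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    if record["log_status"] != "OK":
        return None  # these rows carry "-" placeholders instead of numbers
    for key in INT_FIELDS:
        record[key] = int(record[key])
    return record


def rejected_in_window(lines, window_start, window_end):
    """Return REJECT records whose capture interval overlaps the window."""
    hits = []
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        overlaps = record["start"] < window_end and record["end"] > window_start
        if record["action"] == "REJECT" and overlaps:
            hits.append(record)
    return hits


# Made-up sample records (epoch seconds for start/end).
SAMPLE = [
    "2 111122223333 eni-0abc 10.0.1.5 10.0.2.9 44321 5432 6 3 180 1700000000 1700000060 REJECT OK",
    "2 111122223333 eni-0abc 10.0.1.5 10.0.2.9 44322 443 6 10 5200 1700000000 1700000060 ACCEPT OK",
    "2 111122223333 eni-0abc 10.0.1.7 10.0.2.9 50000 5432 6 2 120 1699990000 1699990060 REJECT OK",
    "2 111122223333 eni-0abc - - - - - - - 1700000000 1700000060 - NODATA",
]

# Window taken from wherever the metrics first showed the problem.
hits = rejected_in_window(SAMPLE, window_start=1699999900, window_end=1700000300)
```

With this sample, only the first record is both rejected and inside the window, pointing at denied traffic to port 5432.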

Why Monitoring Matters

Detect problems before users do — Alerts on error rates, latency spikes, resource exhaustion.
Reduce mean time to resolution (MTTR) — Dashboards and logs help you find the root cause fast.
Capacity planning — Metrics over time show growth trends, so you can scale before hitting limits.
SLOs and SLAs — SLOs, SLIs, and error budgets tie metrics to agreed targets and release tradeoffs; SLAs are often contractual.
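The arithmetic behind an error budget is small enough to sketch. The example below assumes a request-based availability SLI (good requests divided by total requests). The function name `error_budget` and the traffic numbers are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize error budget usage for a request-based availability SLO.

    slo_target is a fraction such as 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "sli": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }


# Hypothetical month: 1,000,000 requests, 400 failures, 99.9% target.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
```

Here the SLO allows about 1,000 failed requests, so 400 failures use roughly 40% of the budget. A team might slow releases as `budget_consumed` approaches 1.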

The Prometheus + Grafana Ecosystem

This section focuses on the open-source stack most commonly used for Kubernetes and cloud-native monitoring:

                  scrape                        query metrics
┌───────────┐  ◄──────────  ┌──────────────┐  ◄──────────────  ┌─────────────┐
│  Your App │               │  Prometheus   │                   │             │
│ (metrics) │               │  (TSDB)       │────► Alert Rules  │   Grafana   │
└───────────┘               └──────────────┘     ──► Alertmgr   │ (dashboards)│
                                                                │             │
                  push logs                     query logs       │             │
┌───────────┐  ──────────►  ┌──────────────┐  ◄──────────────  │             │
│  Your App │               │    Loki       │                   │             │
│  (logs)   │               │  (log store)  │                   └─────────────┘
└───────────┘               └──────────────┘

                  OpenTelemetry / OTLP          query traces
┌───────────┐  ────────────────►  ┌──────────────┐  ◄────────────  ┌─────────────┐
│  Your App │                    │ Grafana Tempo │                │   Grafana   │
│  (traces) │                    │ (trace store) │                │  (Explore)  │
└───────────┘                    └──────────────┘                └─────────────┘

Component	Role
Prometheus	Scrapes and stores metrics, evaluates alert rules
Grafana	Visualizes metrics, logs, and traces; dashboards and Explore
Alertmanager	Routes, groups, and delivers alerts (Slack, PagerDuty, email)
Loki	Stores and queries logs; indexes labels rather than full log text, much as Prometheus labels metrics
Grafana Tempo	Stores traces; often fed by OpenTelemetry; query from Grafana
Exporters	Expose metrics from systems that don’t natively support Prometheus (Node exporter, MySQL exporter, etc.)
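The "scrape" arrow in the diagram means Prometheus pulls metrics over HTTP from an endpoint the app exposes, conventionally `/metrics`. The sketch below hand-rolls that endpoint with the Python standard library to show what the text exposition format looks like. In practice you would more likely use an official client library. The metric name `http_requests_total`, the port, and the helpers `record_request`, `render_metrics`, and `serve` are assumptions for illustration.

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counter keyed by (method, status) label values.
REQUESTS = Counter()


def record_request(method, status):
    """Increment the counter for one handled request."""
    REQUESTS[(method, str(status))] += 1


def render_metrics():
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(REQUESTS.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current metrics at /metrics for Prometheus to scrape."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve metrics; call this from your own entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


record_request("GET", 200)
record_request("GET", 200)
record_request("POST", 500)
```

Calling `serve()` and pointing a Prometheus scrape job at that host and port would let Prometheus collect the counter on each scrape interval. Exporters do the same job for systems that cannot expose this format themselves.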

Monitoring vs Observability

Monitoring is a subset of observability — it tells you something is wrong (dashboards, alerts).
Observability goes further — it lets you ask arbitrary questions about your system to understand why (ad-hoc queries, correlation across metrics/logs/traces).

In practice, the term “observability” is used broadly to cover both.
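The difference can be sketched with a handful of structured events. The monitoring-style check answers one predefined question, while the observability-style query slices the same data by whatever dimensions you choose during an incident. The event fields, the 5% threshold, and the function names are all hypothetical.

```python
from collections import defaultdict

# Made-up request events with a few dimensions attached.
EVENTS = [
    {"service": "checkout", "region": "eu", "version": "v2", "status": 500},
    {"service": "checkout", "region": "eu", "version": "v2", "status": 500},
    {"service": "checkout", "region": "us", "version": "v1", "status": 200},
    {"service": "checkout", "region": "us", "version": "v2", "status": 200},
    {"service": "search", "region": "eu", "version": "v1", "status": 200},
    {"service": "search", "region": "us", "version": "v1", "status": 200},
]


def error_rate(events):
    """Fraction of events with a 5xx status."""
    if not events:
        return 0.0
    return sum(e["status"] >= 500 for e in events) / len(events)


def alert_fires(events, threshold=0.05):
    """Monitoring: a fixed question decided ahead of time."""
    return error_rate(events) > threshold


def error_rate_by(events, *dimensions):
    """Observability: group by any dimensions picked at query time."""
    groups = defaultdict(list)
    for event in events:
        groups[tuple(event[d] for d in dimensions)].append(event)
    return {key: error_rate(group) for key, group in groups.items()}


firing = alert_fires(EVENTS)
by_region_version = error_rate_by(EVENTS, "region", "version")
```

The alert only says the overall error rate is too high. Grouping by region and version shows that every failure comes from `v2` in `eu`, which is the kind of ad-hoc question the alert was never written to answer.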

Topics in This Section

Start with Prometheus (metrics collection), then Grafana (visualization), then SLOs, alerting, exporters, Loki (logs), setup, OpenTelemetry, tracing, and scaling.

Prometheus — Architecture, scrape config, metric types, and PromQL queries.
Grafana — Data sources, dashboards, panels, variables, and visualization types.
SLOs, SLIs, and error budgets — SLIs, SLO targets, error budgets, and tradeoffs with velocity and cost.
Alerting — Alertmanager routing, Grafana alerts, alert design best practices.
Exporters — Node exporter, blackbox exporter, application instrumentation, and custom exporters.
Loki — Log aggregation, LogQL, labels, and Grafana integration.
Observability Setup — Docker Compose and Kubernetes setup for the full stack from zero to dashboards.
OpenTelemetry — Unified instrumentation for metrics, logs, and traces with the OTel SDK, auto-instrumentation, and the OTel Collector.
Distributed Tracing — Following requests across services with spans, context propagation, Jaeger, and Grafana Tempo.
Scaling Prometheus — Long-term storage and global querying with Thanos, Cortex, and Grafana Mimir.

AIOps — AI-assisted anomaly detection, correlation, and diagnostics that build on metrics, logs, and traces.