Grafana
Grafana is an open-source visualization and analytics platform. It connects to data sources (Prometheus, Loki, Elasticsearch, CloudWatch, etc.) and turns queries into dashboards, charts, and alerts.
Data Sources
Grafana doesn’t store data; it queries external sources. Add a data source in Configuration → Data Sources (Connections → Data sources in recent Grafana versions):
| Data Source | Used For |
|---|---|
| Prometheus | Metrics (PromQL queries) |
| Loki | Logs (LogQL queries) |
| Elasticsearch | Logs, metrics, search |
| CloudWatch | AWS metrics and logs |
| InfluxDB | Time-series metrics |
| PostgreSQL / MySQL | Business data, custom queries |
| Tempo / Jaeger | Distributed traces |
You can have multiple data sources of the same type (e.g. one Prometheus for production, another for staging).
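As a rough sketch of how two Prometheus instances might sit side by side, the Python below builds a data source provisioning document with one entry per environment. The names `Prometheus-Prod` and `Prometheus-Staging` and both URLs are hypothetical. The output is JSON, which YAML parsers generally accept, so it should be usable as a provisioning file, though that is an assumption worth checking against your Grafana version.

```python
import json


def build_datasources(prometheus_urls, default_name):
    """Build a provisioning document with one Prometheus entry per environment."""
    if default_name not in prometheus_urls:
        raise ValueError(f"default data source {default_name!r} is not defined")
    return {
        "apiVersion": 1,
        "datasources": [
            {
                "name": name,
                "type": "prometheus",
                "url": url,
                # Exactly one entry is marked as the default.
                "isDefault": name == default_name,
            }
            for name, url in prometheus_urls.items()
        ],
    }


# Hypothetical environments; replace with real names and URLs.
doc = build_datasources(
    {
        "Prometheus-Prod": "http://prometheus.prod:9090",
        "Prometheus-Staging": "http://prometheus.staging:9090",
    },
    default_name="Prometheus-Prod",
)
print(json.dumps(doc, indent=2))
```

Giving each instance a distinct name is what lets a panel or variable pick production or staging explicitly.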
Dashboards
A dashboard is a collection of panels (charts, tables, stats) arranged in rows.
Creating a Dashboard
- Click + → New Dashboard.
- Add a panel.
- Choose a data source and write a query.
- Select a visualization type.
- Configure panel options (title, legend, thresholds).
- Save the dashboard.
Dashboard JSON Model
Dashboards are stored as JSON. You can:
- Export a dashboard as JSON for version control.
- Import a JSON file or paste a dashboard ID from grafana.com/dashboards.
- Provision dashboards from files on disk (for GitOps / config-as-code).
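To make the JSON model concrete, here is a minimal sketch that assembles a one-panel dashboard as a Python dict and serializes it. The `uid`, title, and helper name `timeseries_panel` are hypothetical, and a real export contains many more fields (for example `schemaVersion`), so treat this as an illustration of the overall shape rather than a complete definition.

```python
import json


def timeseries_panel(panel_id, title, expr, x=0, y=0, w=12, h=8):
    """Return a minimal time series panel with a single Prometheus query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
        "targets": [{"refId": "A", "expr": expr}],
    }


dashboard = {
    "uid": "service-health",    # hypothetical stable identifier
    "title": "Service Health",  # hypothetical title
    "time": {"from": "now-1h", "to": "now"},
    "templating": {"list": []},
    "panels": [
        timeseries_panel(1, "Request rate", "rate(http_requests_total[5m])"),
    ],
}

# Sorted keys and fixed indentation keep diffs small in version control.
exported = json.dumps(dashboard, indent=2, sort_keys=True)
print(exported)
```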
Provisioning (Config as Code)
Place YAML configs and JSON dashboards in Grafana’s provisioning directory.

A dashboard provider, which tells Grafana where to find dashboard JSON files:

    apiVersion: 1
    providers:
      - name: default
        folder: ""
        type: file
        options:
          path: /var/lib/grafana/dashboards

A data source definition:

    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus:9090
        isDefault: true

This lets you deploy Grafana with dashboards and data sources pre-configured, with no manual setup.
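Provisioned dashboards are identified by their `uid`, so two files sharing one value tend to cause trouble. The sketch below is one possible pre-deploy check, not part of Grafana itself: `load_dashboards` and `duplicate_uids` are hypothetical helper names, and the directory passed in is whatever path your provider file points at.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_dashboards(directory):
    """Parse every *.json file in a directory into {filename: dashboard}."""
    return {
        path.name: json.loads(path.read_text())
        for path in sorted(Path(directory).glob("*.json"))
    }


def duplicate_uids(dashboards):
    """Return {uid: [filenames]} for every uid used by more than one file."""
    seen = defaultdict(list)
    for filename, dashboard in dashboards.items():
        seen[dashboard.get("uid")].append(filename)
    return {
        uid: files
        for uid, files in seen.items()
        if uid is not None and len(files) > 1
    }
```

In a CI job this could run as `duplicate_uids(load_dashboards("dashboards/"))` and fail the build when the result is non-empty.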
Panels and Visualization Types
Time Series (Default)
Line/area/bar chart over time. The most common panel:

    rate(http_requests_total[5m])

Options: line width, fill opacity, gradient, stacking, point size, thresholds.
Stat
Single large number with optional sparkline. Good for KPIs:

    sum(rate(http_requests_total[5m]))

Shows: “2,345 req/s”
Gauge
Circular gauge showing a value against a range:

    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

Shows: 72% (memory available) with color thresholds (green/yellow/red).
Bar Chart
Compare values across categories:

    sum by (method) (rate(http_requests_total[5m]))

Table
Tabular data with sortable columns:

    topk(10, sum by (instance) (rate(http_requests_total[5m])))

Heatmap
Visualize distributions over time (e.g. latency buckets):

    sum by (le) (rate(http_request_duration_seconds_bucket[5m]))

Logs Panel
Display log lines from Loki or Elasticsearch:

    {app="my-app"} |= "error"

Other Types
- Pie chart: Proportions
- State timeline: Status over time (up/down/degraded)
- Alert list: Current firing alerts
- Text: Markdown or HTML for notes and documentation
Variables (Templating)
Variables make dashboards dynamic: users can switch between environments, hosts, or services without editing queries.
Defining a Variable
Dashboard Settings → Variables → Add variable:

    Name: instance
    Type: Query
    Data source: Prometheus
    Query: label_values(up, instance)

This populates a dropdown with all instance label values.
Using Variables in Queries
    rate(http_requests_total{instance="$instance"}[5m])

The $instance placeholder is replaced with the value selected in the dropdown.
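The substitution can be pictured as plain text replacement. The sketch below is a simplified model, not Grafana's actual implementation: it handles only the `$name` and `${name}` forms with a single selected value, and ignores multi-value selections and formatting options. The function name `interpolate` and the sample value are hypothetical.

```python
import re

# Matches $name or ${name}; a simplification of Grafana's variable syntax.
_VARIABLE = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def interpolate(query, variables):
    """Replace $name / ${name} with selected values; unknown names stay as-is."""
    def substitute(match):
        name = match.group(1) or match.group(2)
        return str(variables.get(name, match.group(0)))

    return _VARIABLE.sub(substitute, query)


query = 'rate(http_requests_total{instance="$instance"}[5m])'
print(interpolate(query, {"instance": "web-1:9100"}))
```

Leaving unknown names untouched keeps placeholders the sketch knows nothing about intact in the query text.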
Common Variable Patterns
| Variable | Query | Purpose |
|---|---|---|
| job | label_values(up, job) | Select by job |
| instance | label_values(up{job="$job"}, instance) | Chain: instances for selected job |
| namespace | label_values(kube_pod_info, namespace) | Kubernetes namespace |
| interval | Custom: 1m, 5m, 15m, 1h | Adjustable query window (e.g. the range in rate()) |
Chained Variables
When one variable depends on another (e.g. namespace → pod):
- Create a namespace variable: label_values(kube_pod_info, namespace)
- Create a pod variable: label_values(kube_pod_info{namespace="$namespace"}, pod)
Selecting a namespace automatically filters the pod list.
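The filtering behind a chained variable can be modelled in a few lines. This sketch mimics what `label_values` returns over an in-memory list of label sets; the series data is made up for illustration, and the Python function only imitates the query function's result, it does not talk to Prometheus.

```python
def label_values(series, label, **matchers):
    """Return sorted distinct values of `label` among series matching all matchers."""
    return sorted({
        labels[label]
        for labels in series
        if label in labels
        and all(labels.get(key) == value for key, value in matchers.items())
    })


# Hypothetical kube_pod_info series, reduced to their labels.
series = [
    {"namespace": "shop", "pod": "cart-7d9f"},
    {"namespace": "shop", "pod": "checkout-5c2a"},
    {"namespace": "monitoring", "pod": "grafana-0"},
]

namespaces = label_values(series, "namespace")        # options for $namespace
pods = label_values(series, "pod", namespace="shop")  # options for $pod once "shop" is selected
print(namespaces, pods)
```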
Repeating Panels
Repeat a panel for each value of a variable:
- Set the variable to allow multi-value selection.
- In the panel, enable Repeat → Variable: instance.
Grafana creates one panel per selected instance — useful for “per-host” views.
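The expansion can be approximated as cloning a template panel once per selected value and laying the copies out on the dashboard grid, which is 24 units wide. This is only a model of the idea: in the stored JSON the template panel carries a `repeat` field and Grafana expands it at render time. The helper name `repeat_panel` and the sample template are hypothetical.

```python
import copy

GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 units wide


def repeat_panel(template, variable, values, width=8, height=6):
    """Clone `template` once per value, wrapping to a new row when the grid is full."""
    per_row = GRID_COLUMNS // width
    placeholder = f"${variable}"
    panels = []
    for index, value in enumerate(values):
        panel = copy.deepcopy(template)  # leave the template untouched
        panel["id"] = template["id"] + index
        panel["title"] = template["title"].replace(placeholder, value)
        for target in panel["targets"]:
            target["expr"] = target["expr"].replace(placeholder, value)
        panel["gridPos"] = {
            "x": (index % per_row) * width,
            "y": (index // per_row) * height,
            "w": width,
            "h": height,
        }
        panels.append(panel)
    return panels


template = {
    "id": 10,
    "title": "CPU on $instance",
    "targets": [{"refId": "A", "expr": 'rate(node_cpu_seconds_total{instance="$instance"}[5m])'}],
}
panels = repeat_panel(template, "instance", ["a", "b", "c", "d"])
```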
Annotations
Mark events on time-series panels (deploys, incidents, config changes):

    # Query annotation source
    ALERTS{alertname="HighErrorRate"}

Or add manual annotations by clicking on the graph and writing a note.
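Annotations can also be created from a deploy script through Grafana's HTTP API (`POST /api/annotations`). The sketch below only builds the request and does not send it. The base URL, token placeholder, and annotation text are hypothetical, and the payload is limited to the `time`, `tags`, and `text` fields, so check the API reference for your version before relying on it.

```python
import json
import time
import urllib.request


def annotation_request(base_url, token, text, tags, when=None):
    """Build (but do not send) a POST request that creates an annotation."""
    timestamp = when if when is not None else time.time()
    payload = {
        "time": int(timestamp * 1000),  # epoch milliseconds
        "tags": list(tags),
        "text": text,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values; a real script would read the token from a secret store.
req = annotation_request(
    "http://grafana:3000", "<api-token>", "Deployed v1.2.3", ["deploy"], when=1_700_000_000
)
# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
```

Tagging deploy annotations consistently (here `deploy`) lets a dashboard show them through a tag-based annotation query.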
Sharing and Exporting
- Share link: Direct URL with current time range and variables.
- Snapshot: Static copy of the dashboard (no live data).
- Export JSON: Full dashboard definition for version control.
- Embed panel: iframe embed for external pages.
- PDF/PNG: Via Grafana Image Renderer plugin.
Dashboard Design Patterns
Well-designed dashboards answer questions quickly. Poorly designed ones become a “wall of graphs” that nobody reads. These patterns help you build dashboards that are actually useful.
The USE Method (Infrastructure)
For every resource (CPU, memory, disk, network), show three things:
| Signal | Meaning | Example Panel |
|---|---|---|
| Utilization | How busy is it? (%) | node_cpu_seconds_total → CPU usage % |
| Saturation | How overloaded is it? (queue depth) | node_load1 → load average |
| Errors | Is it failing? | node_network_receive_errs_total → network receive errors |
Layout: One row per resource, three panels per row.
The RED Method (Services)
For every service (API, microservice), show three things:
| Signal | Meaning | Example Panel |
|---|---|---|
| Rate | Requests per second | rate(http_requests_total[5m]) |
| Errors | Error rate (% or count) | rate(http_requests_total{status=~"5.."}[5m]) |
| Duration | Latency (p50, p95, p99) | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) |
Layout: One row per service, three panels per row. This is the most common pattern for microservice dashboards.
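One way to keep the three RED panels consistent across services is to generate their queries from a single helper. The sketch below assumes each service is selected by a `job` label and uses the metric names from the table above; both are assumptions about your instrumentation. It also expresses errors as a ratio of 5xx responses to all requests, and aggregates buckets by `le` so the quantile covers the whole service rather than each series.

```python
def red_queries(job, window="5m"):
    """Return PromQL for the Rate, Errors, and Duration panels of one service."""
    selector = f'job="{job}"'
    requests = f"http_requests_total{{{selector}}}"
    errors = f'http_requests_total{{{selector},status=~"5.."}}'
    buckets = f"http_request_duration_seconds_bucket{{{selector}}}"
    return {
        "rate": f"sum(rate({requests}[{window}]))",
        # Ratio of failed requests to all requests (multiply by 100 for a percentage).
        "errors": f"sum(rate({errors}[{window}])) / sum(rate({requests}[{window}]))",
        "duration": f"histogram_quantile(0.95, sum by (le) (rate({buckets}[{window}])))",
    }


for name, expr in red_queries("api").items():
    print(f"{name}: {expr}")
```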
The Four Golden Signals (Google SRE)
Google’s SRE book recommends monitoring these for every user-facing system:
| Signal | What to Measure |
|---|---|
| Latency | Time to serve a request (separate success vs error latency) |
| Traffic | Requests per second |
| Errors | Rate of failed requests |
| Saturation | How “full” the service is (CPU, memory, queue depth) |
RED covers the first three; add a saturation panel (CPU/memory of the service pods) for the fourth.
Dashboard Layout Patterns
Overview → Detail (Drill-Down):

    ┌─────────────────────────────────────────────────┐
    │ Row 1: Key stats (stat panels)                  │
    │  [Total RPS] [Error %] [p99 Latency] [Pods]     │
    ├─────────────────────────────────────────────────┤
    │ Row 2: Time series (trends)                     │
    │  [Request rate over time] [Error rate]          │
    ├─────────────────────────────────────────────────┤
    │ Row 3: Per-instance breakdown                   │
    │  [Latency by pod] [CPU by pod]                  │
    ├─────────────────────────────────────────────────┤
    │ Row 4: Logs (Loki panel)                        │
    │  [Recent errors from Loki]                      │
    └─────────────────────────────────────────────────┘

This pattern gives you the summary at the top and lets you scroll down for detail.
Service Map (Multi-Service):
Create a dashboard per service (using the RED method), then link them:
- A “Platform Overview” dashboard shows all services as stat panels.
- Clicking a service stat links to that service’s detailed dashboard.
- Use Grafana’s Dashboard Links and pass variables.
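Passing variables between dashboards comes down to URL query parameters of the form `var-<name>=<value>`. A minimal sketch of building such a link follows; the base URL, dashboard `uid`, slug, and variable values are all hypothetical.

```python
from urllib.parse import urlencode


def dashboard_link(base_url, uid, slug, variables, time_from="now-1h", time_to="now"):
    """Build a dashboard URL that pre-selects variables and a time range."""
    params = [(f"var-{name}", value) for name, value in variables.items()]
    params += [("from", time_from), ("to", time_to)]
    return f"{base_url.rstrip('/')}/d/{uid}/{slug}?{urlencode(params)}"


link = dashboard_link(
    "http://grafana:3000",
    "service-health",
    "service-health",
    {"service": "checkout", "namespace": "shop"},
)
print(link)
```

A stat panel on the overview dashboard could use a link of this shape so that clicking a service opens its detail dashboard with the same service and time range selected.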
Panel Design Tips
| Tip | Why |
|---|---|
| Put stat panels at the top | Instant overview of current state |
| Use thresholds and colors | Green/yellow/red makes problems visible without reading numbers |
| Label axes | “Requests per second” not just “rate” |
| Set meaningful Y-axis limits | Don’t auto-scale from 0.001 to 0.002 — it looks like a crisis |
| Use the right unit | Grafana supports reqps, bytes, percent, seconds, etc. |
| Add descriptions to panels | Hover-text explaining what the panel shows and what “bad” looks like |
| Collapse rows | Group related panels; default-collapse less important sections |
| Limit to 10–15 panels | More than that = information overload |
Anti-Patterns to Avoid
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Wall of graphs | 30+ panels, no hierarchy | Use rows, collapse, and a summary row at top |
| No variables | Separate dashboard per environment | Add $environment, $namespace, $service variables |
| Raw metric names as titles | “node_cpu_seconds_total” means nothing to on-call | Use human-readable titles: “CPU Usage (%)” |
| Default time range too wide | 7-day view hides the last-10-minute spike | Set default to “Last 1 hour” for operational dashboards |
| No alerting link | Dashboard shows a problem but no way to see related alerts | Add an Alert List panel or link to alert rules |
| Mixing audiences | Dev metrics + business metrics on one dashboard | Separate: “Service Health” (ops) vs “Business KPIs” (product) |
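Some of these anti-patterns can be caught mechanically from the exported JSON. The following is a sketch of a small linter, assuming the usual `panels`, `templating.list`, `title`, and `description` fields. The checks, the threshold of 15, and the helper name `lint_dashboard` are illustrative choices, and panels nested inside collapsed rows are not inspected.

```python
import re

MAX_PANELS = 15
# Lowercase words joined by underscores, e.g. node_cpu_seconds_total.
RAW_METRIC = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def lint_dashboard(dashboard):
    """Return a list of human-readable warnings for common anti-patterns."""
    warnings = []
    panels = [p for p in dashboard.get("panels", []) if p.get("type") != "row"]
    if len(panels) > MAX_PANELS:
        warnings.append(f"{len(panels)} panels (more than {MAX_PANELS})")
    if not dashboard.get("templating", {}).get("list"):
        warnings.append("no variables defined")
    for panel in panels:
        title = panel.get("title", "")
        if RAW_METRIC.match(title):
            warnings.append(f"panel title looks like a raw metric name: {title}")
        if not panel.get("description"):
            warnings.append(f"panel has no description: {title or panel.get('id')}")
    return warnings
```

Run against every file in the dashboards directory, a check like this can flag problems in code review before a dashboard reaches on-call.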
Dashboard-as-Code
Store dashboards in Git and provision them automatically:
- Export dashboard JSON from Grafana UI.
- Parameterize data source names using ${DS_PROMETHEUS} variables.
- Commit to a dashboards/ directory in your repo.
- Use Grafana provisioning or a Kubernetes ConfigMap to load on startup.
    # Kubernetes ConfigMap for dashboard provisioning
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: grafana-dashboards
      labels:
        grafana_dashboard: "1"  # Grafana sidecar picks this up
    data:
      service-health.json: |
        { ... exported dashboard JSON ... }

The kube-prometheus-stack Helm chart’s Grafana sidecar auto-discovers ConfigMaps with the grafana_dashboard label and loads them.
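Rather than pasting exported JSON into a manifest by hand, the ConfigMap can be generated. This sketch wraps dashboards in a manifest with the same label as above and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name `dashboard_configmap` and the sample dashboard are hypothetical.

```python
import json


def dashboard_configmap(name, dashboards, label="grafana_dashboard"):
    """Wrap {filename: dashboard dict} in a ConfigMap manifest the sidecar can discover."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "labels": {label: "1"}},
        # ConfigMap values must be strings, so each dashboard is serialized.
        "data": {
            filename: json.dumps(dashboard, indent=2, sort_keys=True)
            for filename, dashboard in dashboards.items()
        },
    }


manifest = dashboard_configmap(
    "grafana-dashboards",
    {"service-health.json": {"uid": "service-health", "title": "Service Health", "panels": []}},
)
print(json.dumps(manifest, indent=2))
```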
Key Takeaways
- Grafana connects to data sources; it doesn’t store data itself.
- Use variables to make dashboards dynamic (environment, host, namespace dropdowns).
- Provision data sources and dashboards from files for config-as-code deployments.
- Choose the right panel type: time series for trends, stat for KPIs, heatmap for distributions, table for top-N lists.
- Export dashboards as JSON and commit to Git — treat dashboards as code.
- Use the RED method (Rate, Errors, Duration) for service dashboards and the USE method (Utilization, Saturation, Errors) for infrastructure dashboards.
- Design dashboards with a summary row at top, details below, and 10–15 panels max to avoid information overload.