
Alerting

By Atif Alam

Alerting closes the loop: metrics tell you something is wrong, and alerts notify the right people before users notice.

Alert rules are defined in Prometheus and evaluated against live metrics:

rules/alerts.yml

groups:
- name: http_alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
      > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate ({{ $value | humanizePercentage }})"
      description: "More than 5% of requests have been failing for the last 5 minutes."
  - alert: HighLatency
    expr: |
      histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
      > 1.0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "95th percentile latency above 1s"
  • expr — PromQL expression. When it returns results, the alert fires.
  • for — How long the condition must be true before the alert fires (avoids flapping).
  • labels — Metadata for routing (severity, team, service).
  • annotations — Human-readable description with template variables.
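The humanizePercentage filter in the summary annotation formats a ratio as a human-friendly percentage. As a rough illustration of what it produces (a Python sketch mimicking the behavior, not Prometheus's own Go template function):

```python
def humanize_percentage(value: float) -> str:
    """Sketch of Prometheus's humanizePercentage template function:
    multiply by 100, keep a few significant digits, append '%'."""
    return f"{value * 100:.4g}%"

print(humanize_percentage(0.073))  # 7.3%
print(humanize_percentage(0.05))   # 5%
```

So an alert expression value of 0.073 renders in the notification as "High error rate (7.3%)".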

Load rules in prometheus.yml:

rule_files:
  - "rules/*.yml"
An alert rule moves through three states:

| State | Meaning |
| --- | --- |
| Inactive | Expression returns nothing — all clear |
| Pending | Expression is true, but the for duration hasn't elapsed yet |
| Firing | Expression has been true for the for duration — alert is sent to Alertmanager |
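The state transitions above can be sketched as a small simulation: the expression must stay true for the whole for duration before the alert moves from pending to firing. (An illustrative Python sketch, not Prometheus internals; the sample values and 60s interval are assumptions.)

```python
from enum import Enum

class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"

def evaluate(samples, threshold, for_seconds, interval=60):
    """Walk one sample per evaluation interval and return the alert's
    state after the last evaluation, mirroring the inactive ->
    pending -> firing lifecycle."""
    state = State.INACTIVE
    true_since = None  # time at which the expression first became true
    now = 0
    for value in samples:
        now += interval
        if value > threshold:
            if true_since is None:
                true_since = now
            state = State.FIRING if now - true_since >= for_seconds else State.PENDING
        else:
            true_since = None
            state = State.INACTIVE
    return state

# Error rate above 5% for only two 1-minute evaluations: still pending.
print(evaluate([0.01, 0.08, 0.09], threshold=0.05, for_seconds=300))  # State.PENDING
# Above 5% for six-plus minutes: firing.
print(evaluate([0.08] * 7, threshold=0.05, for_seconds=300))          # State.FIRING
```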

Alertmanager receives alerts from Prometheus and handles routing, grouping, silencing, and delivering notifications.

alertmanager.yml

global:
  resolve_timeout: 5m

route:
  receiver: default-slack
  group_by: [alertname, severity]
  group_wait: 30s       # wait before sending the first notification
  group_interval: 5m    # wait between notifications for new alerts in the same group
  repeat_interval: 4h   # re-send if alert is still firing
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings
    - match_re:
        service: "db|redis"
      receiver: database-team

receivers:
  - name: default-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "your-pagerduty-key"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#warnings"
  - name: database-team
    email_configs:
      - to: "db-team@example.com"
An alert arrives with labels {alertname="HighErrorRate", severity="critical", service="api"}:

Root route: group_by [alertname, severity]
├── match severity=critical → pagerduty-critical ✓
├── match severity=warning → slack-warnings
└── match_re service=db|redis → database-team

The first matching route wins (routes are evaluated top to bottom).
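That first-match-wins logic can be sketched in a few lines (an illustrative Python sketch; real Alertmanager also supports nested routes and the continue flag, which this ignores):

```python
import re

def pick_receiver(alert_labels, routes, default_receiver):
    """Return the receiver for an alert: routes are tried top to bottom,
    the first one whose matchers all succeed wins, and the root receiver
    is the fallback."""
    for route in routes:
        exact = route.get("match", {})
        regex = route.get("match_re", {})
        if all(alert_labels.get(k) == v for k, v in exact.items()) and \
           all(re.fullmatch(v, alert_labels.get(k, "")) for k, v in regex.items()):
            return route["receiver"]
    return default_receiver

# The routing tree from alertmanager.yml above:
routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    {"match_re": {"service": "db|redis"}, "receiver": "database-team"},
]

alert = {"alertname": "HighErrorRate", "severity": "critical", "service": "api"}
print(pick_receiver(alert, routes, "default-slack"))  # pagerduty-critical
```

Note that a critical alert for service "db" still goes to pagerduty-critical, because the severity route is listed first.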

Alerts with the same group_by labels are combined into a single notification:

group_by: [alertname, severity]

If 10 instances fire HighErrorRate, you get one Slack message listing all 10, not 10 separate messages.
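Grouping amounts to bucketing alerts by the values of the group_by labels (an illustrative Python sketch of the bucketing, not Alertmanager internals):

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bundle alerts that share the same values for the group_by labels,
    the way Alertmanager turns N related alerts into one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((label, alert.get(label)) for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# Ten instances all firing HighErrorRate at critical severity:
alerts = [{"alertname": "HighErrorRate", "severity": "critical", "instance": f"api-{i}"}
          for i in range(10)]
groups = group_alerts(alerts, ["alertname", "severity"])
print(len(groups))                        # 1  -> one notification
print(len(next(iter(groups.values()))))   # 10 -> listing all ten instances
```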

Temporarily suppress alerts (e.g. during maintenance):

# Via the Alertmanager UI (http://alertmanager:9093/#/silences)
# Or from the command line with amtool, which talks to the same API:
amtool silence add alertname=HighErrorRate --duration=2h --comment="Planned maintenance"
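The same silence can be created by POSTing JSON to Alertmanager's v2 API (POST to /api/v2/silences). A sketch of the request body, assuming the v2 silence schema; the createdBy address is a placeholder:

```python
import json
from datetime import datetime, timedelta, timezone

# Equivalent of the amtool command above, as an API payload.
now = datetime.now(timezone.utc)
silence = {
    "matchers": [
        {"name": "alertname", "value": "HighErrorRate", "isRegex": False}
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=2)).isoformat(),
    "createdBy": "ops@example.com",       # placeholder
    "comment": "Planned maintenance",
}
print(json.dumps(silence, indent=2))
```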

Suppress lower-priority alerts when a higher-priority alert is firing:

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]

If HighErrorRate is firing as critical on an instance, the matching warning alert for that same instance is suppressed.
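In other words: a target alert is dropped when some source alert is firing and the two agree on every label in equal. An illustrative Python sketch of that rule (not Alertmanager internals):

```python
def apply_inhibition(alerts, rule):
    """Drop target alerts when a matching source alert is firing and
    the `equal` labels agree."""
    sources = [a for a in alerts
               if all(a.get(k) == v for k, v in rule["source_match"].items())]
    kept = []
    for alert in alerts:
        is_target = all(alert.get(k) == v for k, v in rule["target_match"].items())
        inhibited = is_target and any(
            all(alert.get(k) == s.get(k) for k in rule["equal"]) for s in sources)
        if not inhibited:
            kept.append(alert)
    return kept

rule = {"source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["alertname", "instance"]}

alerts = [
    {"alertname": "HighErrorRate", "severity": "critical", "instance": "api-1"},
    {"alertname": "HighErrorRate", "severity": "warning", "instance": "api-1"},
    {"alertname": "HighLatency", "severity": "warning", "instance": "api-1"},
]
for a in apply_inhibition(alerts, rule):
    print(a["alertname"], a["severity"])
# HighErrorRate critical
# HighLatency warning
```

The warning HighErrorRate on api-1 is suppressed; the unrelated HighLatency warning still goes out.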


Grafana has its own alerting system that can query any data source (not just Prometheus):

  1. Go to Alerting → Alert Rules → New Alert Rule.
  2. Choose a data source and write a query.
  3. Set conditions (e.g. “when avg() of query A is above 0.05”).
  4. Set evaluation interval and pending period.
  5. Add labels and annotations.
  6. Assign to a notification policy.

Where notifications go:

  • Slack, PagerDuty, OpsGenie, email, webhook, Microsoft Teams, Telegram, etc.

Routing rules (similar to Alertmanager):

Default policy → #alerts Slack channel
├── severity=critical → PagerDuty
└── team=database → db-team@example.com

When to Use Grafana Alerting vs Alertmanager

|  | Alertmanager | Grafana Alerting |
| --- | --- | --- |
| Data source | Prometheus only | Any Grafana data source |
| Config | YAML files (GitOps-friendly) | UI or provisioning API |
| Grouping/routing | Very flexible | Good, improving |
| Integration | Native with Prometheus | Works with any backend |

Many teams use both: Alertmanager for Prometheus metrics, Grafana alerting for Loki logs or multi-source alerts.


# Good: alert on what the user experiences
- alert: HighErrorRate
  expr: error_rate > 0.05

# Avoid: alert on the internal cause (too noisy, might not affect users)
- alert: HighCpuUsage
  expr: node_cpu_usage > 0.90

High CPU doesn’t always mean a problem. High error rate always does.
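The error_rate shorthand used above isn't a built-in metric; it would typically be defined as a Prometheus recording rule built from the same ratio as the HighErrorRate alert, for example:

```yaml
groups:
  - name: recording_rules
    rules:
      - record: error_rate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
```

Recording rules also keep the alert expression cheap, since the ratio is precomputed at every evaluation interval.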

for: 5m # good — avoids brief spikes triggering alerts
for: 0s # bad — fires on every momentary blip
for: 1h # bad — too slow to be useful
| Severity | Meaning | Action |
| --- | --- | --- |
| critical | User-facing impact right now | Page on-call (PagerDuty) |
| warning | Degraded but not broken, or will become critical soon | Slack notification |
| info | FYI, no action needed | Dashboard or log only |

Every alert should have a clear answer to: “What do I do when this fires?”

Include in annotations:

annotations:
  summary: "Disk space below 10% on {{ $labels.instance }}"
  runbook_url: "https://wiki.example.com/runbooks/disk-space"
  description: "Current: {{ $value | humanizePercentage }}. Check /var/log for large files."
  • Too many alerts → People ignore them all. Only alert on things that need human action.
  • Group related alerts → Use group_by to bundle similar alerts.
  • Use inhibition → Don’t page for warnings when critical is already firing.
  • Review alerts quarterly → Delete alerts that never fire or always fire.
  • Prometheus alert rules fire when PromQL expressions are true for a for duration.
  • Alertmanager routes, groups, and delivers notifications. Configure routing by severity/team/service.
  • Grafana alerting can query any data source — use it for logs, multi-source alerts, or non-Prometheus backends.
  • Alert on symptoms (error rate, latency), not causes (CPU, memory).
  • Every alert should be actionable — include a runbook URL and clear description.
  • Fight alert fatigue with grouping, inhibition, appropriate severity levels, and regular reviews.