# Alerting
Alerting closes the loop: metrics tell you something is wrong, and alerts notify the right people before users notice.
## Prometheus Alert Rules

Alert rules are defined in Prometheus and evaluated against live metrics:
```yaml
groups:
  - name: http_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "More than 5% of requests are failing for the last 5 minutes."

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency above 1s"
```

- `expr` — PromQL expression. When it returns results, the alert fires.
- `for` — How long the condition must be true before the alert fires (avoids flapping).
- `labels` — Metadata for routing (severity, team, service).
- `annotations` — Human-readable description with template variables.
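The arithmetic behind the HighErrorRate expression can be sketched in plain Python: divide the rate of 5xx responses by the total request rate. This is a hypothetical `error_ratio` helper for illustration, not Prometheus code; the status rates are made-up values.

```python
# Hypothetical sketch of the HighErrorRate check: the PromQL expression
# divides the 5xx request rate by the total request rate over the window.
def error_ratio(status_rates):
    """status_rates maps HTTP status codes to per-second request rates."""
    errors = sum(rate for status, rate in status_rates.items() if status.startswith("5"))
    total = sum(status_rates.values())
    return errors / total if total else 0.0

rates = {"200": 90.0, "404": 4.0, "500": 6.0}   # made-up per-second rates
ratio = error_ratio(rates)
print(round(ratio, 2))   # 0.06
print(ratio > 0.05)      # True — the alert condition would hold
```

Note that the real expression uses `rate(...[5m])`, so brief spikes are already smoothed over the five-minute window before the threshold is applied.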
Load rules in `prometheus.yml`:

```yaml
rule_files:
  - "rules/*.yml"
```

## Alert States

| State | Meaning |
|---|---|
| Inactive | Expression returns nothing — all clear |
| Pending | Expression is true, but the `for` duration hasn’t elapsed yet |
| Firing | Expression has been true for the `for` duration — alert is sent to Alertmanager |
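The three states can be sketched as a small function of when the expression first became true. This is a hypothetical illustration of the transitions, not Prometheus internals:

```python
# Hypothetical sketch of the alert state machine: inactive → pending when the
# expression first returns results, firing once it has held for the full
# `for` duration (times in seconds).
def alert_state(true_since, now, for_seconds):
    if true_since is None:
        return "inactive"
    if now - true_since >= for_seconds:
        return "firing"
    return "pending"

print(alert_state(None, 100, 300))  # inactive — expression returns nothing
print(alert_state(0, 100, 300))    # pending — true for 100s of a 300s window
print(alert_state(0, 400, 300))    # firing — true for the full duration
```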
## Alertmanager

Alertmanager receives alerts from Prometheus and handles routing, grouping, silencing, and delivering notifications.
### Configuration

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: default-slack
  group_by: [alertname, severity]
  group_wait: 30s      # wait before sending the first notification
  group_interval: 5m   # wait between notifications for new alerts in the same group
  repeat_interval: 4h  # re-send if alert is still firing

  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical

    - match:
        severity: warning
      receiver: slack-warnings

    - match_re:
        service: "db|redis"
      receiver: database-team

receivers:
  - name: default-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "your-pagerduty-key"

  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#warnings"

  - name: database-team
    email_configs:
      - to: "db-team@example.com"
```

### Routing Flow
```text
Alert comes in with labels:
  {alertname="HighErrorRate", severity="critical", service="api"}
        │
        ▼
Root route: group_by [alertname, severity]
        │
        ├── match severity=critical   → pagerduty-critical ✓
        ├── match severity=warning    → slack-warnings
        └── match_re service=db|redis → database-team
```

The first matching route wins (routes are evaluated top to bottom).
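Top-to-bottom, first-match-wins evaluation can be sketched as follows. The route list mirrors the config above; `pick_receiver` is a hypothetical helper for illustration, not Alertmanager code:

```python
import re

# (label matchers, regex?, receiver) — mirrors the routes in the config above
routes = [
    ({"severity": "critical"}, False, "pagerduty-critical"),
    ({"severity": "warning"}, False, "slack-warnings"),
    ({"service": "db|redis"}, True, "database-team"),
]

def pick_receiver(labels, default="default-slack"):
    for matchers, is_regex, receiver in routes:
        if all(
            re.fullmatch(pattern, labels.get(name, "")) if is_regex
            else labels.get(name) == pattern
            for name, pattern in matchers.items()
        ):
            return receiver  # first matching route wins
    return default           # fall back to the root receiver

alert = {"alertname": "HighErrorRate", "severity": "critical", "service": "api"}
print(pick_receiver(alert))  # pagerduty-critical
```

A critical database alert would also stop at the first route: `severity=critical` matches before the `service` regex is ever tried.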
### Grouping

Alerts with the same `group_by` labels are combined into a single notification:

```yaml
group_by: [alertname, severity]
```

If 10 instances fire HighErrorRate, you get one Slack message listing all 10, not 10 separate messages.
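The grouping key is just the tuple of `group_by` label values. A hypothetical sketch (not Alertmanager internals):

```python
# Hypothetical sketch of grouping: alerts sharing the same group_by label
# values collapse into one notification per group.
from collections import defaultdict

group_by = ["alertname", "severity"]

def group_alerts(alerts):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label) for label in group_by)
        groups[key].append(alert)
    return groups

alerts = [
    {"alertname": "HighErrorRate", "severity": "critical", "instance": f"web-{i}"}
    for i in range(10)
]
groups = group_alerts(alerts)
print(len(groups))                       # 1 — one notification for all 10 instances
print(len(next(iter(groups.values()))))  # 10 — every firing instance listed in it
```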
### Silences

Temporarily suppress alerts (e.g. during maintenance):

```bash
# Via the Alertmanager UI (http://alertmanager:9093/#/silences)
# Or via the API:
amtool silence add alertname=HighErrorRate --duration=2h --comment="Planned maintenance"
```

### Inhibition
Suppress lower-priority alerts when a higher-priority alert is firing:

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
```

If HighErrorRate is firing as critical, the warning version is suppressed.
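The rule above reads: suppress a target alert when some firing source alert matches `source_match` and agrees with the target on every label in `equal`. A hypothetical sketch of that check (not Alertmanager code):

```python
# Hypothetical sketch of inhibition: a target alert is suppressed when a
# source alert is firing and both agree on every label listed in `equal`.
def is_inhibited(target, firing, source_match, target_match, equal):
    if any(target.get(k) != v for k, v in target_match.items()):
        return False  # rule doesn't apply to this alert at all
    return any(
        all(alert.get(k) == v for k, v in source_match.items())
        and all(alert.get(label) == target.get(label) for label in equal)
        for alert in firing
    )

critical = {"alertname": "HighErrorRate", "severity": "critical", "instance": "web-1"}
warning = {"alertname": "HighErrorRate", "severity": "warning", "instance": "web-1"}
print(is_inhibited(warning, [critical],
                   {"severity": "critical"}, {"severity": "warning"},
                   ["alertname", "instance"]))  # True — the warning is suppressed
```

A warning on a different instance would not be suppressed, because the `instance` label in `equal` would not match.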
## Grafana Alerting (Unified Alerting)

Grafana has its own alerting system that can query any data source (not just Prometheus):
### Alert Rules in Grafana

1. Go to Alerting → Alert Rules → New Alert Rule.
2. Choose a data source and write a query.
3. Set conditions (e.g. “when avg() of query A is above 0.05”).
4. Set the evaluation interval and pending period.
5. Add labels and annotations.
6. Assign to a notification policy.
### Contact Points

Where notifications go:
- Slack, PagerDuty, OpsGenie, email, webhook, Microsoft Teams, Telegram, etc.
### Notification Policies

Routing rules (similar to Alertmanager):

```text
Default policy → #alerts Slack channel
  └── severity=critical → PagerDuty
  └── team=database     → db-team@example.com
```

## When to Use Grafana Alerting vs Alertmanager
|  | Alertmanager | Grafana Alerting |
|---|---|---|
| Data source | Prometheus only | Any Grafana data source |
| Config | YAML files (GitOps-friendly) | UI or provisioning API |
| Grouping/routing | Very flexible | Good, improving |
| Integration | Native with Prometheus | Works with any backend |
Many teams use both: Alertmanager for Prometheus metrics, Grafana alerting for Loki logs or multi-source alerts.
## Alert Design Best Practices

### Alert on Symptoms, Not Causes

```yaml
# Good: alert on what the user experiences
- alert: HighErrorRate
  expr: error_rate > 0.05

# Avoid: alert on internal cause (too noisy, might not affect users)
- alert: HighCpuUsage
  expr: node_cpu_usage > 0.90
```

High CPU doesn’t always mean a problem. A high error rate always does.
### Set Meaningful `for` Durations

```yaml
for: 5m   # good — avoids brief spikes triggering alerts
for: 0s   # bad — fires on every momentary blip
for: 1h   # bad — too slow to be useful
```

### Use Severity Levels
| Severity | Meaning | Action |
|---|---|---|
| `critical` | User-facing impact right now | Page on-call (PagerDuty) |
| `warning` | Degraded but not broken, or will become critical soon | Slack notification |
| `info` | FYI, no action needed | Dashboard or log only |
### Keep Alerts Actionable

Every alert should have a clear answer to: “What do I do when this fires?”

Include in annotations:

```yaml
annotations:
  summary: "Disk space below 10% on {{ $labels.instance }}"
  runbook_url: "https://wiki.example.com/runbooks/disk-space"
  description: "Current: {{ $value | humanizePercentage }}. Check /var/log for large files."
```

### Avoid Alert Fatigue
- Too many alerts → people ignore them all. Only alert on things that need human action.
- Group related alerts → use `group_by` to bundle similar alerts.
- Use inhibition → don’t page for warnings when a critical alert is already firing.
- Review alerts quarterly → delete alerts that never fire or always fire.
## Key Takeaways

- Prometheus alert rules fire when PromQL expressions are true for the `for` duration.
- Alertmanager routes, groups, and delivers notifications. Configure routing by severity/team/service.
- Grafana alerting can query any data source — use it for logs, multi-source alerts, or non-Prometheus backends.
- Alert on symptoms (error rate, latency), not causes (CPU, memory).
- Every alert should be actionable — include a runbook URL and clear description.
- Fight alert fatigue with grouping, inhibition, appropriate severity levels, and regular reviews.
## Related

- SLOs, SLIs, and error budgets — tie alerts to agreed targets and budget burn.
- Incident response and on-call — escalation, handoff, and sustainable paging.