# Alerting
Alerting closes the loop: metrics tell you something is wrong, and alerts notify the right people before users notice.
## Prometheus Alert Rules

Alert rules are defined in Prometheus and evaluated against live metrics:
```yaml
groups:
  - name: http_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "More than 5% of requests are failing for the last 5 minutes."

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency above 1s"
```

- `expr` — PromQL expression. When it returns results, the alert fires.
- `for` — How long the condition must be true before the alert fires (avoids flapping).
- `labels` — Metadata for routing (severity, team, service).
- `annotations` — Human-readable description with template variables.
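The arithmetic behind the HighErrorRate expression can be sketched in plain Python: divide the rate of 5xx responses by the total request rate. This is a hypothetical `error_ratio` helper for illustration, not Prometheus code; the status rates are made-up values.

```python
# Hypothetical sketch of the HighErrorRate check: the PromQL expression
# divides the 5xx request rate by the total request rate over the window.
def error_ratio(status_rates):
    """status_rates maps HTTP status codes to per-second request rates."""
    errors = sum(rate for status, rate in status_rates.items() if status.startswith("5"))
    total = sum(status_rates.values())
    return errors / total if total else 0.0

rates = {"200": 90.0, "404": 4.0, "500": 6.0}   # made-up per-second rates
ratio = error_ratio(rates)
print(round(ratio, 2))   # 0.06
print(ratio > 0.05)      # True — the alert condition would hold
```

Note that the real expression uses `rate(...[5m])`, so brief spikes are already smoothed over the five-minute window before the threshold is applied.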
Load rules in `prometheus.yml`:

```yaml
rule_files:
  - "rules/*.yml"
```

## Alert States

| State | Meaning |
|---|---|
| Inactive | Expression returns nothing — all clear |
| Pending | Expression is true, but the `for` duration hasn’t elapsed yet |
| Firing | Expression has been true for the `for` duration — alert is sent to Alertmanager |
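The three states can be sketched as a small function of when the expression first became true. This is a hypothetical illustration of the transitions, not Prometheus internals:

```python
# Hypothetical sketch of the alert state machine: inactive → pending when the
# expression first returns results, firing once it has held for the full
# `for` duration (times in seconds).
def alert_state(true_since, now, for_seconds):
    if true_since is None:
        return "inactive"
    if now - true_since >= for_seconds:
        return "firing"
    return "pending"

print(alert_state(None, 100, 300))  # inactive — expression returns nothing
print(alert_state(0, 100, 300))    # pending — true for 100s of a 300s window
print(alert_state(0, 400, 300))    # firing — true for the full duration
```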
## Alertmanager

Alertmanager receives alerts from Prometheus and handles routing, grouping, silencing, and delivering notifications.
### Configuration

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: default-slack
  group_by: [alertname, severity]
  group_wait: 30s      # wait before sending the first notification
  group_interval: 5m   # wait between notifications for new alerts in the same group
  repeat_interval: 4h  # re-send if alert is still firing

  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical

    - match:
        severity: warning
      receiver: slack-warnings

    - match_re:
        service: "db|redis"
      receiver: database-team

receivers:
  - name: default-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "your-pagerduty-key"

  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#warnings"

  - name: database-team
    email_configs:
      - to: "db-team@example.com"
```

### Routing Flow
```text
Alert comes in with labels:
  {alertname="HighErrorRate", severity="critical", service="api"}
        │
        ▼
Root route: group_by [alertname, severity]
        │
        ├── match severity=critical   → pagerduty-critical ✓
        ├── match severity=warning    → slack-warnings
        └── match_re service=db|redis → database-team
```

The first matching route wins (routes are evaluated top to bottom).
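Top-to-bottom, first-match-wins evaluation can be sketched as follows. The route list mirrors the config above; `pick_receiver` is a hypothetical helper for illustration, not Alertmanager code:

```python
import re

# (label matchers, regex?, receiver) — mirrors the routes in the config above
routes = [
    ({"severity": "critical"}, False, "pagerduty-critical"),
    ({"severity": "warning"}, False, "slack-warnings"),
    ({"service": "db|redis"}, True, "database-team"),
]

def pick_receiver(labels, default="default-slack"):
    for matchers, is_regex, receiver in routes:
        if all(
            re.fullmatch(pattern, labels.get(name, "")) if is_regex
            else labels.get(name) == pattern
            for name, pattern in matchers.items()
        ):
            return receiver  # first matching route wins
    return default           # fall back to the root receiver

alert = {"alertname": "HighErrorRate", "severity": "critical", "service": "api"}
print(pick_receiver(alert))  # pagerduty-critical
```

A critical database alert would also stop at the first route: `severity=critical` matches before the `service` regex is ever tried.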
### Grouping

Alerts with the same `group_by` labels are combined into a single notification:

```yaml
group_by: [alertname, severity]
```

If 10 instances fire HighErrorRate, you get one Slack message listing all 10, not 10 separate messages.
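The grouping key is just the tuple of `group_by` label values. A hypothetical sketch (not Alertmanager internals):

```python
# Hypothetical sketch of grouping: alerts sharing the same group_by label
# values collapse into one notification per group.
from collections import defaultdict

group_by = ["alertname", "severity"]

def group_alerts(alerts):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label) for label in group_by)
        groups[key].append(alert)
    return groups

alerts = [
    {"alertname": "HighErrorRate", "severity": "critical", "instance": f"web-{i}"}
    for i in range(10)
]
groups = group_alerts(alerts)
print(len(groups))                       # 1 — one notification for all 10 instances
print(len(next(iter(groups.values()))))  # 10 — every firing instance listed in it
```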
### Silences

Temporarily suppress alerts (e.g. during maintenance):

```bash
# Via the Alertmanager UI (http://alertmanager:9093/#/silences)
# Or via the API:
amtool silence add alertname=HighErrorRate --duration=2h --comment="Planned maintenance"
```

### Inhibition
Suppress lower-priority alerts when a higher-priority alert is firing:

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
```

If HighErrorRate is firing as critical, the warning version is suppressed.
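The rule above reads: suppress a target alert when some firing source alert matches `source_match` and agrees with the target on every label in `equal`. A hypothetical sketch of that check (not Alertmanager code):

```python
# Hypothetical sketch of inhibition: a target alert is suppressed when a
# source alert is firing and both agree on every label listed in `equal`.
def is_inhibited(target, firing, source_match, target_match, equal):
    if any(target.get(k) != v for k, v in target_match.items()):
        return False  # rule doesn't apply to this alert at all
    return any(
        all(alert.get(k) == v for k, v in source_match.items())
        and all(alert.get(label) == target.get(label) for label in equal)
        for alert in firing
    )

critical = {"alertname": "HighErrorRate", "severity": "critical", "instance": "web-1"}
warning = {"alertname": "HighErrorRate", "severity": "warning", "instance": "web-1"}
print(is_inhibited(warning, [critical],
                   {"severity": "critical"}, {"severity": "warning"},
                   ["alertname", "instance"]))  # True — the warning is suppressed
```

A warning on a different instance would not be suppressed, because the `instance` label in `equal` would not match.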
## Grafana Alerting (Unified Alerting)

Grafana has its own alerting system that can query any data source (not just Prometheus):
### Alert Rules in Grafana

1. Go to Alerting → Alert Rules → New Alert Rule.
2. Choose a data source and write a query.
3. Set conditions (e.g. “when avg() of query A is above 0.05”).
4. Set the evaluation interval and pending period.
5. Add labels and annotations.
6. Assign to a notification policy.
### Contact Points

Where notifications go:
- Slack, PagerDuty, OpsGenie, email, webhook, Microsoft Teams, Telegram, etc.
### Notification Policies

Routing rules (similar to Alertmanager):

```text
Default policy → #alerts Slack channel
  └── severity=critical → PagerDuty
  └── team=database     → db-team@example.com
```

## When to Use Grafana Alerting vs Alertmanager
|  | Alertmanager | Grafana Alerting |
|---|---|---|
| Data source | Prometheus only | Any Grafana data source |
| Config | YAML files (GitOps-friendly) | UI or provisioning API |
| Grouping/routing | Very flexible | Good, improving |
| Integration | Native with Prometheus | Works with any backend |
Many teams use both: Alertmanager for Prometheus metrics, Grafana alerting for Loki logs or multi-source alerts.
## Alert Design Best Practices

### Alert on Symptoms, Not Causes

```yaml
# Good: alert on what the user experiences
- alert: HighErrorRate
  expr: error_rate > 0.05

# Avoid: alert on internal cause (too noisy, might not affect users)
- alert: HighCpuUsage
  expr: node_cpu_usage > 0.90
```

High CPU doesn’t always mean a problem. A high error rate always does.
### Set Meaningful `for` Durations

```yaml
for: 5m   # good — avoids brief spikes triggering alerts
for: 0s   # bad — fires on every momentary blip
for: 1h   # bad — too slow to be useful
```

### Use Severity Levels
| Severity | Meaning | Action |
|---|---|---|
| `critical` | User-facing impact right now | Page on-call (PagerDuty) |
| `warning` | Degraded but not broken, or will become critical soon | Slack notification |
| `info` | FYI, no action needed | Dashboard or log only |
### Keep Alerts Actionable

Every alert should have a clear answer to: “What do I do when this fires?”

Include in annotations:

```yaml
annotations:
  summary: "Disk space below 10% on {{ $labels.instance }}"
  runbook_url: "https://wiki.example.com/runbooks/disk-space"
  description: "Current: {{ $value | humanizePercentage }}. Check /var/log for large files."
```

### Avoid Alert Fatigue
- Too many alerts → people ignore them all. Only alert on things that need human action.
- Group related alerts → use `group_by` to bundle similar alerts.
- Use inhibition → don’t page for warnings when a critical alert is already firing.
- Review alerts quarterly → delete alerts that never fire or always fire.
## Key Takeaways

- Prometheus alert rules fire when PromQL expressions are true for the `for` duration.
- Alertmanager routes, groups, and delivers notifications. Configure routing by severity/team/service.
- Grafana alerting can query any data source — use it for logs, multi-source alerts, or non-Prometheus backends.
- Alert on symptoms (error rate, latency), not causes (CPU, memory).
- Every alert should be actionable — include a runbook URL and clear description.
- Fight alert fatigue with grouping, inhibition, appropriate severity levels, and regular reviews.
## Related

- SLOs, SLIs, and error budgets — tie alerts to agreed targets and budget burn.
- Incident response and on-call — escalation, handoff, and sustainable paging.