---
name: designing-alerts
description: Use when creating, reviewing, or debugging Prometheus/Grafana alert rules - when writing PromQL for alerts, choosing thresholds, deciding alert severity, writing PrometheusRule CRDs, or evaluating whether something should be an alert at all.
---

# Designing Alerts

## Overview

Bad alerts are worse than no alerts: they cause alert fatigue and get ignored. Every alert must be actionable, symptom-based, and backed by real threshold data.

**Stack:** Mimir (datasource UID `mimir`) · Grafana at `grafana.monitoring.ctz.fyi` · Grafana alerting · PrometheusRule CRDs

## Cardinal Rules

1. **Actionable or bust**: if you can't do something about it right now, it's a dashboard, not an alert
2. **Symptoms, not causes**: "users can't reach service" > "CPU is high" > "pod restarted"
3. **Rates, not raw values**: `rate(errors[5m]) > 0.01` not `errors_total > 100`
4. **Always add `for:`**: minimum 2–5 minutes; filters out transient spikes
5. **Every alert needs a runbook**: `annotations.runbook_url` or at minimum a useful `description`
6. **Test your thresholds**: check p99 of historical data in Grafana Explore before picking a number

## Severity Levels

| Severity | Meaning | Response |
|---|---|---|
| `critical` | User-facing impact, wake someone up | Immediate |
| `warning` | Degraded but not down | Investigate within hours |
| `info` | FYI, no action required | Prefer dashboards instead |

## Workflow

```
1. Identify failure modes that matter for this service
2. Find the right metric (check dashboards, Explore, service docs)
3. Write PromQL; test in Grafana Explore using historical data
4. Pick threshold from p99 of normal values (not intuition)
5. Set for: duration (never < 2m)
6. Write description: what broke + current value + what to do first
7. Add runbook_url or BookStack link
8. Deploy as PrometheusRule CRD (preferred) or via Grafana UI
9. Verify alert appears, fires, and resolves correctly
```

## PrometheusRule CRD Pattern

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: -alerts
  namespace:
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: .rules
      interval: 60s
      rules:
        - alert: ServiceDown
          expr: up{job=""} == 0
          for: 5m
          labels:
            severity: critical
            team: infra
          annotations:
            summary: "{{ $labels.instance }} is down"
            description: "Service {{ $labels.job }} on {{ $labels.instance }} has been down > 5m. Check pod logs and events."
            runbook_url: "https://wiki.ctz.fyi/books/ansiblestack/page/runbook-"
```

## Common Alert Patterns

```yaml
# Service availability
- alert: ServiceUnreachable
  expr: up{job=~".*"} == 0
  for: 5m
  labels: {severity: critical}

# High error rate (5% for 5m)
# Aggregate both sides before dividing: without sum by (...), the 5xx series
# only match themselves in the denominator and the ratio is always 1.
- alert: HighErrorRate
  expr: |
    sum by (namespace, job) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (namespace, job) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels: {severity: critical}

# Pod crash looping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels: {severity: warning}

# Node memory pressure
- alert: NodeMemoryPressure
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
  for: 10m
  labels: {severity: warning}

# Disk space
- alert: DiskSpaceLow
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
       / node_filesystem_size_bytes{fstype!="tmpfs"}) > 0.85
  for: 15m
  labels: {severity: warning}

# Certificate expiry
- alert: CertificateExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
  for: 1h
  labels: {severity: critical}

# OpenBao sealed
- alert: OpenBaoSealed
  expr: vault_core_unsealed == 0
  for: 2m
  labels: {severity: critical}
```

## SLO-Based Alerting (Advanced)

For a 99.9% SLO (0.1% error budget):

```yaml
# Fast burn: consuming budget 14x faster than sustainable
- alert: SLOBurnRateFast
  expr: |
    (sum by (job) (rate(requests_total{status=~"5.."}[1h]))
      / sum by (job) (rate(requests_total[1h]))) > 14 * 0.001
  for: 5m
  labels: {severity: critical}
  annotations:
    description: "Error budget burning 14x too fast. 1h rate: {{ $value | humanizePercentage }}"

# Slow burn: consuming budget 2x faster than sustainable
# (at this rate the budget lasts half the SLO window)
- alert: SLOBurnRateSlow
  expr: |
    (sum by (job) (rate(requests_total{status=~"5.."}[6h]))
      / sum by (job) (rate(requests_total[6h]))) > 2 * 0.001
  for: 30m
  labels: {severity: warning}
```

## Anti-Patterns

| ❌ Bad | ✅ Better |
|---|---|
| `cpu_usage > 80` | CPU sustained high AND latency degraded |
| `pod_restarts > 0` | `rate(restarts[15m]) > 0` with `for: 5m` |
| No `for:` duration | Always add `for:`, minimum 2m |
| `severity: critical` on everything | Reserve critical for user-facing impact |
| "high X" with no context | What's normal? What's the impact? What to do? |
| Fires in staging/dev | Add `env="production"` label filter |
| Alert for every metric | Not everything needs an alert; use dashboards |

## Writing Good Descriptions

Template: **"[What broke] on [where]. Current value: {{ $value }}. [What to check first]."**

```yaml
# ❌ Bad
description: "High error rate detected"

# ✅ Good
description: "Error rate on {{ $labels.job }} is {{ $value | humanizePercentage }} (threshold: 5%). Check recent deployments and downstream dependencies. Logs: kubectl logs -n {{ $labels.namespace }} -l app={{ $labels.job }} --tail=100"
```
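
To show how the description template fits into a complete rule, here is a minimal sketch built on the high-error-rate pattern. It assumes `http_requests_total` carries `namespace`, `job`, and `status` labels. The `summary` wording and the `runbook_url` value are hypothetical placeholders rather than real pages. The expression aggregates with `sum by (namespace, job)` so that both labels survive into the alert and the template can reference them.

```yaml
- alert: HighErrorRate
  # Ratio of 5xx requests to all requests, per namespace and job.
  # Assumption: the metric exposes namespace, job, and status labels.
  expr: |
    sum by (namespace, job) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (namespace, job) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    # Hypothetical summary text; adjust to the service's wording.
    summary: "High 5xx rate on {{ $labels.job }}"
    # What broke + current value + what to check first.
    description: "Error rate on {{ $labels.job }} is {{ $value | humanizePercentage }} (threshold: 5%). Check recent deployments and downstream dependencies. Logs: kubectl logs -n {{ $labels.namespace }} -l app={{ $labels.job }} --tail=100"
    # Hypothetical placeholder URL, not an existing runbook.
    runbook_url: "https://wiki.example.invalid/runbooks/high-error-rate"
```

Because `$value` here is a ratio between 0 and 1, `humanizePercentage` renders it as a percentage in the notification. If the metric lacks a `namespace` label, the `kubectl logs` hint would need a hard-coded namespace instead.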