| name | description |
|---|---|
| designing-alerts | Use when creating, reviewing, or debugging Prometheus/Grafana alert rules - when writing PromQL for alerts, choosing thresholds, deciding alert severity, writing PrometheusRule CRDs, or evaluating whether something should be an alert at all. |
Designing Alerts
Overview
Bad alerts are worse than no alerts — they cause alert fatigue and get ignored. Every alert must be actionable, symptom-based, and backed by real threshold data.
Stack: Mimir (datasource UID mimir) · Grafana at grafana.monitoring.ctz.fyi · Grafana alerting · PrometheusRule CRDs
Cardinal Rules
- Actionable or bust: if you can't do something about it right now, it's a dashboard, not an alert
- Symptoms, not causes: "users can't reach service" > "CPU is high" > "pod restarted"
- Rates, not raw values: rate(errors[5m]) > 0.01, not errors_total > 100
- Always add a for: clause (minimum 2–5 minutes); it eliminates transient spikes
- Every alert needs a runbook: annotations.runbook_url, or at minimum a useful description
- Test your thresholds: check p99 of historical data in Grafana Explore before picking a number
Severity Levels
| Severity | Meaning | Response |
|---|---|---|
| critical | User-facing impact, wake someone up | Immediate |
| warning | Degraded but not down | Investigate within hours |
| info | FYI, no action required | Prefer dashboards instead |
Workflow
1. Identify failure modes that matter for this service
2. Find the right metric (check dashboards, Explore, service docs)
3. Write PromQL — test in Grafana Explore using historical data
4. Pick threshold from p99 of normal values (not intuition)
5. Set for: duration (never < 2m)
6. Write description: what broke + current value + what to do first
7. Add runbook_url or BookStack link
8. Deploy as PrometheusRule CRD (preferred) or via Grafana UI
9. Verify alert appears, fires, and resolves correctly
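For step 4, a minimal sketch of reading the p99 of a signal's recent history, written as a recording rule so it stays in the same YAML form as the rest of this page. The http_requests_total metric, the 7-day lookback, and the group and rule names are assumptions for illustration; the expr body can also be pasted directly into Grafana Explore.

```yaml
groups:
  - name: threshold-exploration.rules          # hypothetical group name
    rules:
      # p99 of the per-job 5m error ratio over the last 7 days,
      # sampled every 5 minutes via a subquery.
      - record: job:http_error_ratio:p99_7d    # hypothetical rule name
        expr: |
          quantile_over_time(
            0.99,
            (
              sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
              /
              sum by (job) (rate(http_requests_total[5m]))
            )[7d:5m]
          )
```

A threshold set somewhat above the returned value is one reasonable starting point. Widen the lookback if the service has weekly or monthly traffic cycles.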
PrometheusRule CRD Pattern
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <service>-alerts
  namespace: <namespace>
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: <service>.rules
      interval: 60s
      rules:
        - alert: ServiceDown
          expr: up{job="<service>"} == 0
          for: 5m
          labels:
            severity: critical
            team: infra
          annotations:
            summary: "{{ $labels.instance }} is down"
            description: "Service {{ $labels.job }} on {{ $labels.instance }} has been down > 5m. Check pod logs and events."
            runbook_url: "https://wiki.ctz.fyi/books/ansiblestack/page/runbook-<service>"
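Before deploying, the ServiceDown rule can be exercised offline with promtool test rules. A sketch under two assumptions: the groups: content of the spec above has been saved as a plain rules file named service-alerts.rules.yaml (a hypothetical name; promtool reads plain rule files, not the CRD wrapper), and <service> has been replaced by myservice.

```yaml
# alerts_test.yaml (hypothetical file name)
# Run with: promtool test rules alerts_test.yaml
rule_files:
  - service-alerts.rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Up for the first two samples, then down from t=2m onward.
      - series: 'up{job="myservice", instance="myservice-0"}'
        values: '1 1 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # At 3m the target has been down for 1m: still pending, nothing firing.
      - eval_time: 3m
        alertname: ServiceDown
        exp_alerts: []
      # At 8m the target has been down for 6m, past the 5m for: window.
      - eval_time: 8m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infra
              job: myservice
              instance: myservice-0
            exp_annotations:
              summary: "myservice-0 is down"
              description: "Service myservice on myservice-0 has been down > 5m. Check pod logs and events."
              runbook_url: "https://wiki.ctz.fyi/books/ansiblestack/page/runbook-myservice"
```

Expected annotations must match the rendered templates exactly, so this test also catches typos in description text. Resolution can be checked the same way by appending 1 samples and asserting an empty exp_alerts at a later eval_time.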
Common Alert Patterns
# Service availability
- alert: ServiceUnreachable
  expr: up{job=~"<service>.*"} == 0
  for: 5m
  labels: {severity: critical}
# High error rate (5% for 5m)
# Both sides are aggregated by job: without sum(), each 5xx series only
# matches itself in the division and the ratio is always 1.
- alert: HighErrorRate
  expr: |
    sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (job) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels: {severity: critical}
# Pod crash looping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels: {severity: warning}
# Node memory pressure
- alert: NodeMemoryPressure
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
  for: 10m
  labels: {severity: warning}
# Disk space
- alert: DiskSpaceLow
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
      / node_filesystem_size_bytes{fstype!="tmpfs"}) > 0.85
  for: 15m
  labels: {severity: warning}
# Certificate expiry
- alert: CertificateExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
  for: 1h
  labels: {severity: critical}
# OpenBao sealed
- alert: OpenBaoSealed
  expr: vault_core_unsealed == 0
  for: 2m
  labels: {severity: critical}
SLO-Based Alerting (Advanced)
For a 99.9% SLO (0.1% error budget):
# Fast burn: consuming budget 14x faster than sustainable
# Both sides are aggregated by job so 5xx requests divide by all requests.
- alert: SLOBurnRateFast
  expr: |
    (sum by (job) (rate(requests_total{status=~"5.."}[1h]))
      / sum by (job) (rate(requests_total[1h]))) > 14 * 0.001
  for: 5m
  labels: {severity: critical}
  annotations:
    description: "Error budget burning 14x too fast. 1h rate: {{ $value | humanizePercentage }}"
# Slow burn: consuming budget 2x faster than sustainable (exhausts it in half the SLO window)
- alert: SLOBurnRateSlow
  expr: |
    (sum by (job) (rate(requests_total{status=~"5.."}[6h]))
      / sum by (job) (rate(requests_total[6h]))) > 2 * 0.001
  for: 30m
  labels: {severity: warning}
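The two thresholds follow from the burn-rate definition. A worked sketch, assuming a 30-day SLO window W (the window length is not stated here, so the exhaustion times are illustrative only):

```latex
\begin{aligned}
\text{error budget} &= 1 - \text{SLO} = 1 - 0.999 = 0.001 \\
\text{burn rate } B &= \frac{\text{observed error ratio}}{\text{error budget}} \\
\text{fast-burn threshold} &= 14 \times 0.001 = 0.014 \quad (1.4\%\ \text{errors}) \\
\text{slow-burn threshold} &= 2 \times 0.001 = 0.002 \quad (0.2\%\ \text{errors}) \\
\text{time to exhaust budget} &= \frac{W}{B}: \quad
  \frac{30\ \text{d}}{14} \approx 2.1\ \text{d}, \qquad
  \frac{30\ \text{d}}{2} = 15\ \text{d}
\end{aligned}
```

With a 7-day window the same burn rates exhaust the budget in 0.5 days and 3.5 days respectively.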
Anti-Patterns
| ❌ Bad | ✅ Better |
|---|---|
| cpu_usage > 80 | CPU sustained high AND latency degraded |
| pod_restarts > 0 | rate(restarts[15m]) > 0 with for: 5m |
| No for: duration | Always add for:, minimum 2m |
| severity: critical on everything | Reserve critical for user-facing impact |
| "high X" with no context | What's normal? What's the impact? What to do? |
| Fires in staging/dev | Add env="production" label filter |
| Alert for every metric | Not everything needs an alert; use dashboards |
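The CPU row and the staging/dev row can be combined in one rule: alert on high CPU only while latency is also degraded, and only in production. A sketch; the metric names process_cpu_seconds_total and http_request_duration_seconds_bucket, the env label, the alert name, and both thresholds are assumptions to be replaced with values taken from the service's own data.

```yaml
- alert: HighCpuWithLatencyDegraded          # hypothetical alert name
  expr: |
    (
      # Average CPU cores used per instance of the job
      avg by (job) (rate(process_cpu_seconds_total{env="production"}[5m])) > 0.8
    )
    and on (job)
    (
      # p99 request latency in seconds for the same job
      histogram_quantile(
        0.99,
        sum by (job, le) (rate(http_request_duration_seconds_bucket{env="production"}[5m]))
      ) > 0.5
    )
  for: 10m
  labels: {severity: warning}
  annotations:
    description: "CPU on {{ $labels.job }} averages {{ $value | humanize }} cores per instance while p99 latency exceeds 0.5s. Check recent deployments and traffic levels."
```

The and operator keeps the left-hand value, so $value in the description is the CPU figure, not the latency.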
Writing Good Descriptions
Template: "[What broke] on [where]. Current value: {{ $value }}. [What to check first]."
# ❌ Bad
description: "High error rate detected"
# ✅ Good
description: "Error rate on {{ $labels.job }} is {{ $value | humanizePercentage }}
  (threshold: 5%). Check recent deployments and downstream dependencies.
  Logs: kubectl logs -n {{ $labels.namespace }} -l app={{ $labels.job }} --tail=100"