autojanet/skills/designing-alerts/SKILL.md
Zoë cc74ad0bd0
2026-05-30 15:43:14 -07:00


---
name: designing-alerts
description: Use when creating, reviewing, or debugging Prometheus/Grafana alert rules - when writing PromQL for alerts, choosing thresholds, deciding alert severity, writing PrometheusRule CRDs, or evaluating whether something should be an alert at all.
---

Designing Alerts

Overview

Bad alerts are worse than no alerts — they cause alert fatigue and get ignored. Every alert must be actionable, symptom-based, and backed by real threshold data.

Stack: Mimir (datasource UID mimir) · Grafana at grafana.monitoring.ctz.fyi · Grafana alerting · PrometheusRule CRDs

Cardinal Rules

  1. Actionable or bust: if you can't do something about it right now, it's a dashboard, not an alert
  2. Symptoms, not causes: "users can't reach service" > "CPU is high" > "pod restarted"
  3. Rates, not raw values: `rate(errors[5m]) > 0.01`, not `errors_total > 100`
  4. Always add `for:` with a minimum of 2-5 minutes; this filters out transient spikes
  5. Every alert needs a runbook: `annotations.runbook_url` or at minimum a useful description
  6. Test your thresholds: check p99 of historical data in Grafana Explore before picking a number

Severity Levels

| Severity | Meaning | Response |
| --- | --- | --- |
| critical | User-facing impact, wake someone up | Immediate |
| warning | Degraded but not down | Investigate within hours |
| info | FYI, no action required | Prefer dashboards instead |

Workflow

1. Identify failure modes that matter for this service
2. Find the right metric (check dashboards, Explore, service docs)
3. Write PromQL — test in Grafana Explore using historical data
4. Pick threshold from p99 of normal values (not intuition)
5. Set for: duration (never < 2m)
6. Write description: what broke + current value + what to do first
7. Add runbook_url or BookStack link
8. Deploy as PrometheusRule CRD (preferred) or via Grafana UI
9. Verify alert appears, fires, and resolves correctly
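
Steps 3 to 5 can be sketched as a rule group. This is a minimal sketch under stated assumptions, not a tested deployment: the `checkout` job, the recording rule name, the 7-day lookback, and the `0.02` threshold are hypothetical placeholders, and `http_requests_total` is assumed to carry a `status` label. The group would sit under `spec.groups` of a PrometheusRule.

```yaml
groups:
  - name: checkout.rules                # hypothetical service
    rules:
      # Step 3: record the signal once so Explore and the alert read the same series.
      - record: job:http_requests_errors:ratio_rate5m   # hypothetical name
        expr: |
          sum by (job) (rate(http_requests_total{job="checkout", status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total{job="checkout"}[5m]))

      # Step 4: run this in Grafana Explore (it is not deployed) to see the
      # p99 of normal behaviour over the last 7 days:
      #   quantile_over_time(0.99, job:http_requests_errors:ratio_rate5m[7d])

      # Step 5: alert above the observed p99, with a for: of at least 2m.
      - alert: CheckoutHighErrorRate
        expr: job:http_requests_errors:ratio_rate5m > 0.02   # placeholder threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "Error ratio on {{ $labels.job }} is {{ $value | humanizePercentage }} (threshold: 2%). Check recent deployments first."
```

A recorded series only has history from the moment the rule is deployed. Before that, the same lookback can be run in Explore as a subquery over the raw expression, for example `quantile_over_time(0.99, (<expression>)[7d:5m])`.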

PrometheusRule CRD Pattern

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <service>-alerts
  namespace: <namespace>
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: <service>.rules
      interval: 60s
      rules:
        - alert: ServiceDown
          expr: up{job="<service>"} == 0
          for: 5m
          labels:
            severity: critical
            team: infra
          annotations:
            summary: "{{ $labels.instance }} is down"
            description: "Service {{ $labels.job }} on {{ $labels.instance }} has been down > 5m. Check pod logs and events."
            runbook_url: "https://wiki.ctz.fyi/books/ansiblestack/page/runbook-<service>"

Common Alert Patterns

# Service availability
- alert: ServiceUnreachable
  expr: up{job=~"<service>.*"} == 0
  for: 5m
  labels: {severity: critical}

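# High error rate (5% for 5m)
# Both sides are aggregated so 5xx traffic is divided by total traffic;
# without sum(), each 5xx series would only match itself and the ratio would be 1.
- alert: HighErrorRate
  expr: |
    sum by (namespace, job) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (namespace, job) (rate(http_requests_total[5m])) > 0.05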
  for: 5m
  labels: {severity: critical}

# Pod crash looping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels: {severity: warning}

# Node memory pressure
- alert: NodeMemoryPressure
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
  for: 10m
  labels: {severity: warning}

# Disk space
- alert: DiskSpaceLow
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
      / node_filesystem_size_bytes{fstype!="tmpfs"}) > 0.85    
  for: 15m
  labels: {severity: warning}

# Certificate expiry
- alert: CertificateExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
  for: 1h
  labels: {severity: critical}

# OpenBao sealed
- alert: OpenBaoSealed
  expr: vault_core_unsealed == 0
  for: 2m
  labels: {severity: critical}

SLO-Based Alerting (Advanced)

For a 99.9% SLO (0.1% error budget):

# Fast burn: consuming budget 14x faster than sustainable
- alert: SLOBurnRateFast
  expr: |
    (sum by (job) (rate(requests_total{status=~"5.."}[1h]))
    / sum by (job) (rate(requests_total[1h]))) > 14 * 0.001
  for: 5m
  labels: {severity: critical}
  annotations:
    description: "Error budget burning 14x too fast. 1h rate: {{ $value | humanizePercentage }}"

# Slow burn: consuming budget 2x faster than sustainable (gone in half the SLO window)
- alert: SLOBurnRateSlow
  expr: |
    (sum by (job) (rate(requests_total{status=~"5.."}[6h]))
    / sum by (job) (rate(requests_total[6h]))) > 2 * 0.001
  for: 30m
  labels: {severity: warning}
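
The thresholds in these two rules follow from simple arithmetic, worked through here. The SLO window $W$ is not stated above; a 30-day window is assumed purely for illustration, so the exhaustion times change with a different window.

```latex
\begin{align*}
\text{error budget} &= 1 - \text{SLO} = 1 - 0.999 = 0.001 \\
\text{burn rate } b &= \frac{\text{observed error ratio}}{\text{error budget}} \\
\text{fast-burn threshold} &= 14 \times 0.001 = 0.014 \quad (1.4\%\ \text{errors over 1h}) \\
\text{slow-burn threshold} &= 2 \times 0.001 = 0.002 \quad (0.2\%\ \text{errors over 6h}) \\
\text{time to exhaust budget} &= \frac{W}{b}: \quad
  \frac{30\,\text{d}}{14} \approx 2.1\,\text{d}, \qquad \frac{30\,\text{d}}{2} = 15\,\text{d}
\end{align*}
```

A burn rate of 1 spends the budget exactly at the end of the window. Anything above 1 spends it early, which is why the multiplier is what sets the severity, not the raw error percentage.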

Anti-Patterns

| Bad | Better |
| --- | --- |
| `cpu_usage > 80` | CPU sustained high AND latency degraded |
| `pod_restarts > 0` | `rate(restarts[15m]) > 0` with `for: 5m` |
| No `for:` duration | Always add `for:`, minimum 2m |
| `severity: critical` on everything | Reserve critical for user-facing impact |
| "high X" with no context | What's normal? What's the impact? What to do? |
| Fires in staging/dev | Add `env="production"` label filter |
| Alert for every metric | Not everything needs an alert; use dashboards |
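
The first and sixth rows can be combined into one rule. The sketch below is illustrative only: it assumes the service exposes `process_cpu_seconds_total` and a latency histogram named `http_request_duration_seconds_bucket`, that both carry an `env` label, and that 0.8 cores and 0.5s are meaningful limits. None of those names or numbers come from this stack, so check them in Explore first.

```yaml
# Hypothetical rule: fires only when CPU is high AND users feel it.
- alert: HighCpuWithLatencyImpact
  expr: |
    (
      # Average CPU cores used per instance of the job (cause)
      avg by (job) (rate(process_cpu_seconds_total{env="production"}[5m])) > 0.8
    )
    and on (job)
    (
      # p99 request latency in seconds for the same job (symptom)
      histogram_quantile(0.99,
        sum by (job, le) (rate(http_request_duration_seconds_bucket{env="production"}[5m]))
      ) > 0.5
    )
  for: 10m
  labels: {severity: warning}
  annotations:
    description: "CPU on {{ $labels.job }} is at {{ $value | humanize }} cores and p99 latency is above 0.5s. Check recent deployments and traffic levels."
```

Both sides are aggregated to `job` so that `and on (job)` has a common label to match on. With `and`, the value reported in `{{ $value }}` is the left-hand side, here the CPU figure.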

Writing Good Descriptions

Template: "[What broke] on [where]. Current value: {{ $value }}. [What to check first]."

# ❌ Bad
description: "High error rate detected"

# ✅ Good
description: "Error rate on {{ $labels.job }} is {{ $value | humanizePercentage }}
  (threshold: 5%). Check recent deployments and downstream dependencies.
  Logs: kubectl logs -n {{ $labels.namespace }} -l app={{ $labels.job }} --tail=100"
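
Templates can only reference labels that survive the alert expression, so the description and the `expr` need to be written together. The following sketch pairs the good description with an expression that keeps `namespace` and `job`. The metric name and the assumption that the `job` label matches the pod's `app` label are illustrative, and the runbook URL keeps the `<service>` placeholder used in the CRD pattern.

```yaml
- alert: HighErrorRate
  expr: |
    # namespace and job are kept by the aggregation, so both are usable below
    sum by (namespace, job) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (namespace, job) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate on {{ $labels.job }}"
    description: >-
      Error rate on {{ $labels.job }} is {{ $value | humanizePercentage }}
      (threshold: 5%). Check recent deployments and downstream dependencies.
      Logs: kubectl logs -n {{ $labels.namespace }} -l app={{ $labels.job }} --tail=100
    runbook_url: "https://wiki.ctz.fyi/books/ansiblestack/page/runbook-<service>"
```

A label dropped by `sum by (...)` renders as an empty string in the annotation, which would leave the `kubectl` command incomplete. That is worth checking when the alert first fires.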