fix: use library/ Harbor project, add skills, fix pipeline secrets

- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/

2026-05-30 15:43:14 -07:00

name	description
designing-alerts	Use when creating, reviewing, or debugging Prometheus/Grafana alert rules - when writing PromQL for alerts, choosing thresholds, deciding alert severity, writing PrometheusRule CRDs, or evaluating whether something should be an alert at all.

Designing Alerts

Overview

Bad alerts are worse than no alerts — they cause alert fatigue and get ignored. Every alert must be actionable, symptom-based, and backed by real threshold data.

Stack: Mimir (datasource UID mimir) · Grafana at grafana.monitoring.ctz.fyi · Grafana alerting · PrometheusRule CRDs

Cardinal Rules

Actionable or bust — if you can't do something about it right now, it's a dashboard, not an alert
Symptoms, not causes — "users can't reach service" > "CPU is high" > "pod restarted"
Rates, not raw values — rate(errors[5m]) > 0.01 not errors_total > 100
Always add for: — minimum 2–5 minutes; eliminates transient spikes
Every alert needs a runbook — annotations.runbook_url or at minimum a useful description
Test your thresholds — check p99 of historical data in Grafana Explore before picking a number
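
The last rule can be made concrete with a PromQL subquery. This is a minimal sketch, assuming the service exposes http_requests_total with a status label and that about 7 days of history is retained. It is meant to be pasted into Grafana Explore, not deployed; the `explore_query` key is a hypothetical wrapper used only to keep the query in the same YAML shape as the rules.

```yaml
# Not a deployable rule: a one-off query for Grafana Explore.
# `explore_query` is a hypothetical wrapper key used only for layout.
explore_query:
  # p99 of the 5-minute error ratio over the last 7 days, sampled every 5 minutes.
  # A threshold above this value should stay quiet during normal operation.
  expr: |
    quantile_over_time(
      0.99,
      (
        sum by (namespace, job) (rate(http_requests_total{status=~"5.."}[5m]))
        / sum by (namespace, job) (rate(http_requests_total[5m]))
      )[7d:5m]
    )
```

If the result is, say, well below 1%, a 5% threshold leaves headroom; if it is close to the planned threshold, expect the alert to be noisy.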

Severity Levels

Severity	Meaning	Response
`critical`	User-facing impact, wake someone up	Immediate
`warning`	Degraded but not down	Investigate within hours
`info`	FYI, no action required	Prefer dashboards instead
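
One way to apply the table is to alert on the same symptom at two severities. The sketch below reuses the disk-space expression from the common patterns in this document; the 85% and 95% thresholds and the `for:` durations are illustrative assumptions, not measured values.

```yaml
# Same symptom, two severities. Thresholds are illustrative assumptions.
- alert: DiskSpaceLow
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
      / node_filesystem_size_bytes{fstype!="tmpfs"}) > 0.85
  for: 15m
  labels: {severity: warning}    # degraded: investigate within hours

- alert: DiskSpaceCritical
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
      / node_filesystem_size_bytes{fstype!="tmpfs"}) > 0.95
  for: 5m
  labels: {severity: critical}   # writes are about to fail: respond immediately
```

Above 95% both rules are active at once. Whether the warning is suppressed in that case depends on notification routing, which this document does not cover.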

Workflow

1. Identify failure modes that matter for this service
2. Find the right metric (check dashboards, Explore, service docs)
3. Write PromQL — test in Grafana Explore using historical data
4. Pick threshold from p99 of normal values (not intuition)
5. Set for: duration (never < 2m)
6. Write description: what broke + current value + what to do first
7. Add runbook_url or BookStack link
8. Deploy as PrometheusRule CRD (preferred) or via Grafana UI
9. Verify alert appears, fires, and resolves correctly
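
Part of step 9 can be checked before deployment with a rule unit test. The sketch below targets the ServiceDown rule from the CRD pattern in this document. None of the following is stated in the source, so treat it as assumption: that `promtool` is available, that the `spec.groups` content of the CRD has been saved as a plain rules file named `service-alerts.rules.yaml` (hypothetical), and that `<service>` has been replaced by the hypothetical job name `myservice`.

```yaml
# service-alerts_test.yaml (hypothetical name)
# Run with: promtool test rules service-alerts_test.yaml
rule_files:
  - service-alerts.rules.yaml    # plain rules file with a top-level `groups:` key

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Up for three samples (0m to 2m), then down from 3m onward.
      - series: 'up{job="myservice", instance="10.0.0.1:8080"}'
        values: '1 1 1 0+0x12'
    alert_rule_test:
      # 6m: down for only 3m, still pending under `for: 5m`, so nothing fires.
      - eval_time: 6m
        alertname: ServiceDown
        exp_alerts: []
      # 10m: down for 7m, so the alert is firing with its rendered annotations.
      - eval_time: 10m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infra
              job: myservice
              instance: "10.0.0.1:8080"
            exp_annotations:
              summary: "10.0.0.1:8080 is down"
              description: "Service myservice on 10.0.0.1:8080 has been down > 5m. Check pod logs and events."
              runbook_url: "https://wiki.ctz.fyi/books/ansiblestack/page/runbook-myservice"
```

The test covers "appears" and "fires" only. Confirming that the alert resolves and that notifications arrive still needs a check against the live stack.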

PrometheusRule CRD Pattern

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <service>-alerts
  namespace: <namespace>
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: <service>.rules
      interval: 60s
      rules:
        - alert: ServiceDown
          expr: up{job="<service>"} == 0
          for: 5m
          labels:
            severity: critical
            team: infra
          annotations:
            summary: "{{ $labels.instance }} is down"
            description: "Service {{ $labels.job }} on {{ $labels.instance }} has been down > 5m. Check pod logs and events."
            runbook_url: "https://wiki.ctz.fyi/books/ansiblestack/page/runbook-<service>"
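
One gap in the pattern above is worth covering: `up == 0` returns no series at all when the target disappears from service discovery, so ServiceDown cannot fire in that case. A hedged sketch of a companion rule, indented to sit in the same `rules:` list; the alert name and annotation wording are assumptions.

```yaml
        # Companion to ServiceDown: fires when no `up` series exists for the job,
        # for example after the scrape target was removed or never discovered.
        - alert: ServiceAbsent
          expr: absent(up{job="<service>"})
          for: 5m
          labels:
            severity: critical
            team: infra
          annotations:
            summary: "No targets found for job <service>"
            description: "No up series exists for job <service> for > 5m. Check the scrape configuration and whether any pods are running."
```

`absent()` carries over only the labels from equality matchers in its selector, so `$labels.instance` is not available here and the annotations name the job literally.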

Common Alert Patterns

# Service availability
- alert: ServiceUnreachable
  expr: up{job=~"<service>.*"} == 0
  for: 5m
  labels: {severity: critical}

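# High error rate (5% for 5m)
# Aggregate both sides: without sum, each 5xx series is divided only by itself
- alert: HighErrorRate
  expr: |
    sum by (namespace, job) (rate(http_requests_total{status=~"5.."}[5m]))
    / sum by (namespace, job) (rate(http_requests_total[5m])) > 0.05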
  for: 5m
  labels: {severity: critical}

# Pod crash looping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels: {severity: warning}

# Node memory pressure
- alert: NodeMemoryPressure
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
  for: 10m
  labels: {severity: warning}

# Disk space
- alert: DiskSpaceLow
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
      / node_filesystem_size_bytes{fstype!="tmpfs"}) > 0.85    
  for: 15m
  labels: {severity: warning}

# Certificate expiry
- alert: CertificateExpiringSoon
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 7 * 24 * 3600
  for: 1h
  labels: {severity: critical}

# OpenBao sealed
- alert: OpenBaoSealed
  expr: vault_core_unsealed == 0
  for: 2m
  labels: {severity: critical}

SLO-Based Alerting (Advanced)

For a 99.9% SLO (0.1% error budget):

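# Fast burn: consuming budget 14x faster than sustainable
# (sum by (job) so the 5xx series are divided by all requests, not by themselves)
- alert: SLOBurnRateFast
  expr: |
    (sum by (job) (rate(requests_total{status=~"5.."}[1h]))
    / sum by (job) (rate(requests_total[1h]))) > 14 * 0.001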
  for: 5m
  labels: {severity: critical}
  annotations:
    description: "Error budget burning 14x too fast. 1h rate: {{ $value | humanizePercentage }}"

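# Slow burn: at 2x the sustainable rate the budget lasts half the SLO window
# (~3.5 days for a 7-day window, 15 days for a 30-day window)
- alert: SLOBurnRateSlow
  expr: |
    (sum by (job) (rate(requests_total{status=~"5.."}[6h]))
    / sum by (job) (rate(requests_total[6h]))) > 2 * 0.001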
  for: 30m
  labels: {severity: warning}
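
A single long window keeps an alert firing well after the errors have stopped. A common refinement is to require the same burn rate over a short window too, so the alert resolves soon after recovery. The sketch below assumes the same requests_total metric and 14x factor as the fast-burn rule; the 5m short window (one twelfth of the long window), the alert name, and the `for:` value are assumptions.

```yaml
# Multi-window variant of the fast-burn alert (sketch, not a drop-in replacement).
- alert: SLOBurnRateFastMultiWindow
  expr: |
    (
      sum by (job) (rate(requests_total{status=~"5.."}[1h]))
      / sum by (job) (rate(requests_total[1h])) > 14 * 0.001
    )
    and on (job)
    (
      sum by (job) (rate(requests_total{status=~"5.."}[5m]))
      / sum by (job) (rate(requests_total[5m])) > 14 * 0.001
    )
  for: 2m
  labels: {severity: critical}
  annotations:
    # `and` keeps the left-hand (1h) value, so $value is the 1h error ratio.
    description: "Error budget for {{ $labels.job }} burning 14x too fast. 1h rate: {{ $value | humanizePercentage }}"
```

The 1h condition shows that a meaningful share of the budget is gone; the 5m condition shows that the burn is still happening.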

Anti-Patterns

❌ Bad	✅ Better
`cpu_usage > 80`	CPU sustained high AND latency degraded
`pod_restarts > 0`	`rate(restarts[15m]) > 0` with `for: 5m`
No `for:` duration	Always add `for:`, minimum 2m
`severity: critical` on everything	Reserve critical for user-facing impact
"high X" with no context	What's normal? What's the impact? What to do?
Fires in staging/dev	Add `env="production"` label filter
Alert for every metric	Not everything needs an alert; use dashboards
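
The first and sixth rows can be combined into one rule: alert on high CPU only when latency is also degraded, and only in production. This is a sketch under stated assumptions: the metric names process_cpu_seconds_total and http_request_duration_seconds_bucket, the `env` label, and both thresholds are hypothetical and must be checked against what the service actually exports.

```yaml
# Sketch only: metric names, the env label, and both thresholds are assumptions.
- alert: HighCpuWithLatencyDegraded
  expr: |
    (
      # Average CPU cores used per instance of the job.
      avg by (job) (rate(process_cpu_seconds_total{env="production"}[5m])) > 0.8
    )
    and on (job)
    (
      # p99 request latency in seconds, aggregated across instances.
      histogram_quantile(
        0.99,
        sum by (job, le) (rate(http_request_duration_seconds_bucket{env="production"}[5m]))
      ) > 0.5
    )
  for: 5m
  labels: {severity: warning}
  annotations:
    description: "p99 latency on {{ $labels.job }} is above 500ms while CPU usage is high. Check recent deployments and consider scaling out."
```

With `and`, the result takes its value from the left-hand side, so `$value` here would be the CPU figure. The description therefore states the latency threshold in words and does not interpolate it.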

Writing Good Descriptions

Template: "[What broke] on [where]. Current value: {{ $value }}. [What to check first]."

# ❌ Bad
description: "High error rate detected"

# ✅ Good
description: "Error rate on {{ $labels.job }} is {{ $value | humanizePercentage }}
  (threshold: 5%). Check recent deployments and downstream dependencies.
  Logs: kubectl logs -n {{ $labels.namespace }} -l app={{ $labels.job }} --tail=100"
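
For reference, a sketch of a complete annotations block built from the template above. The summary wording is an assumption, and the runbook path keeps the `<service>` placeholder from the CRD pattern.

```yaml
# Sketch of a complete annotations block; the runbook path is a placeholder.
annotations:
  summary: "High 5xx error rate on {{ $labels.job }}"
  # `>-` folds the lines into a single string, so inner quotes need no escaping.
  description: >-
    Error rate on {{ $labels.job }} is {{ $value | humanizePercentage }}
    (threshold: 5%). Check recent deployments and downstream dependencies first.
  runbook_url: "https://wiki.ctz.fyi/books/ansiblestack/page/runbook-<service>"
```

Labels used in a template exist only if the alert expression preserves them: after `sum by (job)`, only `$labels.job` is available.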
