autojanet/skills/incident-response/SKILL.md
Zoë cc74ad0bd0
2026-05-30 15:43:14 -07:00


name: incident-response
description: Use when responding to production outages, data loss events, security incidents, or major service degradations in homelab (k3s/ansiblestack) or professional (AWS/EKS) environments. Applies at any severity, from P1 complete outages to P4 minor issues.

Incident Response

Overview

Structured response for production incidents. Scale the rigor to the severity: a homelab P3 does not need the process of a work P1.

Core principle: Stabilize user impact FIRST. Understand why SECOND. Never diagnose in silence.

Severity

| Severity | Definition | Response SLA | Examples |
|----------|------------|--------------|----------|
| P1 | Complete outage OR data loss OR security breach | Immediate (minutes) | Prod DB down, credentials leaked, all users blocked |
| P2 | Major degradation, SLA at risk, significant user impact | Urgent (< 30 min) | 50%+ error rate, primary feature broken |
| P3 | Partial degradation, workaround exists | Same day | One region/service slow, single feature broken |
| P4 | Minor issue, no user impact | Within days | Monitoring gap, cosmetic issue |

Phase 1: Triage (first 5-10 minutes)

Goal: confirm the incident, assess severity, start communication.

1. CONFIRM — is this actually broken?
   - Check from multiple locations/devices
   - Check AWS Status / DigitalOcean Status / upstream providers
   - Ask: is anyone else seeing this?

2. SCOPE — who/what is affected?
   - Which services? Which regions? Which users?
   - Is data being lost RIGHT NOW?
   - Stable or getting worse?

3. DECLARE — P1/P2: declare immediately, don't wait to diagnose
   - Work: post in incident channel, page on-call, open incident ticket
   - Homelab: create Vikunja task, start BookStack incident page

4. ASSIGN ROLES (work P1/P2)
   - Incident Commander: coordinates, communicates, makes calls
   - Tech Lead: root cause investigation
   - Comms Lead: stakeholder updates
   - (Homelab: you're all three)

Phase 2: Stabilize (before root cause)

Fix user impact first. Common actions:

# Roll back last deployment
kubectl rollout undo deployment/<name> -n <ns>

# Scale up healthy replicas
kubectl scale deploy/<name> --replicas=5 -n <ns>

# Check rollout history
kubectl rollout history deployment/<name> -n <ns>

Other mitigations:

  • Route traffic away from broken region/AZ
  • Disable the broken feature flag
  • Restore from backup (data loss)
  • Rotate credentials (security incident)

A rollback that takes 5 minutes beats a fix that takes 2 hours.
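
The rollback above is only useful once it has actually finished, so it helps to wait on it. A minimal sketch, assuming a POSIX-style shell with `local` (bash or dash); the function name `rollback_and_wait` and the default timeout are hypothetical, while `kubectl rollout undo` and `kubectl rollout status --timeout` are standard kubectl subcommands:

```shell
# Sketch: roll back a deployment, then block until the rollback completes.
# usage: rollback_and_wait <deployment> <namespace> [timeout]
rollback_and_wait() {
  local name="$1" ns="$2" timeout="${3:-120s}"
  if [ -z "$name" ] || [ -z "$ns" ]; then
    echo "usage: rollback_and_wait <deployment> <namespace> [timeout]" >&2
    return 2
  fi
  # Revert to the previous revision.
  kubectl rollout undo "deployment/$name" -n "$ns" || return 1
  # Returns non-zero if the rollout has not finished within the timeout.
  kubectl rollout status "deployment/$name" -n "$ns" --timeout="$timeout"
}
```

A non-zero exit means the rollback itself is stuck, which is the signal to move on to another mitigation rather than keep waiting.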

Phase 3: Investigate (root cause)

Now that users are unblocked:

# Recent events
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -30

# Logs (kubectl)
kubectl logs -n <ns> deploy/<name> --since=1h

# Logs (Grafana Loki)
{namespace="<ns>"}

# Describe node for resource pressure
kubectl describe node <name>

For AWS: CloudTrail, CloudWatch Logs, ALB access logs, X-Ray traces.

In Grafana, query the Mimir metrics around the time the anomaly started and find the inflection point, then correlate it with deploys and events from the same window.
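
Crash-looping pods are a common lead that the commands above do not surface directly. A small sketch under the same shell assumptions; the function names are hypothetical, and the sort key assumes each pod reports at least one container status:

```shell
# List pods in a namespace ordered by the restart count of their first container.
restart_report() {
  local ns="$1"
  kubectl get pods -n "$ns" --sort-by='.status.containerStatuses[0].restartCount'
}

# Show the tail of the logs from the previous (crashed) container instance.
previous_logs() {
  local ns="$1" pod="$2"
  kubectl logs -n "$ns" "$pod" --previous --tail=100
}
```

`--previous` matters because the current container's logs start after the restart and usually do not contain the crash.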

Phase 4: Resolve

  1. Deploy actual fix (not just the stabilization mitigation)
  2. Verify service is healthy — not just "pods are running":
    • Check error rates in Grafana
    • Check latency is normal
    • Spot-check actual user flows
  3. Monitor 15-30 minutes before declaring resolved
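
The monitoring window in step 3 can be approximated with a repeated health check. This is a sketch, not a replacement for watching error rates and latency in Grafana; the function name `soak` is hypothetical, and the check command is whatever probe fits the service:

```shell
# Run a check command repeatedly; succeed only if every attempt passes.
# usage: soak <attempts> <interval_seconds> <command...>
soak() {
  local attempts="$1" interval="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if ! "$@"; then
      echo "check failed on attempt $i/$attempts" >&2
      return 1
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "all $attempts checks passed"
}
```

For example, `soak 30 60 curl -fsS -o /dev/null https://example.internal/healthz` probes a (hypothetical) health endpoint once a minute for roughly half an hour and stops at the first failure.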

Phase 5: Communicate

During incident (P1/P2 — every 15-30 minutes):

[14:32 UTC] INCIDENT UPDATE — <service> degradation
Status: Investigating
Impact: <X users/services affected>
Last action: Rolled back deployment v1.2.3
Next update: 14:47 UTC

On resolution:

[15:10 UTC] RESOLVED — <service> is operational
Duration: 38 minutes (14:32-15:10 UTC)
Root cause: <brief description>
Fix applied: <what was done>
Postmortem: <link or "to follow within 48h">

Work P1: never go silent for > 15 minutes. Communicate first, diagnose second.
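
Committing to a concrete "next update" time is easier when it is computed rather than guessed. A sketch that fills in the update template above; `add_minutes` and `incident_update` are hypothetical helper names, times are assumed to be zero-padded `HH:MM`, and the default cadence of 15 minutes follows the P1 rule:

```shell
# Add minutes to an HH:MM time, wrapping at midnight.
add_minutes() {
  local h="${1%%:*}" m="${1##*:}"
  # Strip one leading zero so "08" is not parsed as invalid octal.
  local total=$(( (${h#0} * 60 + ${m#0} + $2) % 1440 ))
  printf '%02d:%02d\n' "$((total / 60))" "$((total % 60))"
}

# Print an incident update with the next update time filled in.
# usage: incident_update <HH:MM> <service> <status> <impact> <last action> [cadence_min]
incident_update() {
  local now="$1" service="$2" status="$3" impact="$4" action="$5" cadence="${6:-15}"
  cat <<EOF
[$now UTC] INCIDENT UPDATE - $service degradation
Status: $status
Impact: $impact
Last action: $action
Next update: $(add_minutes "$now" "$cadence") UTC
EOF
}
```

Calling `incident_update 14:32 api Investigating "all users" "Rolled back deployment v1.2.3"` ends with `Next update: 14:47 UTC`, matching the example above.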

Phase 6: Post-Incident

  • Within 24-48h: write postmortem (use writing-postmortem skill if available)
  • Update runbooks with anything that was missing
  • Create Vikunja tasks for action items
  • Save incident timeline to BookStack

Security Incidents: Extra Steps

Order matters — don't skip ahead:

  1. ISOLATE — kill or network-isolate the compromised resource before investigating
  2. PRESERVE — snapshot, export logs before destroying anything
  3. ROTATE — all potentially exposed credentials immediately
  4. NOTIFY — security team, CISO, legal as appropriate
  5. SCOPE before disclosing — do not announce publicly until you understand blast radius

GDPR: a personal data breach must be reported to the supervisory authority within 72 hours of becoming aware of it, unless it is unlikely to put individuals at risk.
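
For the ISOLATE step in Kubernetes, one option is to cut a pod off from the network while leaving it running, so that PRESERVE is still possible. A sketch, assuming the cluster's CNI enforces NetworkPolicy; the label `quarantine=true`, the policy name, and both function names are hypothetical:

```shell
# Print a NetworkPolicy that denies all ingress and egress for pods
# labeled quarantine=true (no allow rules are listed for either direction).
quarantine_policy() {
  local ns="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-deny-all
  namespace: $ns
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
EOF
}

# Apply the policy, then label the suspect pod so the policy selects it.
quarantine_pod() {
  local ns="$1" pod="$2"
  quarantine_policy "$ns" | kubectl apply -f - || return 1
  kubectl label pod "$pod" -n "$ns" quarantine=true --overwrite
}
```

Depending on the CNI, connections that are already established may not be dropped immediately, so treat this as containment, not proof that the pod is silent.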

Homelab Specifics

  • Create Vikunja task in relevant project when declaring
  • Document timeline in BookStack: Ansiblestack book → new page Incident YYYY-MM-DD: <title>
  • No stakeholder comms needed, but still write the postmortem — future-you will thank you

Common Homelab Incidents

| Incident | Quick fix |
|----------|-----------|
| OpenBao sealed | `kubectl exec -n openbao openbao-0 -- bao status`; it should auto-unseal via OCI KMS, so check the OCI KMS key status if it has not |
| ArgoCD all apps OutOfSync | Check Forgejo is reachable; check ArgoCD repo credentials |
| cert-manager not issuing | Check DNS propagation; check DigitalOcean token; check cert-manager pod logs |
| NFS storage unavailable | Check NFS server at 10.0.6.2; check pods in nfs-provisioner namespace |
| All pods evicted | Node disk pressure: `kubectl describe node <name>`, check disk usage |
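
For the eviction case, it can be quicker to see the pressure conditions of every node at once before describing a single one. A sketch using kubectl's JSONPath output; the function name `node_pressure` is hypothetical:

```shell
# Print each node name with its DiskPressure and MemoryPressure condition status.
node_pressure() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'
}
```

A node reporting `True` in the second column is under disk pressure and is the one to inspect with `kubectl describe node <name>`.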

Common Mistakes

| Mistake | Reality |
|---------|---------|
| Diagnosing in silence for 30+ minutes | Communicate first, even with "investigating" |
| Fixing before declaring | Declaration triggers backup/support; don't skip it |
| Declaring resolved before monitoring | Check error rates and latency, not just pod status |
| Investigating before stabilizing | Users are down while you read logs. Roll back first. |
| Skipping postmortem on homelab | You will hit this again. Write it down. |