--- name: incident-response description: Use when responding to production outages, data loss events, security incidents, or major service degradations in homelab (k3s/ansiblestack) or professional (AWS/EKS) environments. Applies at any severity — P1 complete outages to P4 minor issues. --- # Incident Response ## Overview Structured response for production incidents. Severity scales the rigor. Homelab P3 is not work P1. **Core principle:** Stabilize user impact FIRST. Understand why SECOND. Never diagnose in silence. ## Severity | Severity | Definition | Response SLA | Examples | |----------|------------|--------------|---------| | P1 | Complete outage OR data loss OR security breach | Immediate (minutes) | Prod DB down, credentials leaked, all users blocked | | P2 | Major degradation, SLA at risk, significant user impact | Urgent (< 30 min) | 50%+ error rate, primary feature broken | | P3 | Partial degradation, workaround exists | Same day | One region/service slow, single feature broken | | P4 | Minor issue, no user impact | Within days | Monitoring gap, cosmetic issue | ## Phase 1: Triage (first 5-10 minutes) Goal: confirm the incident, assess severity, start communication. ``` 1. CONFIRM — is this actually broken? - Check from multiple locations/devices - Check AWS Status / DigitalOcean Status / upstream providers - Ask: is anyone else seeing this? 2. SCOPE — who/what is affected? - Which services? Which regions? Which users? - Is data being lost RIGHT NOW? - Stable or getting worse? 3. DECLARE — P1/P2: declare immediately, don't wait to diagnose - Work: post in incident channel, page on-call, open incident ticket - Homelab: create Vikunja task, start BookStack incident page 4. ASSIGN ROLES (work P1/P2) - Incident Commander: coordinates, communicates, makes calls - Tech Lead: root cause investigation - Comms Lead: stakeholder updates - (Homelab: you're all three) ``` ## Phase 2: Stabilize (before root cause) Fix user impact first. Common actions: ```bash # Roll back last deployment kubectl rollout undo deployment/ -n # Scale up healthy replicas kubectl scale deploy/ --replicas=5 -n # Check rollout history kubectl rollout history deployment/ -n ``` Other mitigations: - Route traffic away from broken region/AZ - Disable the broken feature flag - Restore from backup (data loss) - Rotate credentials (security incident) **A rollback that takes 5 minutes beats a fix that takes 2 hours.** ## Phase 3: Investigate (root cause) Now that users are unblocked: ```bash # Recent events kubectl get events -n --sort-by='.lastTimestamp' | tail -30 # Logs (kubectl) kubectl logs -n deploy/ --since=1h # Logs (Grafana Loki) {namespace=""} # Describe node for resource pressure kubectl describe node ``` For AWS: CloudTrail, CloudWatch Logs, ALB access logs, X-Ray traces. Check Grafana Mimir for the anomaly timestamp — find the inflection point. ## Phase 4: Resolve 1. Deploy actual fix (not just the stabilization mitigation) 2. Verify service is healthy — not just "pods are running": - Check error rates in Grafana - Check latency is normal - Spot-check actual user flows 3. Monitor 15-30 minutes before declaring resolved ## Phase 5: Communicate **During incident (P1/P2 — every 15-30 minutes):** ``` [14:32 UTC] INCIDENT UPDATE — degradation Status: Investigating Impact: Last action: Rolled back deployment v1.2.3 Next update: 14:47 UTC ``` **On resolution:** ``` [15:10 UTC] RESOLVED — is operational Duration: 38 minutes (14:32–15:10 UTC) Root cause: Fix applied: Postmortem: ``` **Work P1: never go silent for > 15 minutes. Communicate first, diagnose second.** ## Phase 6: Post-Incident - Within 24-48h: write postmortem (use `writing-postmortem` skill if available) - Update runbooks with anything that was missing - Create Vikunja tasks for action items - Save incident timeline to BookStack ## Security Incidents: Extra Steps Order matters — don't skip ahead: 1. **ISOLATE** — kill or network-isolate the compromised resource before investigating 2. **PRESERVE** — snapshot, export logs before destroying anything 3. **ROTATE** — all potentially exposed credentials immediately 4. **NOTIFY** — security team, CISO, legal as appropriate 5. **SCOPE before disclosing** — do not announce publicly until you understand blast radius GDPR: data breaches require regulatory notification within 72 hours. ## Homelab Specifics - Create Vikunja task in relevant project when declaring - Document timeline in BookStack: `Ansiblestack` book → new page `Incident YYYY-MM-DD: ` - No stakeholder comms needed, but still write the postmortem — future-you will thank you ## Common Homelab Incidents | Incident | Quick fix | |----------|-----------| | OpenBao sealed | `kubectl exec -n openbao openbao-0 -- bao status` — should auto-unseal via OCI KMS; check OCI KMS key status if not | | ArgoCD all apps OutOfSync | Check Forgejo is reachable; check ArgoCD repo credentials | | cert-manager not issuing | Check DNS propagation; check DigitalOcean token; check cert-manager pod logs | | NFS storage unavailable | Check NFS server at 10.0.6.2; check pods in `nfs-provisioner` namespace | | All pods evicted | Node disk pressure — `kubectl describe node <name>`, check disk usage | ## Common Mistakes | Mistake | Reality | |---------|---------| | Diagnosing in silence for 30+ minutes | Communicate first, even with "investigating" | | Fixing before declaring | Declaration triggers backup/support; don't skip it | | Declaring resolved before monitoring | Check error rates and latency, not just pod status | | Investigating before stabilizing | Users are down while you read logs. Roll back first. | | Skipping postmortem on homelab | You will hit this again. Write it down. |