autojanet/skills/incident-response/SKILL.md
Zoë cc74ad0bd0
fix: use library/ Harbor project, add skills, fix pipeline secrets
- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/
2026-05-30 15:43:14 -07:00

---
name: incident-response
description: Use when responding to production outages, data loss events, security incidents, or major service degradations in homelab (k3s/ansiblestack) or professional (AWS/EKS) environments. Applies at any severity — P1 complete outages to P4 minor issues.
---
# Incident Response
## Overview
Structured response for production incidents. Severity scales the rigor. Homelab P3 is not work P1.
**Core principle:** Stabilize user impact FIRST. Understand why SECOND. Never diagnose in silence.
## Severity
| Severity | Definition | Response SLA | Examples |
|----------|------------|--------------|---------|
| P1 | Complete outage OR data loss OR security breach | Immediate (minutes) | Prod DB down, credentials leaked, all users blocked |
| P2 | Major degradation, SLA at risk, significant user impact | Urgent (< 30 min) | 50%+ error rate, primary feature broken |
| P3 | Partial degradation, workaround exists | Same day | One region/service slow, single feature broken |
| P4 | Minor issue, no user impact | Within days | Monitoring gap, cosmetic issue |
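
The table can be read as a decision order, checked from P1 down. A minimal sketch, assuming the situation is reduced to five inputs; the function name `classify_severity`, the yes/no flags, and treating the table's "50%+ error rate" example as a hard P2 threshold are illustrative assumptions, not part of this skill:

```shell
#!/usr/bin/env bash
# classify_severity OUTAGE DATA_LOSS BREACH ERROR_RATE_PCT USER_IMPACT
# OUTAGE, DATA_LOSS, BREACH and USER_IMPACT are "yes" or "no";
# ERROR_RATE_PCT is an integer percentage (hypothetical input).
classify_severity() {
  local outage=$1 data_loss=$2 breach=$3 error_pct=$4 user_impact=$5
  if [ "$outage" = yes ] || [ "$data_loss" = yes ] || [ "$breach" = yes ]; then
    echo P1   # complete outage, data loss, or security breach
  elif [ "$error_pct" -ge 50 ]; then
    echo P2   # major degradation (threshold taken from the table's example)
  elif [ "$user_impact" = yes ]; then
    echo P3   # partial degradation, workaround exists
  else
    echo P4   # no user impact
  fi
}
```

For example, `classify_severity no no no 60 yes` prints `P2`. When in doubt between two levels, picking the higher one and downgrading later is usually the cheaper mistake.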
## Phase 1: Triage (first 5-10 minutes)
Goal: confirm the incident, assess severity, start communication.
```
1. CONFIRM: is this actually broken?
   - Check from multiple locations/devices
   - Check AWS Status / DigitalOcean Status / upstream providers
   - Ask: is anyone else seeing this?
2. SCOPE: who/what is affected?
   - Which services? Which regions? Which users?
   - Is data being lost RIGHT NOW?
   - Stable or getting worse?
3. DECLARE: for P1/P2, declare immediately, don't wait to diagnose
   - Work: post in incident channel, page on-call, open incident ticket
   - Homelab: create Vikunja task, start BookStack incident page
4. ASSIGN ROLES (work P1/P2)
   - Incident Commander: coordinates, communicates, makes calls
   - Tech Lead: root cause investigation
   - Comms Lead: stakeholder updates
   - (Homelab: you're all three)
```
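
The CONFIRM step can be partly scripted. The sketch below assumes the service exposes an HTTP endpoint; `probe` and `is_healthy_code` are hypothetical helper names, and the URL you pass is your own. It only tests from the machine it runs on, so run it from more than one vantage point (laptop, phone hotspot, a node inside the cluster) before drawing conclusions:

```shell
#!/usr/bin/env bash
# Treat any 2xx or 3xx status as "up". curl reports 000 when no HTTP
# response arrived at all (DNS failure, refused connection, timeout).
is_healthy_code() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# probe URL [ATTEMPTS]: request URL several times and report failures.
probe() {
  local url=$1 attempts=${2:-3} failures=0 code i
  for ((i = 1; i <= attempts; i++)); do
    # -w prints only the status code; the response body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
    is_healthy_code "$code" || failures=$((failures + 1))
  done
  echo "$failures/$attempts probes failed for $url"
  [ "$failures" -eq 0 ]
}
```

Usage would look like `probe https://<your-service>/healthz 5`; the path is a placeholder, not something this skill defines.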
## Phase 2: Stabilize (before root cause)
Fix user impact first. Common actions:
```bash
# Roll back last deployment
kubectl rollout undo deployment/<name> -n <ns>
# Scale up healthy replicas
kubectl scale deploy/<name> --replicas=5 -n <ns>
# Check rollout history
kubectl rollout history deployment/<name> -n <ns>
```
Other mitigations:
- Route traffic away from broken region/AZ
- Disable the broken feature flag
- Restore from backup (data loss)
- Rotate credentials (security incident)
**A rollback that takes 5 minutes beats a fix that takes 2 hours.**
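
A rollback is only a mitigation once the reverted pods are actually serving. As a sketch, the three `kubectl rollout` subcommands can be chained so the rollback is recorded, applied, and waited on; `rollback_and_wait` and the `KUBECTL` override are hypothetical conveniences, not part of kubectl:

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the
# commands without touching a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# rollback_and_wait NAME NAMESPACE [TIMEOUT]
rollback_and_wait() {
  local name=$1 ns=$2 timeout=${3:-120s}
  # Print the revision list first so the rollback target is on record.
  $KUBECTL rollout history "deployment/$name" -n "$ns" || return 1
  # Revert to the previous revision.
  $KUBECTL rollout undo "deployment/$name" -n "$ns" || return 1
  # Block until the rollback completes or the timeout expires.
  $KUBECTL rollout status "deployment/$name" -n "$ns" --timeout="$timeout"
}
```

Running `KUBECTL=echo` before `rollback_and_wait <name> <ns>` prints the commands instead of executing them, which is a cheap sanity check under pressure. The 120s default timeout is an arbitrary assumption; tune it to how long your pods take to become ready.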
## Phase 3: Investigate (root cause)
Now that users are unblocked:
```bash
# Recent events
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -30
# Logs (kubectl)
kubectl logs -n <ns> deploy/<name> --since=1h
# Logs (Grafana Loki): this is a LogQL selector, paste it into Grafana Explore, not the shell
# {namespace="<ns>"}
# Describe node for resource pressure
kubectl describe node <name>
```
For AWS: CloudTrail, CloudWatch Logs, ALB access logs, X-Ray traces.
Check Grafana Mimir for the anomaly timestamp and find the inflection point: the moment the metric first departs from its baseline.
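
If the metric has been exported as text, the inflection point can be located mechanically. This sketch assumes a hypothetical two-column format (`HH:MM value`, one sample per line) and a threshold you choose by eye from the baseline; `first_spike` is an illustrative name, not a Grafana or Mimir feature:

```shell
#!/usr/bin/env bash
# first_spike THRESHOLD < series
# Prints the timestamp of the first sample whose value exceeds THRESHOLD.
# Exits non-zero if no sample crosses it.
first_spike() {
  awk -v t="$1" '
    $2 + 0 > t + 0 { print $1; found = 1; exit }
    END { exit !found }
  '
}
```

Fed the samples `14:28 2`, `14:30 3`, `14:32 41`, `14:34 55` with a threshold of 10, it prints `14:32`. Compare that timestamp against deploy times and `kubectl get events` output to narrow down the trigger.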
## Phase 4: Resolve
1. Deploy actual fix (not just the stabilization mitigation)
2. Verify the service is healthy, not just "pods are running":
   - Check error rates in Grafana
   - Check latency is normal
   - Spot-check actual user flows
3. Monitor 15-30 minutes before declaring resolved
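
The monitoring window in step 3 can be enforced with a loop that only reports success if every check in the window passes. A minimal sketch; `monitor_stable` is a hypothetical helper, and the health check command is whatever you pass in:

```shell
#!/usr/bin/env bash
# monitor_stable CHECKS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND CHECKS times, INTERVAL_SECONDS apart. Fails on the first
# failing check, so "resolved" means the whole window was clean.
monitor_stable() {
  local checks=$1 interval=$2 i
  shift 2
  for ((i = 1; i <= checks; i++)); do
    if ! "$@"; then
      echo "check $i/$checks failed: not resolved yet"
      return 1
    fi
    # No need to sleep after the final check.
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
  done
  echo "all $checks checks passed"
}
```

For example, `monitor_stable 16 60 curl -fsS --max-time 5 -o /dev/null https://<your-service>/healthz` spans 15 minutes; the URL is a placeholder. A passing health endpoint does not replace checking error rates and latency in Grafana, it only guards against declaring too early.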
## Phase 5: Communicate
**During incident (P1/P2 — every 15-30 minutes):**
```
[14:32 UTC] INCIDENT UPDATE — <service> degradation
Status: Investigating
Impact: <X users/services affected>
Last action: Rolled back deployment v1.2.3
Next update: 14:47 UTC
```
**On resolution:**
```
[15:10 UTC] RESOLVED — <service> is operational
Duration: 38 minutes (14:32-15:10 UTC)
Root cause: <brief description>
Fix applied: <what was done>
Postmortem: <link or "to follow within 48h">
```
**Work P1: never go silent for > 15 minutes. Communicate first, diagnose second.**
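
The "Next update" and "Duration" fields in the templates above are simple clock arithmetic, which is easy to get wrong mid-incident. A sketch in pure bash, assuming `HH:MM` 24-hour UTC times; all function names are hypothetical:

```shell
#!/usr/bin/env bash
# Convert HH:MM to minutes since midnight. The 10# prefix forces base 10
# so values like "08" and "09" are not parsed as invalid octal.
to_minutes() {
  local h=${1%%:*} m=${1##*:}
  echo $((10#$h * 60 + 10#$m))
}

# Convert minutes back to HH:MM, wrapping past midnight.
from_minutes() {
  printf '%02d:%02d\n' $((($1 / 60) % 24)) $(($1 % 60))
}

# next_update NOW INTERVAL_MINUTES: when the next update is due.
next_update() {
  from_minutes $(($(to_minutes "$1") + $2))
}

# duration_minutes START END: incident length, allowing for midnight.
duration_minutes() {
  local d=$(($(to_minutes "$2") - $(to_minutes "$1")))
  if [ "$d" -lt 0 ]; then
    d=$((d + 1440))
  fi
  echo "$d"
}
```

With the times from the templates, `next_update 14:32 15` prints `14:47` and `duration_minutes 14:32 15:10` prints `38`.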
## Phase 6: Post-Incident
- Within 24-48h: write postmortem (use `writing-postmortem` skill if available)
- Update runbooks with anything that was missing
- Create Vikunja tasks for action items
- Save incident timeline to BookStack
## Security Incidents: Extra Steps
Order matters; don't skip ahead:
1. **ISOLATE**: kill or network-isolate the compromised resource before investigating
2. **PRESERVE**: snapshot and export logs before destroying anything
3. **ROTATE**: rotate all potentially exposed credentials immediately
4. **NOTIFY**: inform the security team, CISO, and legal as appropriate
5. **SCOPE before disclosing**: do not announce publicly until you understand the blast radius
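GDPR: a personal data breach must be reported to the supervisory authority within 72 hours of becoming aware of it, unless the breach is unlikely to result in a risk to individuals.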
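
One way to network-isolate a pod without deleting it (which would destroy evidence) is a deny-all NetworkPolicy selected by label. This is a sketch under assumptions: the policy name `quarantine`, the label `quarantine: "true"`, and the function name are all hypothetical, and it only has an effect if the cluster's network plugin enforces NetworkPolicy:

```shell
#!/usr/bin/env bash
# quarantine_policy NAMESPACE
# Emits a NetworkPolicy that selects pods labelled quarantine=true and,
# by listing both policy types with no allow rules, blocks all ingress
# and egress for them.
quarantine_policy() {
  local ns=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: $ns
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
EOF
}
```

Applying it would be `quarantine_policy <ns> | kubectl apply -f -`, followed by `kubectl label pod <pod> -n <ns> quarantine=true` on the compromised pod. The pod keeps running for the PRESERVE step but can no longer talk to anything.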
## Homelab Specifics
- Create Vikunja task in relevant project when declaring
- Document timeline in BookStack: `Ansiblestack` book → new page `Incident YYYY-MM-DD: <title>`
- No stakeholder comms needed, but still write the postmortem; future-you will thank you
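
To keep page titles and timeline entries consistent, the naming convention above can be generated rather than typed. A small sketch; both function names are hypothetical, and the optional second argument exists only so a date or time can be supplied after the fact:

```shell
#!/usr/bin/env bash
# incident_title TITLE [YYYY-MM-DD]: defaults to today's UTC date.
incident_title() {
  printf 'Incident %s: %s\n' "${2:-$(date -u +%F)}" "$1"
}

# timeline_entry MESSAGE [HH:MM]: defaults to the current UTC time.
timeline_entry() {
  printf '[%s UTC] %s\n' "${2:-$(date -u +%H:%M)}" "$1"
}
```

Appending `timeline_entry "Rolled back deployment" >> timeline.txt` as you go gives you a ready-made timeline to paste into the BookStack page; the file name is just an example.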
## Common Homelab Incidents
| Incident | Quick fix |
|----------|-----------|
| OpenBao sealed | Run `kubectl exec -n openbao openbao-0 -- bao status`. It should auto-unseal via OCI KMS; if not, check the OCI KMS key status |
| ArgoCD all apps OutOfSync | Check Forgejo is reachable; check ArgoCD repo credentials |
| cert-manager not issuing | Check DNS propagation; check DigitalOcean token; check cert-manager pod logs |
| NFS storage unavailable | Check NFS server at 10.0.6.2; check pods in `nfs-provisioner` namespace |
| All pods evicted | Node disk pressure: run `kubectl describe node <name>` and check disk usage |
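
For the eviction case, checking every node at once is faster than describing them one by one. A sketch that reads the `DiskPressure` condition from each node; the two function names are hypothetical, and the split into a fetch step and a filter step is only there so the filter can be tried on sample input:

```shell
#!/usr/bin/env bash
# Print "<node-name><TAB><DiskPressure status>" for every node.
node_disk_pressure() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
}

# Keep only the names of nodes whose DiskPressure condition is True.
pressured_nodes() {
  awk -F'\t' '$2 == "True" { print $1 }'
}
```

`node_disk_pressure | pressured_nodes` lists the affected nodes; follow up with `kubectl describe node <name>` on each and free disk space before pods will schedule again.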
## Common Mistakes
| Mistake | Reality |
|---------|---------|
| Diagnosing in silence for 30+ minutes | Communicate first, even with "investigating" |
| Fixing before declaring | Declaration triggers backup/support; don't skip it |
| Declaring resolved before monitoring | Check error rates and latency, not just pod status |
| Investigating before stabilizing | Users are down while you read logs. Roll back first. |
| Skipping postmortem on homelab | You will hit this again. Write it down. |