| name | description |
|---|---|
| incident-response | Use when responding to production outages, data loss events, security incidents, or major service degradations in homelab (k3s/ansiblestack) or professional (AWS/EKS) environments. Applies at any severity — P1 complete outages to P4 minor issues. |
Incident Response
Overview
Structured response for production incidents. Severity scales the rigor. Homelab P3 is not work P1.
Core principle: Stabilize user impact FIRST. Understand why SECOND. Never diagnose in silence.
Severity
| Severity | Definition | Response SLA | Examples |
|---|---|---|---|
| P1 | Complete outage OR data loss OR security breach | Immediate (minutes) | Prod DB down, credentials leaked, all users blocked |
| P2 | Major degradation, SLA at risk, significant user impact | Urgent (< 30 min) | 50%+ error rate, primary feature broken |
| P3 | Partial degradation, workaround exists | Same day | One region/service slow, single feature broken |
| P4 | Minor issue, no user impact | Within days | Monitoring gap, cosmetic issue |
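The table can be read as a small decision rule: any data loss or security breach is P1 regardless of how much of the service is still up. A minimal sketch of that rule follows. The function name `classify_severity` and the impact keywords (`outage`, `major`, `partial`, `minor`) are hypothetical labels chosen for illustration, not part of any existing tooling.

```shell
# Hypothetical helper: map triage answers to a severity from the table above.
# Usage: classify_severity <outage|major|partial|minor> [data_loss: yes|no] [breach: yes|no]
classify_severity() {
  local impact="$1" data_loss="${2:-no}" breach="${3:-no}"

  # Complete outage OR data loss OR security breach is always P1.
  if [ "$impact" = "outage" ] || [ "$data_loss" = "yes" ] || [ "$breach" = "yes" ]; then
    echo "P1"
    return 0
  fi

  case "$impact" in
    major)   echo "P2" ;;  # major degradation, SLA at risk
    partial) echo "P3" ;;  # partial degradation, workaround exists
    minor)   echo "P4" ;;  # no user impact
    *)       echo "unknown impact: $impact" >&2; return 1 ;;
  esac
}
```

For example, `classify_severity partial no yes` prints `P1`, because a breach overrides the otherwise P3-level impact.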
Phase 1: Triage (first 5-10 minutes)
Goal: confirm the incident, assess severity, start communication.
1. CONFIRM — is this actually broken?
- Check from multiple locations/devices
- Check AWS Status / DigitalOcean Status / upstream providers
- Ask: is anyone else seeing this?
2. SCOPE — who/what is affected?
- Which services? Which regions? Which users?
- Is data being lost RIGHT NOW?
- Stable or getting worse?
3. DECLARE — P1/P2: declare immediately, don't wait to diagnose
- Work: post in incident channel, page on-call, open incident ticket
- Homelab: create Vikunja task, start BookStack incident page
4. ASSIGN ROLES (work P1/P2)
- Incident Commander: coordinates, communicates, makes calls
- Tech Lead: root cause investigation
- Comms Lead: stakeholder updates
- (Homelab: you're all three)
Phase 2: Stabilize (before root cause)
Fix user impact first. Common actions:
# Roll back last deployment
kubectl rollout undo deployment/<name> -n <ns>
# Scale up healthy replicas
kubectl scale deploy/<name> --replicas=5 -n <ns>
# Check rollout history
kubectl rollout history deployment/<name> -n <ns>
Other mitigations:
- Route traffic away from broken region/AZ
- Disable the broken feature flag
- Restore from backup (data loss)
- Rotate credentials (security incident)
A rollback that takes 5 minutes beats a fix that takes 2 hours.
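A rollback is only a mitigation once it has actually finished rolling out. As a sketch, the undo can be paired with `kubectl rollout status`, which blocks until the rollout completes or the timeout expires. The wrapper name `rollback_and_wait` and the 120s default are assumptions for illustration.

```shell
# Hypothetical wrapper: roll back a Deployment, then wait for the rollback to finish.
# Usage: rollback_and_wait <name> <ns> [timeout, e.g. 120s]
rollback_and_wait() {
  local name="$1" ns="$2" timeout="${3:-120s}"

  # Revert to the previous revision; stop here if the undo itself fails.
  kubectl rollout undo "deployment/$name" -n "$ns" || return 1

  # Block until the rollout completes or the timeout expires.
  kubectl rollout status "deployment/$name" -n "$ns" --timeout="$timeout"
}
```

A non-zero exit from the second command means the previous revision did not become ready in time, which is a signal to try another mitigation rather than keep waiting.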
Phase 3: Investigate (root cause)
Now that users are unblocked:
# Recent events
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -30
# Logs (kubectl)
kubectl logs -n <ns> deploy/<name> --since=1h
# Logs (Grafana Loki)
{namespace="<ns>"}
# Describe node for resource pressure
kubectl describe node <name>
For AWS: CloudTrail, CloudWatch Logs, ALB access logs, X-Ray traces.
Check Grafana Mimir for the anomaly timestamp — find the inflection point.
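When dashboards are unavailable, the inflection point can be approximated from logs alone. The sketch below buckets matching log lines by minute. It assumes each line starts with an RFC 3339 timestamp, as `kubectl logs --timestamps` produces. The function name `errors_per_minute` and the default pattern are hypothetical.

```shell
# Hypothetical helper: count matching log lines per minute.
# Reads log lines on stdin; each line must start with an RFC 3339 timestamp.
# Usage: errors_per_minute [extended-regex]
errors_per_minute() {
  local pattern="${1:-error|panic|fatal}"

  # Keep matching lines, truncate to YYYY-MM-DDTHH:MM, then count per minute.
  grep -iE "$pattern" \
    | cut -c1-16 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}
```

Typical use would be `kubectl logs -n <ns> deploy/<name> --since=1h --timestamps | errors_per_minute`. The first minute where the count jumps is a reasonable starting timestamp to compare against deploys and events.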
Phase 4: Resolve
- Deploy actual fix (not just the stabilization mitigation)
- Verify service is healthy — not just "pods are running":
  - Check error rates in Grafana
  - Check latency is normal
  - Spot-check actual user flows
- Monitor 15-30 minutes before declaring resolved
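The monitoring window can be made explicit as "N consecutive passing checks", where any failure resets the count. This is a rough sketch, and the function name `wait_healthy` and its argument order are assumptions. The health check itself is whatever command reflects real user impact in your environment.

```shell
# Hypothetical helper: succeed only after N consecutive passing health checks.
# Any failed check resets the streak. Gives up after <max_attempts> checks.
# Usage: wait_healthy <needed> <max_attempts> <sleep_seconds> <command...>
wait_healthy() {
  local needed="$1" max="$2" pause="$3"
  shift 3
  local streak=0 attempt=0

  while [ "$attempt" -lt "$max" ]; do
    attempt=$((attempt + 1))
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      # Enough passes in a row: safe to declare resolved.
      [ "$streak" -ge "$needed" ] && return 0
    else
      streak=0
    fi
    sleep "$pause"
  done
  return 1
}
```

For instance, `wait_healthy 15 60 60 curl -fsS https://<service>/healthz` would require 15 clean minutes within an hour. The `/healthz` path is a placeholder, not a known endpoint.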
Phase 5: Communicate
During incident (P1/P2 — every 15-30 minutes):
[14:32 UTC] INCIDENT UPDATE — <service> degradation
Status: Investigating
Impact: <X users/services affected>
Last action: Rolled back deployment v1.2.3
Next update: 14:47 UTC
On resolution:
[15:10 UTC] RESOLVED — <service> is operational
Duration: 38 minutes (14:32–15:10 UTC)
Root cause: <brief description>
Fix applied: <what was done>
Postmortem: <link or "to follow within 48h">
Work P1: never go silent for > 15 minutes. Communicate first, diagnose second.
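Committing to a concrete "Next update" time is easier when it is computed rather than guessed. The sketch below fills in the update template from above. The function names `add_minutes` and `incident_update` are hypothetical, and the status line is fixed to "Investigating" to keep the example short.

```shell
# Hypothetical helper: add minutes to an HH:MM clock time, wrapping at midnight.
# Usage: add_minutes <HH:MM> <minutes>
add_minutes() {
  local hh="${1%%:*}" mm="${1##*:}"
  # Strip one leading zero so values like "08" are not read as invalid octal.
  hh="${hh#0}"
  mm="${mm#0}"
  local total=$(( (hh * 60 + mm + $2) % 1440 ))
  printf '%02d:%02d\n' $((total / 60)) $((total % 60))
}

# Hypothetical helper: print a status update with the next update time filled in.
# Usage: incident_update <HH:MM> <interval_minutes> <service> <impact> <last_action>
incident_update() {
  local now="$1" interval="$2" service="$3" impact="$4" action="$5"
  printf '[%s UTC] INCIDENT UPDATE - %s degradation\n' "$now" "$service"
  printf 'Status: Investigating\n'
  printf 'Impact: %s\n' "$impact"
  printf 'Last action: %s\n' "$action"
  printf 'Next update: %s UTC\n' "$(add_minutes "$now" "$interval")"
}
```

Calling `incident_update "$(date -u +%H:%M)" 15 <service> "<impact>" "<last action>"` uses the current UTC time as the starting point.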
Phase 6: Post-Incident
- Within 24-48h: write postmortem (use writing-postmortem skill if available)
- Update runbooks with anything that was missing
- Create Vikunja tasks for action items
- Save incident timeline to BookStack
Security Incidents: Extra Steps
Order matters — don't skip ahead:
- ISOLATE — kill or network-isolate the compromised resource before investigating
- PRESERVE — snapshot, export logs before destroying anything
- ROTATE — all potentially exposed credentials immediately
- NOTIFY — security team, CISO, legal as appropriate
- SCOPE before disclosing — do not announce publicly until you understand blast radius
GDPR: data breaches require regulatory notification within 72 hours.
Homelab Specifics
- Create Vikunja task in relevant project when declaring
- Document timeline in BookStack: Ansiblestack book → new page Incident YYYY-MM-DD: <title>
- No stakeholder comms needed, but still write the postmortem; future-you will thank you
Common Homelab Incidents
| Incident | Quick fix |
|---|---|
| OpenBao sealed | kubectl exec -n openbao openbao-0 -- bao status — should auto-unseal via OCI KMS; check OCI KMS key status if not |
| ArgoCD all apps OutOfSync | Check Forgejo is reachable; check ArgoCD repo credentials |
| cert-manager not issuing | Check DNS propagation; check DigitalOcean token; check cert-manager pod logs |
| NFS storage unavailable | Check NFS server at 10.0.6.2; check pods in nfs-provisioner namespace |
| All pods evicted | Node disk pressure — kubectl describe node <name>, check disk usage |
Common Mistakes
| Mistake | Reality |
|---|---|
| Diagnosing in silence for 30+ minutes | Communicate first, even with "investigating" |
| Fixing before declaring | Declaration triggers backup/support; don't skip it |
| Declaring resolved before monitoring | Check error rates and latency, not just pod status |
| Investigating before stabilizing | Users are down while you read logs. Roll back first. |
| Skipping postmortem on homelab | You will hit this again. Write it down. |