---
name: incident-response
description: Use when responding to production outages, data loss events, security incidents, or major service degradations in homelab (k3s/ansiblestack) or professional (AWS/EKS) environments. Applies at any severity — P1 complete outages to P4 minor issues.
---

# Incident Response

## Overview

Structured response for production incidents. Severity scales the rigor: a homelab P3 does not need the process of a work P1.

**Core principle:** Stabilize user impact FIRST. Understand why SECOND. Never diagnose in silence.

## Severity

| Severity | Definition | Response SLA | Examples |
|----------|------------|--------------|----------|
| P1 | Complete outage OR data loss OR security breach | Immediate (minutes) | Prod DB down, credentials leaked, all users blocked |
| P2 | Major degradation, SLA at risk, significant user impact | Urgent (< 30 min) | 50%+ error rate, primary feature broken |
| P3 | Partial degradation, workaround exists | Same day | One region/service slow, single feature broken |
| P4 | Minor issue, no user impact | Within days | Monitoring gap, cosmetic issue |

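If the severity table is wired into tooling (a chat bot, a ticket template), the mapping can be kept as a small lookup. The following is a minimal sketch; `response_target` is a hypothetical helper name, and the strings are copied from the table above.

```bash
# response_target: hypothetical helper that returns the response target
# for a severity label, using the values from the severity table.
response_target() {
  case "$1" in
    P1) echo "Immediate (minutes)" ;;
    P2) echo "Urgent (< 30 min)" ;;
    P3) echo "Same day" ;;
    P4) echo "Within days" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_target P2
```

An unknown label returns a non-zero status, so a calling script can refuse to proceed without a valid severity.
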
## Phase 1: Triage (first 5-10 minutes)

Goal: confirm the incident, assess severity, start communication.

```
1. CONFIRM — is this actually broken?
   - Check from multiple locations/devices
   - Check AWS Status / DigitalOcean Status / upstream providers
   - Ask: is anyone else seeing this?

2. SCOPE — who/what is affected?
   - Which services? Which regions? Which users?
   - Is data being lost RIGHT NOW?
   - Stable or getting worse?

3. DECLARE — P1/P2: declare immediately, don't wait to diagnose
   - Work: post in incident channel, page on-call, open incident ticket
   - Homelab: create Vikunja task, start BookStack incident page

4. ASSIGN ROLES (work P1/P2)
   - Incident Commander: coordinates, communicates, makes calls
   - Tech Lead: root cause investigation
   - Comms Lead: stakeholder updates
   - (Homelab: you're all three)
```

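The CONFIRM step can be partly scripted. The sketch below assumes the affected services expose HTTP endpoints; `probe_all` and the example URLs are hypothetical. It only tests from the machine it runs on, so run it from more than one host or network to cover the "multiple locations" check.

```bash
# probe_all: hypothetical helper. Probes each URL once with curl and
# returns non-zero if any probe fails (HTTP status >= 400 or no response).
probe_all() {
  failed=0
  for url in "$@"; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=$((failed + 1))
    fi
  done
  echo "$failed of $# probes failed"
  [ "$failed" -eq 0 ]
}

# Example with placeholder endpoints:
# probe_all https://app.example.com/healthz https://api.example.com/healthz
```
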
## Phase 2: Stabilize (before root cause)

Fix user impact first. Common actions:

```bash
# Check rollout history to see which revision you would return to
kubectl rollout history deployment/<name> -n <ns>

# Roll back the last deployment (returns to the previous revision)
kubectl rollout undo deployment/<name> -n <ns>

# Scale up healthy replicas
kubectl scale deployment/<name> --replicas=5 -n <ns>
```

Other mitigations:
- Route traffic away from broken region/AZ
- Disable the broken feature flag
- Restore from backup (data loss)
- Rotate credentials (security incident)

**A rollback that takes 5 minutes beats a fix that takes 2 hours.**

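A rollback is only a mitigation once the previous revision is actually serving. One way to make that explicit is to pair `kubectl rollout undo` with `kubectl rollout status`, which blocks until the rollout completes or times out. This is a sketch; `rollback_and_wait` is a hypothetical wrapper name and the 120-second timeout is an assumed value, not something the runbook prescribes.

```bash
# rollback_and_wait: hypothetical wrapper around two kubectl subcommands.
# Usage: rollback_and_wait <deployment-name> <namespace>
rollback_and_wait() {
  name=$1
  ns=$2
  kubectl rollout undo "deployment/$name" -n "$ns" || return 1
  # Wait for the rolled-back revision to become available, or give up.
  kubectl rollout status "deployment/$name" -n "$ns" --timeout=120s
}
```

A non-zero exit means the rollback did not converge in time, which is the cue to try the next mitigation rather than keep waiting.
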
## Phase 3: Investigate (root cause)

Now that users are unblocked:

```bash
# Recent events
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -30

# Logs (kubectl)
kubectl logs -n <ns> deploy/<name> --since=1h

# Logs (Grafana Loki): LogQL selector, run in Grafana Explore rather than the shell
# {namespace="<ns>"}

# Describe node for resource pressure
kubectl describe node <name>
```

For AWS, check CloudTrail, CloudWatch Logs, ALB access logs, and X-Ray traces.

Check Grafana Mimir for the anomaly timestamp: find the inflection point where the metric first departs from its baseline.

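When a metric has been exported as plain text, the inflection point can be approximated as the first sample that crosses a threshold. The sketch below assumes input lines of the form `HH:MM value`; `first_breach`, the threshold, and the sample data are all hypothetical. It is a simple threshold crossing, not a statistical change-point detector.

```bash
# first_breach: hypothetical helper. Reads "timestamp value" lines on stdin
# and prints the first timestamp whose value exceeds the given limit.
first_breach() {
  awk -v limit="$1" '($2 + 0) > (limit + 0) { print $1; exit }'
}

# Made-up per-minute error rates (percent) for illustration.
first_breach 5 <<'EOF'
14:28 0.4
14:29 0.6
14:30 0.5
14:31 12.8
14:32 41.0
EOF
# prints 14:31
```

Correlate the printed timestamp with deployments, config changes, and events from the commands above.
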
## Phase 4: Resolve

1. Deploy actual fix (not just the stabilization mitigation)
2. Verify service is healthy — not just "pods are running":
   - Check error rates in Grafana
   - Check latency is normal
   - Spot-check actual user flows
3. Monitor 15-30 minutes before declaring resolved

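The monitoring window in step 3 can be backed by a repeated probe so that a single green check is not mistaken for stability. This sketch assumes an HTTP health endpoint; `stable_for` and the example URL are hypothetical. It supplements the Grafana error-rate and latency checks and does not replace them.

```bash
# stable_for: hypothetical helper. Succeeds only if <checks> consecutive
# probes of <url>, spaced <interval> seconds apart, all return HTTP < 400.
stable_for() {
  url=$1
  checks=$2
  interval=$3
  i=1
  while [ "$i" -le "$checks" ]; do
    if ! curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "probe $i/$checks failed; not resolved yet"
      return 1
    fi
    i=$((i + 1))
    if [ "$i" -le "$checks" ]; then
      sleep "$interval"
    fi
  done
  echo "$checks/$checks probes healthy"
}

# Example: 30 probes one minute apart, roughly a 30-minute window.
# stable_for https://app.example.com/healthz 30 60
```
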
## Phase 5: Communicate

**During incident (P1/P2 — every 15-30 minutes):**
```
[14:32 UTC] INCIDENT UPDATE — <service> degradation
Status: Investigating
Impact: <X users/services affected>
Last action: Rolled back deployment v1.2.3
Next update: 14:47 UTC
```

**On resolution:**
```
[15:10 UTC] RESOLVED — <service> is operational
Duration: 38 minutes (14:32–15:10 UTC)
Root cause: <brief description>
Fix applied: <what was done>
Postmortem: <link or "to follow within 48h">
```

**Work P1: never go silent for > 15 minutes. Communicate first, diagnose second.**

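Filling in the update template by hand under pressure invites mistakes in the "Next update" time. A minimal sketch of generating it follows; `add_minutes` and `incident_update` are hypothetical helper names, times are assumed to be `HH:MM` in UTC, and the example values are placeholders.

```bash
# add_minutes HH:MM N: prints the time N minutes later, wrapping at midnight.
add_minutes() {
  h=${1%%:*}
  m=${1##*:}
  # Strip one leading zero so "08" is not read as an invalid octal number.
  h=${h#0}; h=${h:-0}
  m=${m#0}; m=${m:-0}
  total=$(( (h * 60 + m + $2) % 1440 ))
  printf '%02d:%02d\n' $((total / 60)) $((total % 60))
}

# incident_update NOW SERVICE STATUS IMPACT LAST_ACTION CADENCE_MINUTES
incident_update() {
  cat <<EOF
[$1 UTC] INCIDENT UPDATE — $2 degradation
Status: $3
Impact: $4
Last action: $5
Next update: $(add_minutes "$1" "$6") UTC
EOF
}

incident_update 14:32 checkout Investigating "all checkout users" \
  "Rolled back deployment v1.2.3" 15
```

With a 15-minute cadence the example prints `Next update: 14:47 UTC`, matching the template above.
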
## Phase 6: Post-Incident

- Within 24-48h: write postmortem (use `writing-postmortem` skill if available)
- Update runbooks with anything that was missing
- Create Vikunja tasks for action items
- Save incident timeline to BookStack

## Security Incidents: Extra Steps

Order matters — don't skip ahead:

1. **ISOLATE** — kill or network-isolate the compromised resource before investigating
2. **PRESERVE** — snapshot, export logs before destroying anything
3. **ROTATE** — all potentially exposed credentials immediately
4. **NOTIFY** — security team, CISO, legal as appropriate
5. **SCOPE before disclosing** — do not announce publicly until you understand blast radius

GDPR: a personal data breach must be reported to the supervisory authority within 72 hours of becoming aware of it, unless it is unlikely to result in a risk to individuals.

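For a compromised pod, network isolation keeps the evidence intact for the PRESERVE step. One common pattern is to label the pod and apply a NetworkPolicy that selects that label and allows no traffic. This is a sketch under assumptions: `quarantine_pod`, the `quarantine=true` label, and the policy name are hypothetical, and the policy only takes effect if the cluster's CNI enforces NetworkPolicy.

```bash
# quarantine_pod: hypothetical helper.
# Usage: quarantine_pod <pod-name> <namespace>
quarantine_pod() {
  pod=$1
  ns=$2
  kubectl label pod "$pod" -n "$ns" quarantine=true --overwrite || return 1
  # Selecting the pod with both policy types and no rules denies
  # all ingress and egress for it.
  kubectl apply -n "$ns" -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-deny-all
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
EOF
}
```

The pod keeps running, so logs and filesystem state remain available to snapshot before anything is destroyed.
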
## Homelab Specifics

- Create Vikunja task in relevant project when declaring
- Document timeline in BookStack: `Ansiblestack` book → new page `Incident YYYY-MM-DD: <title>`
- No stakeholder comms needed, but still write the postmortem — future-you will thank you

## Common Homelab Incidents

| Incident | Quick fix |
|----------|-----------|
| OpenBao sealed | `kubectl exec -n openbao openbao-0 -- bao status` — should auto-unseal via OCI KMS; check OCI KMS key status if not |
| ArgoCD all apps OutOfSync | Check Forgejo is reachable; check ArgoCD repo credentials |
| cert-manager not issuing | Check DNS propagation; check DigitalOcean token; check cert-manager pod logs |
| NFS storage unavailable | Check NFS server at 10.0.6.2; check pods in `nfs-provisioner` namespace |
| All pods evicted | Node disk pressure — `kubectl describe node <name>`, check disk usage |

## Common Mistakes

| Mistake | Reality |
|---------|---------|
| Diagnosing in silence for 30+ minutes | Communicate first, even with "investigating" |
| Fixing before declaring | Declaration triggers backup/support; don't skip it |
| Declaring resolved before monitoring | Check error rates and latency, not just pod status |
| Investigating before stabilizing | Users are down while you read logs. Roll back first. |
| Skipping postmortem on homelab | You will hit this again. Write it down. |