autojanet/skills/incident-response/SKILL.md

---
name: incident-response
description: Use when responding to production outages, data loss events, security incidents, or major service degradations in homelab (k3s/ansiblestack) or professional (AWS/EKS) environments. Applies at any severity — P1 complete outages to P4 minor issues.
---

# Incident Response

## Overview

Structured response for production incidents. Rigor scales with severity: a homelab P3 does not get the process of a work P1.

**Core principle:** Stabilize user impact FIRST. Understand why SECOND. Never diagnose in silence.

## Severity

| Severity | Definition | Response SLA | Examples |
|----------|------------|--------------|---------|
| P1 | Complete outage OR data loss OR security breach | Immediate (minutes) | Prod DB down, credentials leaked, all users blocked |
| P2 | Major degradation, SLA at risk, significant user impact | Urgent (< 30 min) | 50%+ error rate, primary feature broken |
| P3 | Partial degradation, workaround exists | Same day | One region/service slow, single feature broken |
| P4 | Minor issue, no user impact | Within days | Monitoring gap, cosmetic issue |
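
The table can be read as a short decision chain, checked from the top down. The sketch below is one way to encode it; `classify_severity` and `response_sla` are hypothetical helper names, and the three yes/no questions are an assumed simplification of the definitions above.

```bash
# Map three triage answers (each "yes" or "no") onto the severity table.
#   $1  complete outage, data loss, or security breach?
#   $2  major degradation or SLA at risk?
#   $3  any user impact at all?
classify_severity() {
  if [ "$1" = "yes" ]; then
    echo "P1"
  elif [ "$2" = "yes" ]; then
    echo "P2"
  elif [ "$3" = "yes" ]; then
    echo "P3"
  else
    echo "P4"
  fi
}

# Response SLA per severity, copied from the table.
response_sla() {
  case "$1" in
    P1) echo "Immediate (minutes)" ;;
    P2) echo "Urgent (< 30 min)" ;;
    P3) echo "Same day" ;;
    P4) echo "Within days" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_sla "$(classify_severity no yes yes)"` prints the P2 SLA. When in doubt between two levels, the higher one is the safer declaration.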

## Phase 1: Triage (first 5-10 minutes)

Goal: confirm the incident, assess severity, start communication.

```
1. CONFIRM — is this actually broken?
   - Check from multiple locations/devices
   - Check AWS Status / DigitalOcean Status / upstream providers
   - Ask: is anyone else seeing this?

2. SCOPE — who/what is affected?
   - Which services? Which regions? Which users?
   - Is data being lost RIGHT NOW?
   - Stable or getting worse?

3. DECLARE — P1/P2: declare immediately, don't wait to diagnose
   - Work: post in incident channel, page on-call, open incident ticket
   - Homelab: create Vikunja task, start BookStack incident page

4. ASSIGN ROLES (work P1/P2)
   - Incident Commander: coordinates, communicates, makes calls
   - Tech Lead: root cause investigation
   - Comms Lead: stakeholder updates
   - (Homelab: you're all three)
```
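
The CONFIRM step can be partly scripted. This is a minimal sketch, assuming the service exposes an HTTP endpoint; `probe_status` and `is_healthy` are hypothetical helpers, and treating 2xx/3xx as healthy is an assumption you may need to tighten.

```bash
# Print the HTTP status code for a URL ("000" if the connection failed).
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Succeed only for 2xx and 3xx status codes.
is_healthy() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}
```

A typical call would be `is_healthy "$(probe_status <url>)" || echo "broken from this host"`. Run it from a second network (a phone hotspot, a cloud shell) before concluding that the problem is not local to you.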

## Phase 2: Stabilize (before root cause)

Fix user impact first. Common actions:

```bash
# Check rollout history first to see which revision you would return to
kubectl rollout history deployment/<name> -n <ns>

# Roll back to the previous revision
kubectl rollout undo deployment/<name> -n <ns>

# Scale up healthy replicas (adjust the count to available capacity)
kubectl scale deploy/<name> --replicas=5 -n <ns>
```
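
A rollback is only a mitigation once it has actually completed. The sketch below wraps the undo in a wait; `rollback_and_wait` is a hypothetical wrapper and the 120-second timeout is an arbitrary assumption.

```bash
# Usage: rollback_and_wait <name> <ns> [revision]
rollback_and_wait() {
  local name="$1" ns="$2" rev="${3:-}"
  if [ -n "$rev" ]; then
    # Return to a specific revision listed by `kubectl rollout history`
    kubectl rollout undo "deployment/$name" -n "$ns" --to-revision="$rev" || return 1
  else
    # No revision given: return to the previous one
    kubectl rollout undo "deployment/$name" -n "$ns" || return 1
  fi
  # Exits non-zero if the rollout has not completed within the timeout
  kubectl rollout status "deployment/$name" -n "$ns" --timeout=120s
}
```

If the wait times out, the rollback itself is stuck and the next mitigation (traffic shift, feature flag) should not be delayed.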

Other mitigations:
- Route traffic away from broken region/AZ
- Disable the broken feature flag
- Restore from backup (data loss)
- Rotate credentials (security incident)

**A rollback that takes 5 minutes beats a fix that takes 2 hours.**

## Phase 3: Investigate (root cause)

Now that users are unblocked:

```bash
# Recent events
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -30

# Logs (kubectl)
kubectl logs -n <ns> deploy/<name> --since=1h

# Logs (Grafana Loki): this is a LogQL selector for Grafana Explore, not a shell command
# {namespace="<ns>"}

# Describe node for resource pressure
kubectl describe node <name>
```
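
Two further queries often narrow things down quickly: which pods are not running, and what a crashed container logged before it restarted. A sketch, with `triage_namespace` and `previous_logs` as hypothetical wrapper names:

```bash
# Usage: triage_namespace <ns>
triage_namespace() {
  local ns="$1"
  # Pods in any phase other than Running (Pending, Failed, Succeeded, Unknown)
  kubectl get pods -n "$ns" --field-selector=status.phase!=Running
  # Warning events only, oldest first
  kubectl get events -n "$ns" --field-selector=type=Warning --sort-by='.lastTimestamp'
}

# Usage: previous_logs <ns> <pod>
# Logs from the previous instance of the container, i.e. the one that crashed
previous_logs() {
  kubectl logs -n "$1" "$2" --previous --tail=100
}
```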

For AWS: CloudTrail, CloudWatch Logs, ALB access logs, X-Ray traces.

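Check the metrics in Grafana Mimir to find the inflection point: the timestamp where error rate or latency first departs from baseline. Compare that timestamp against recent deploys and config changes.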
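
If you export the metric as plain `timestamp value` lines, a crude approximation of the inflection point is the first sample above a threshold. This is only a sketch: `first_breach` is a hypothetical helper, and the input format and the threshold are assumptions.

```bash
# Usage: first_breach <threshold> < samples.txt
# Reads "timestamp value" pairs on stdin and prints the first timestamp
# whose value exceeds the threshold (prints nothing if none does).
first_breach() {
  awk -v limit="$1" '$2 + 0 > limit + 0 { print $1; exit }'
}
```

A fixed threshold misses slow drifts, so treat the result as a starting point for reading the graph rather than a verdict.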

## Phase 4: Resolve

1. Deploy the actual fix (not just the stabilization mitigation)
2. Verify service is healthy — not just "pods are running":
   - Check error rates in Grafana
   - Check latency is normal
   - Spot-check actual user flows
3. Monitor 15-30 minutes before declaring resolved
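
The monitoring window in step 3 can be sketched as a loop that repeats a health check and fails on the first bad result. `monitor_stable` is a hypothetical helper; which command counts as a meaningful health check is left to you.

```bash
# Usage: monitor_stable <checks> <interval_seconds> <command...>
# Runs the command <checks> times with a pause between runs.
# Returns non-zero as soon as one run fails.
monitor_stable() {
  local checks="$1" interval="$2" i=1
  shift 2
  while [ "$i" -le "$checks" ]; do
    if ! "$@" >/dev/null 2>&1; then
      echo "check $i/$checks failed" >&2
      return 1
    fi
    i=$((i + 1))
    # Skip the pause after the final check
    [ "$i" -le "$checks" ] && sleep "$interval"
  done
  echo "stable for $checks consecutive checks"
}
```

For example, `monitor_stable 30 60 curl -fsS <health-url>` probes once a minute for roughly half an hour. A passing loop complements the error-rate and latency checks above; it does not replace them.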

## Phase 5: Communicate

**During incident (P1/P2: every 15-30 minutes; work P1: at least every 15 minutes):**
```
[14:32 UTC] INCIDENT UPDATE — <service> degradation
Status: Investigating
Impact: <X users/services affected>
Last action: Rolled back deployment v1.2.3
Next update: 14:47 UTC
```
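
Committing to a concrete "Next update" time is easier if it is computed rather than guessed. A minimal sketch, assuming times are written as `HH:MM` in UTC; `next_update` is a hypothetical helper.

```bash
# Usage: next_update <HH:MM> <cadence_minutes>
# Prints the time of the next update, wrapping past midnight.
next_update() {
  local h="${1%%:*}" m="${1##*:}"
  # Strip one leading zero so values like "08" are not parsed as octal
  h="${h#0}"; m="${m#0}"
  local total=$(( (h * 60 + m + $2) % 1440 ))
  printf '%02d:%02d\n' "$((total / 60))" "$((total % 60))"
}
```

With the example above, `next_update 14:32 15` yields the `14:47` shown in the template.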

**On resolution:**
```
[15:10 UTC] RESOLVED — <service> is operational
Duration: 38 minutes (14:32–15:10 UTC)
Root cause: <brief description>
Fix applied: <what was done>
Postmortem: <link or "to follow within 48h">
```
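
The duration line is plain clock arithmetic, sketched here so it does not have to be done by hand at the end of a long incident. `minutes_between` is a hypothetical helper and assumes the incident lasted less than 24 hours.

```bash
# Usage: minutes_between <start HH:MM> <end HH:MM>
# Prints elapsed minutes; an end time earlier than the start is taken
# to be on the following day.
minutes_between() {
  local start_h="${1%%:*}" start_m="${1##*:}" end_h="${2%%:*}" end_m="${2##*:}"
  # Strip one leading zero so values like "08" are not parsed as octal
  start_h="${start_h#0}"; start_m="${start_m#0}"
  end_h="${end_h#0}"; end_m="${end_m#0}"
  echo $(( (end_h * 60 + end_m - start_h * 60 - start_m + 1440) % 1440 ))
}
```

For the template above, `minutes_between 14:32 15:10` gives the 38 minutes reported.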

**Work P1: never go silent for > 15 minutes. Communicate first, diagnose second.**

## Phase 6: Post-Incident

- Within 24-48h: write postmortem (use `writing-postmortem` skill if available)
- Update runbooks with anything that was missing
- Create Vikunja tasks for action items
- Save incident timeline to BookStack
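
The timeline is far easier to save if it was recorded as the incident unfolded. One low-tech sketch is to append timestamped entries to a local file and paste it into the postmortem afterwards; `log_event` and the `TIMELINE_FILE` default are hypothetical.

```bash
# Assumed default location; override by exporting TIMELINE_FILE.
TIMELINE_FILE="${TIMELINE_FILE:-./incident-timeline.md}"

# Usage: log_event <free text...>
# Appends one list item stamped with the current UTC time.
log_event() {
  printf '%s\n' "- [$(date -u +%H:%M) UTC] $*" >> "$TIMELINE_FILE"
}
```

Calls such as `log_event "Rolled back deployment"` and `log_event "Error rate back to baseline"` during the incident produce a ready-made timeline.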

## Security Incidents: Extra Steps

Order matters — don't skip ahead:

1. **ISOLATE**: network-isolate the compromised resource before investigating; kill it only if isolation is not possible, because termination destroys in-memory evidence
2. **PRESERVE**: snapshot disks and export logs before destroying anything
3. **ROTATE**: rotate all potentially exposed credentials immediately
4. **NOTIFY**: inform the security team, CISO, and legal as appropriate
5. **SCOPE before disclosing**: do not announce publicly until you understand the blast radius

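GDPR: a personal data breach must be reported to the supervisory authority within 72 hours of becoming aware of it, unless it is unlikely to result in a risk to individuals (Article 33).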
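
For a compromised pod, the first two steps might look like the sketch below. It assumes the cluster's CNI enforces NetworkPolicy (if it does not, the policy is silently ignored); `quarantine_pod`, the `quarantine=true` label, and the evidence file names are hypothetical.

```bash
# Usage: quarantine_pod <ns> <pod>
quarantine_pod() {
  local ns="$1" pod="$2"
  # ISOLATE: label the pod, then deny all ingress and egress for that label
  kubectl label pod "$pod" -n "$ns" quarantine=true --overwrite || return 1
  kubectl apply -n "$ns" -f - <<'EOF' || return 1
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-deny-all
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
EOF
  # PRESERVE: save the pod spec and logs before anything is deleted
  kubectl get pod "$pod" -n "$ns" -o yaml > "evidence-$pod.yaml"
  kubectl logs "$pod" -n "$ns" --all-containers > "evidence-$pod.log"
}
```

The policy lists both policy types with no allow rules, which blocks all traffic to and from the selected pod while leaving it running for inspection.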

## Homelab Specifics

- Create Vikunja task in relevant project when declaring
- Document timeline in BookStack: `Ansiblestack` book → new page `Incident YYYY-MM-DD: <title>`
- No stakeholder comms needed, but still write the postmortem — future-you will thank you

## Common Homelab Incidents

| Incident | Quick fix |
|----------|-----------|
| OpenBao sealed | `kubectl exec -n openbao openbao-0 -- bao status` — should auto-unseal via OCI KMS; check OCI KMS key status if not |
| ArgoCD all apps OutOfSync | Check Forgejo is reachable; check ArgoCD repo credentials |
| cert-manager not issuing | Check DNS propagation; check DigitalOcean token; check cert-manager pod logs |
| NFS storage unavailable | Check NFS server at 10.0.6.2; check pods in `nfs-provisioner` namespace |
| All pods evicted | Node disk pressure — `kubectl describe node <name>`, check disk usage |
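
For the eviction case, it helps to see at a glance which nodes report disk pressure and which pods were left behind. A sketch with hypothetical helper names:

```bash
# Print the names of nodes whose DiskPressure condition is True.
nodes_with_disk_pressure() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' \
    | awk '$2 == "True" { print $1 }'
}

# Evicted pods remain as objects in the Failed phase; list them across namespaces.
list_failed_pods() {
  kubectl get pods -A --field-selector=status.phase=Failed
}
```

Once disk space has been freed, `kubectl delete pods -A --field-selector=status.phase=Failed` removes the leftover pod objects. Review the list first, since failed Job pods match the same selector.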

## Common Mistakes

| Mistake | Reality |
|---------|---------|
| Diagnosing in silence for 30+ minutes | Communicate first, even with "investigating" |
| Fixing before declaring | Declaration triggers backup/support; don't skip it |
| Declaring resolved before monitoring | Check error rates and latency, not just pod status |
| Investigating before stabilizing | Users are down while you read logs. Roll back first. |
| Skipping postmortem on homelab | You will hit this again. Write it down. |