--- name: writing-postmortem description: Use when writing a postmortem, incident review, or after-action report for any outage, degradation, security event, or significant failure — whether homelab or professional context. --- # Skill: Writing a Postmortem ## Overview A postmortem is a blameless structured document that captures what happened, why it happened, what the impact was, and how to prevent recurrence. It is a learning tool and a commitment to improvement — not a blame assignment document. **Core principle:** Humans make mistakes. The question is: what system conditions made the mistake harmful? ## Workflow 1. **Gather the timeline** — check logs, chat history, alerts, kubectl events, monitoring dashboards 2. **Apply 5-whys to find root cause** — keep asking "why?" until you hit a system or process gap, not a person 3. **Write draft** — start with timeline, work backwards to root cause 4. **Review** — have someone else read it in work contexts; fresh eyes catch gaps 5. **Finalize action items** — share with team before locking 6. **Save to BookStack:** - Homelab: Create page in Ansiblestack book (ID 79), named `Postmortem YYYY-MM-DD: ` - Work: Create page in relevant project book or a dedicated "Postmortems" chapter ## Severity Guide | Severity | Criteria | Response | |----------|----------|----------| | P1 | Complete outage, data loss, security breach | All hands, immediate | | P2 | Major degradation, SLA at risk, significant user impact | Urgent | | P3 | Partial degradation, workaround available, limited impact | Fix within hours | | P4 | Minor issue, no user impact, caught in monitoring | Fix within days | ## Document Template ```markdown # Postmortem: [Incident Title] **Date:** YYYY-MM-DD **Duration:** HH:MM (detection to resolution) **Severity:** P1 / P2 / P3 / P4 **Status:** Resolved / Ongoing **Author(s):** [names] **Reviewers:** [names — work context] ## Executive Summary [2-3 sentences. What broke, how long, what was the user/business impact. Written for non-technical stakeholders.] ## Timeline | Time (UTC) | Event | |------------|-------| | HH:MM | First symptom / alert fired | | HH:MM | Detection / someone noticed | | HH:MM | Investigation started | | HH:MM | Root cause identified | | HH:MM | Mitigation applied | | HH:MM | Service restored | | HH:MM | Full resolution / monitoring confirmed stable | ## Impact - **Services affected:** [list] - **Users affected:** [number, "all users", or "internal only"] - **Data loss:** [yes/no — if yes, describe] - **Revenue/SLA impact:** [if applicable] ## Root Cause [1-2 paragraphs. The actual technical cause. Not "human error" — go deeper: what condition made the error possible?] ## Contributing Factors - [e.g., no alerting on X metric] - [e.g., deploy process lacked verification step] - [e.g., documentation was out of date] ## What Went Well - [e.g., monitoring caught it within 5 minutes] - [e.g., rollback procedure worked as expected] ## What Went Poorly - [e.g., 45 minutes to identify root cause due to missing logs] - [e.g., no runbook for this failure mode] ## Action Items | Action | Owner | Due Date | Priority | |--------|-------|----------|----------| | [Specific action] | [name/team] | YYYY-MM-DD | P1/P2/P3 | ## Lessons Learned [2-3 sentences. The key insight(s) that will change how things are done. If nothing changed, the postmortem wasn't honest enough.] ``` ## Blameless Root Cause "Engineer ran the wrong command" is **not** a root cause. "We had no guard against running destructive commands on production" **is**. Action items must address systems and processes, not individuals. ## Good vs Bad Action Items | ❌ Bad | ✅ Good | |--------|---------| | "Be more careful when deploying" | "Add `--dry-run` verification step to deploy runbook" | | "Add more monitoring" | "Add Grafana alert for OpenBao seal status with 5-minute threshold" | | "Improve communication" | "Add #incidents Slack channel to runbook as required notification step" | ## Tone - Past tense, factual, specific timestamps - "The service was unavailable for 47 minutes" — not "there was some downtime" - "We did not have alerting on X" — not "unfortunately alerting was missing" - Action items use active voice: "Add X", "Update Y", "Remove Z"