---
name: writing-postmortem
description: Use when writing a postmortem, incident review, or after-action report for any outage, degradation, security event, or significant failure — whether homelab or professional context.
---

# Skill: Writing a Postmortem

## Overview

A postmortem is a blameless, structured document that captures what happened, why it happened, what the impact was, and how to prevent recurrence. It is a learning tool and a commitment to improvement, not a way to assign blame.

**Core principle:** Humans make mistakes. The question is: what system conditions made the mistake harmful?

## Workflow

1. **Gather the timeline**: check logs, chat history, alerts, `kubectl` events, and monitoring dashboards
2. **Apply the 5 Whys to find the root cause**: keep asking "why?" until you hit a system or process gap, not a person
3. **Write the draft**: start with the timeline, then work backwards to the root cause
4. **Review**: in work contexts, have someone else read it; fresh eyes catch gaps
5. **Finalize action items**: share them with the team before locking them in
6. **Save to BookStack:**
   - Homelab: Create page in Ansiblestack book (ID 79), named `Postmortem YYYY-MM-DD: <title>`
   - Work: Create page in relevant project book or a dedicated "Postmortems" chapter
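
The homelab naming convention in step 6 can be sketched in Python. This is a minimal sketch, not a definitive integration: it assumes the BookStack pages API accepts `book_id`, `name`, and `markdown` fields, which should be verified against your instance's API docs. The helper names, the incident date, and the title are hypothetical, and nothing here sends a request.

```python
from datetime import date

HOMELAB_BOOK_ID = 79  # Ansiblestack book ID from the workflow above


def postmortem_page_name(incident_date: date, title: str) -> str:
    """Build the page name in the `Postmortem YYYY-MM-DD: <title>` format."""
    return f"Postmortem {incident_date.isoformat()}: {title.strip()}"


def build_page_payload(incident_date: date, title: str, markdown_body: str,
                       book_id: int = HOMELAB_BOOK_ID) -> dict:
    """Assemble a request body for creating the page.

    Assumption: the field names (book_id, name, markdown) match the
    BookStack pages API. Check them before relying on this.
    """
    return {
        "book_id": book_id,
        "name": postmortem_page_name(incident_date, title),
        "markdown": markdown_body,
    }


# Hypothetical incident, used only to show the resulting page name.
payload = build_page_payload(
    date(2024, 3, 9),
    "OpenBao sealed after node reboot",
    "# Postmortem: OpenBao sealed after node reboot\n",
)
print(payload["name"])  # Postmortem 2024-03-09: OpenBao sealed after node reboot
```

Keeping the name builder separate from the payload makes it easy to reuse the same convention for a work book by passing a different `book_id`.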

## Severity Guide

| Severity | Criteria | Response |
|----------|----------|----------|
| P1 | Complete outage, data loss, security breach | All hands, immediate |
| P2 | Major degradation, SLA at risk, significant user impact | Urgent |
| P3 | Partial degradation, workaround available, limited impact | Fix within hours |
| P4 | Minor issue, no user impact, caught in monitoring | Fix within days |
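
The table above can be read as a top-down decision rule. The sketch below encodes one possible reading, assuming that any single criterion in a row is enough to qualify and that the most severe matching row wins; the table itself does not state this. `IncidentFacts` and `classify_severity` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Yes/no answers mirroring the criteria column of the severity table."""
    complete_outage: bool = False
    data_loss: bool = False
    security_breach: bool = False
    major_degradation: bool = False
    sla_at_risk: bool = False
    significant_user_impact: bool = False
    partial_degradation: bool = False
    limited_user_impact: bool = False


def classify_severity(facts: IncidentFacts) -> str:
    """Return the most severe level whose row has at least one matching criterion."""
    if facts.complete_outage or facts.data_loss or facts.security_breach:
        return "P1"
    if facts.major_degradation or facts.sla_at_risk or facts.significant_user_impact:
        return "P2"
    if facts.partial_degradation or facts.limited_user_impact:
        return "P3"
    # Nothing above matched: minor issue with no user impact.
    return "P4"


print(classify_severity(IncidentFacts(partial_degradation=True, sla_at_risk=True)))  # P2
```

When criteria from several rows apply, the check order means the higher severity is chosen, which errs on the side of a faster response.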

## Document Template

```markdown
# Postmortem: [Incident Title]

**Date:** YYYY-MM-DD
**Duration:** HH:MM (detection to resolution)
**Severity:** P1 / P2 / P3 / P4
**Status:** Resolved / Ongoing
**Author(s):** [names]
**Reviewers:** [names — work context]

## Executive Summary

[2-3 sentences. What broke, how long, what was the user/business impact. Written for non-technical stakeholders.]

## Timeline

| Time (UTC) | Event |
|------------|-------|
| HH:MM | First symptom / alert fired |
| HH:MM | Detection / someone noticed |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |
| HH:MM | Full resolution / monitoring confirmed stable |

## Impact

- **Services affected:** [list]
- **Users affected:** [number, "all users", or "internal only"]
- **Data loss:** [yes/no — if yes, describe]
- **Revenue/SLA impact:** [if applicable]

## Root Cause

[1-2 paragraphs. The actual technical cause. Not "human error" — go deeper: what condition made the error possible?]

## Contributing Factors

- [e.g., no alerting on X metric]
- [e.g., deploy process lacked verification step]
- [e.g., documentation was out of date]

## What Went Well

- [e.g., monitoring caught it within 5 minutes]
- [e.g., rollback procedure worked as expected]

## What Went Poorly

- [e.g., 45 minutes to identify root cause due to missing logs]
- [e.g., no runbook for this failure mode]

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [Specific action] | [name/team] | YYYY-MM-DD | P1/P2/P3 |

## Lessons Learned

[2-3 sentences. The key insight(s) that will change how things are done. If nothing changed, the postmortem wasn't honest enough.]
```
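
The `Duration` field in the template is the gap between two timeline rows, formatted as `HH:MM`. A minimal sketch of that arithmetic follows, assuming "detection" maps to the "Detection / someone noticed" row and "resolution" to whichever restoration row your team treats as the end of the incident. The function name and the timestamps are hypothetical.

```python
from datetime import datetime, timezone


def format_duration(detected: datetime, resolved: datetime) -> str:
    """Return the detection-to-resolution gap as HH:MM, rounded down to the minute."""
    if resolved < detected:
        raise ValueError("resolution time precedes detection time")
    total_minutes = int((resolved - detected).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


# Hypothetical timestamps in UTC, matching the timeline's time zone.
detected = datetime(2024, 3, 9, 14, 5, tzinfo=timezone.utc)
resolved = datetime(2024, 3, 9, 14, 52, tzinfo=timezone.utc)
print(format_duration(detected, resolved))  # 00:47
```

Using full `datetime` values rather than bare `HH:MM` strings keeps the result correct for incidents that cross midnight.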

## Blameless Root Cause

"Engineer ran the wrong command" is **not** a root cause.
"We had no guard against running destructive commands on production" **is**.

Action items must address systems and processes, not individuals.

## Good vs Bad Action Items

| ❌ Bad | ✅ Good |
|--------|---------|
| "Be more careful when deploying" | "Add `--dry-run` verification step to deploy runbook" |
| "Add more monitoring" | "Add a Grafana alert that fires when OpenBao has been sealed for more than 5 minutes" |
| "Improve communication" | "Add #incidents Slack channel to runbook as required notification step" |

## Tone

- Past tense, factual, specific timestamps
- "The service was unavailable for 47 minutes" — not "there was some downtime"
- "We did not have alerting on X" — not "unfortunately alerting was missing"
- Action items use active voice: "Add X", "Update Y", "Remove Z"
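
The guidance on action items (specific, active voice, with an owner and a due date) can be partly checked mechanically. The sketch below is a rough heuristic, not a substitute for review: the verb list and the vague-phrase list are hypothetical starting points, and the example owner and date are made up.

```python
import re
from dataclasses import dataclass

# Hypothetical word lists; extend them to fit your team's vocabulary.
IMPERATIVE_VERBS = {"add", "update", "remove", "create", "document", "automate", "replace"}
VAGUE_PHRASES = ("be more careful", "more monitoring", "improve communication")
VALID_PRIORITIES = {"P1", "P2", "P3"}  # values listed in the template's Priority column
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


@dataclass
class ActionItem:
    action: str
    owner: str
    due_date: str
    priority: str


def lint_action_item(item: ActionItem) -> list[str]:
    """Return a list of problems; an empty list means the item passed every check."""
    problems = []
    words = item.action.split()
    first_word = words[0].lower() if words else ""
    if first_word not in IMPERATIVE_VERBS:
        problems.append("action should start with a verb such as Add, Update, or Remove")
    if any(phrase in item.action.lower() for phrase in VAGUE_PHRASES):
        problems.append("action is too vague to verify")
    if not item.owner.strip():
        problems.append("owner is missing")
    if not DATE_PATTERN.match(item.due_date):
        problems.append("due date is not in YYYY-MM-DD format")
    if item.priority not in VALID_PRIORITIES:
        problems.append("priority is not one of P1, P2, P3")
    return problems


good = ActionItem("Add `--dry-run` verification step to deploy runbook",
                  "platform team", "2024-04-01", "P2")
bad = ActionItem("Be more careful when deploying", "", "next week", "P3")
print(lint_action_item(good))      # []
print(len(lint_action_item(bad)))  # 4
```

Note that "Add more monitoring" starts with an accepted verb but is still flagged as vague, which mirrors the bad example in the table above.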