Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/
120 lines
4.2 KiB
Markdown
120 lines
4.2 KiB
Markdown
---
|
|
name: writing-postmortem
|
|
description: Use when writing a postmortem, incident review, or after-action report for any outage, degradation, security event, or significant failure — whether homelab or professional context.
|
|
---
|
|
|
|
# Skill: Writing a Postmortem
|
|
|
|
## Overview
|
|
|
|
A postmortem is a blameless structured document that captures what happened, why it happened, what the impact was, and how to prevent recurrence. It is a learning tool and a commitment to improvement — not a blame assignment document.
|
|
|
|
**Core principle:** Humans make mistakes. The question is: what system conditions made the mistake harmful?
|
|
|
|
## Workflow
|
|
|
|
1. **Gather the timeline** — check logs, chat history, alerts, kubectl events, monitoring dashboards
|
|
2. **Apply 5-whys to find root cause** — keep asking "why?" until you hit a system or process gap, not a person
|
|
3. **Write draft** — start with timeline, work backwards to root cause
|
|
4. **Review** — have someone else read it in work contexts; fresh eyes catch gaps
|
|
5. **Finalize action items** — share with team before locking
|
|
6. **Save to BookStack:**
|
|
- Homelab: Create page in Ansiblestack book (ID 79), named `Postmortem YYYY-MM-DD: <title>`
|
|
- Work: Create page in relevant project book or a dedicated "Postmortems" chapter
|
|
|
|
## Severity Guide
|
|
|
|
| Severity | Criteria | Response |
|
|
|----------|----------|----------|
|
|
| P1 | Complete outage, data loss, security breach | All hands, immediate |
|
|
| P2 | Major degradation, SLA at risk, significant user impact | Urgent |
|
|
| P3 | Partial degradation, workaround available, limited impact | Fix within hours |
|
|
| P4 | Minor issue, no user impact, caught in monitoring | Fix within days |
|
|
|
|
## Document Template
|
|
|
|
```markdown
|
|
# Postmortem: [Incident Title]
|
|
|
|
**Date:** YYYY-MM-DD
|
|
**Duration:** HH:MM (detection to resolution)
|
|
**Severity:** P1 / P2 / P3 / P4
|
|
**Status:** Resolved / Ongoing
|
|
**Author(s):** [names]
|
|
**Reviewers:** [names — work context]
|
|
|
|
## Executive Summary
|
|
|
|
[2-3 sentences. What broke, how long, what was the user/business impact. Written for non-technical stakeholders.]
|
|
|
|
## Timeline
|
|
|
|
| Time (UTC) | Event |
|
|
|------------|-------|
|
|
| HH:MM | First symptom / alert fired |
|
|
| HH:MM | Detection / someone noticed |
|
|
| HH:MM | Investigation started |
|
|
| HH:MM | Root cause identified |
|
|
| HH:MM | Mitigation applied |
|
|
| HH:MM | Service restored |
|
|
| HH:MM | Full resolution / monitoring confirmed stable |
|
|
|
|
## Impact
|
|
|
|
- **Services affected:** [list]
|
|
- **Users affected:** [number, "all users", or "internal only"]
|
|
- **Data loss:** [yes/no — if yes, describe]
|
|
- **Revenue/SLA impact:** [if applicable]
|
|
|
|
## Root Cause
|
|
|
|
[1-2 paragraphs. The actual technical cause. Not "human error" — go deeper: what condition made the error possible?]
|
|
|
|
## Contributing Factors
|
|
|
|
- [e.g., no alerting on X metric]
|
|
- [e.g., deploy process lacked verification step]
|
|
- [e.g., documentation was out of date]
|
|
|
|
## What Went Well
|
|
|
|
- [e.g., monitoring caught it within 5 minutes]
|
|
- [e.g., rollback procedure worked as expected]
|
|
|
|
## What Went Poorly
|
|
|
|
- [e.g., 45 minutes to identify root cause due to missing logs]
|
|
- [e.g., no runbook for this failure mode]
|
|
|
|
## Action Items
|
|
|
|
| Action | Owner | Due Date | Priority |
|
|
|--------|-------|----------|----------|
|
|
| [Specific action] | [name/team] | YYYY-MM-DD | P1/P2/P3 |
|
|
|
|
## Lessons Learned
|
|
|
|
[2-3 sentences. The key insight(s) that will change how things are done. If nothing changed, the postmortem wasn't honest enough.]
|
|
```
|
|
|
|
## Blameless Root Cause
|
|
|
|
"Engineer ran the wrong command" is **not** a root cause.
|
|
"We had no guard against running destructive commands on production" **is**.
|
|
|
|
Action items must address systems and processes, not individuals.
|
|
|
|
## Good vs Bad Action Items
|
|
|
|
| ❌ Bad | ✅ Good |
|
|
|--------|---------|
|
|
| "Be more careful when deploying" | "Add `--dry-run` verification step to deploy runbook" |
|
|
| "Add more monitoring" | "Add Grafana alert for OpenBao seal status with 5-minute threshold" |
|
|
| "Improve communication" | "Add #incidents Slack channel to runbook as required notification step" |
|
|
|
|
## Tone
|
|
|
|
- Past tense, factual, specific timestamps
|
|
- "The service was unavailable for 47 minutes" — not "there was some downtime"
|
|
- "We did not have alerting on X" — not "unfortunately alerting was missing"
|
|
- Action items use active voice: "Add X", "Update Y", "Remove Z"
|