autojanet/skills/writing-postmortem/SKILL.md
Zoë cc74ad0bd0
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
fix: use library/ Harbor project, add skills, fix pipeline secrets
- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/
2026-05-30 15:43:14 -07:00

4.2 KiB

name description
writing-postmortem Use when writing a postmortem, incident review, or after-action report for any outage, degradation, security event, or significant failure — whether homelab or professional context.

Skill: Writing a Postmortem

Overview

A postmortem is a blameless structured document that captures what happened, why it happened, what the impact was, and how to prevent recurrence. It is a learning tool and a commitment to improvement — not a blame assignment document.

Core principle: Humans make mistakes. The question is: what system conditions made the mistake harmful?

Workflow

  1. Gather the timeline — check logs, chat history, alerts, kubectl events, monitoring dashboards
  2. Apply 5-whys to find root cause — keep asking "why?" until you hit a system or process gap, not a person
  3. Write draft — start with timeline, work backwards to root cause
  4. Review — have someone else read it in work contexts; fresh eyes catch gaps
  5. Finalize action items — share with team before locking
  6. Save to BookStack:
    • Homelab: Create page in Ansiblestack book (ID 79), named Postmortem YYYY-MM-DD: <title>
    • Work: Create page in relevant project book or a dedicated "Postmortems" chapter

Severity Guide

Severity Criteria Response
P1 Complete outage, data loss, security breach All hands, immediate
P2 Major degradation, SLA at risk, significant user impact Urgent
P3 Partial degradation, workaround available, limited impact Fix within hours
P4 Minor issue, no user impact, caught in monitoring Fix within days

Document Template

# Postmortem: [Incident Title]

**Date:** YYYY-MM-DD
**Duration:** HH:MM (detection to resolution)
**Severity:** P1 / P2 / P3 / P4
**Status:** Resolved / Ongoing
**Author(s):** [names]
**Reviewers:** [names — work context]

## Executive Summary

[2-3 sentences. What broke, how long, what was the user/business impact. Written for non-technical stakeholders.]

## Timeline

| Time (UTC) | Event |
|------------|-------|
| HH:MM | First symptom / alert fired |
| HH:MM | Detection / someone noticed |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |
| HH:MM | Full resolution / monitoring confirmed stable |

## Impact

- **Services affected:** [list]
- **Users affected:** [number, "all users", or "internal only"]
- **Data loss:** [yes/no — if yes, describe]
- **Revenue/SLA impact:** [if applicable]

## Root Cause

[1-2 paragraphs. The actual technical cause. Not "human error" — go deeper: what condition made the error possible?]

## Contributing Factors

- [e.g., no alerting on X metric]
- [e.g., deploy process lacked verification step]
- [e.g., documentation was out of date]

## What Went Well

- [e.g., monitoring caught it within 5 minutes]
- [e.g., rollback procedure worked as expected]

## What Went Poorly

- [e.g., 45 minutes to identify root cause due to missing logs]
- [e.g., no runbook for this failure mode]

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [Specific action] | [name/team] | YYYY-MM-DD | P1/P2/P3 |

## Lessons Learned

[2-3 sentences. The key insight(s) that will change how things are done. If nothing changed, the postmortem wasn't honest enough.]

Blameless Root Cause

"Engineer ran the wrong command" is not a root cause. "We had no guard against running destructive commands on production" is.

Action items must address systems and processes, not individuals.

Good vs Bad Action Items

Bad Good
"Be more careful when deploying" "Add --dry-run verification step to deploy runbook"
"Add more monitoring" "Add Grafana alert for OpenBao seal status with 5-minute threshold"
"Improve communication" "Add #incidents Slack channel to runbook as required notification step"

Tone

  • Past tense, factual, specific timestamps
  • "The service was unavailable for 47 minutes" — not "there was some downtime"
  • "We did not have alerting on X" — not "unfortunately alerting was missing"
  • Action items use active voice: "Add X", "Update Y", "Remove Z"