Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/
4.2 KiB
4.2 KiB
| name | description |
|---|---|
| writing-postmortem | Use when writing a postmortem, incident review, or after-action report for any outage, degradation, security event, or significant failure — whether homelab or professional context. |
Skill: Writing a Postmortem
Overview
A postmortem is a blameless structured document that captures what happened, why it happened, what the impact was, and how to prevent recurrence. It is a learning tool and a commitment to improvement — not a blame assignment document.
Core principle: Humans make mistakes. The question is: what system conditions made the mistake harmful?
Workflow
- Gather the timeline — check logs, chat history, alerts, kubectl events, monitoring dashboards
- Apply 5-whys to find root cause — keep asking "why?" until you hit a system or process gap, not a person
- Write draft — start with timeline, work backwards to root cause
- Review — have someone else read it in work contexts; fresh eyes catch gaps
- Finalize action items — share with team before locking
- Save to BookStack:
- Homelab: Create page in Ansiblestack book (ID 79), named
Postmortem YYYY-MM-DD: <title> - Work: Create page in relevant project book or a dedicated "Postmortems" chapter
- Homelab: Create page in Ansiblestack book (ID 79), named
Severity Guide
| Severity | Criteria | Response |
|---|---|---|
| P1 | Complete outage, data loss, security breach | All hands, immediate |
| P2 | Major degradation, SLA at risk, significant user impact | Urgent |
| P3 | Partial degradation, workaround available, limited impact | Fix within hours |
| P4 | Minor issue, no user impact, caught in monitoring | Fix within days |
Document Template
# Postmortem: [Incident Title]
**Date:** YYYY-MM-DD
**Duration:** HH:MM (detection to resolution)
**Severity:** P1 / P2 / P3 / P4
**Status:** Resolved / Ongoing
**Author(s):** [names]
**Reviewers:** [names — work context]
## Executive Summary
[2-3 sentences. What broke, how long, what was the user/business impact. Written for non-technical stakeholders.]
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | First symptom / alert fired |
| HH:MM | Detection / someone noticed |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |
| HH:MM | Full resolution / monitoring confirmed stable |
## Impact
- **Services affected:** [list]
- **Users affected:** [number, "all users", or "internal only"]
- **Data loss:** [yes/no — if yes, describe]
- **Revenue/SLA impact:** [if applicable]
## Root Cause
[1-2 paragraphs. The actual technical cause. Not "human error" — go deeper: what condition made the error possible?]
## Contributing Factors
- [e.g., no alerting on X metric]
- [e.g., deploy process lacked verification step]
- [e.g., documentation was out of date]
## What Went Well
- [e.g., monitoring caught it within 5 minutes]
- [e.g., rollback procedure worked as expected]
## What Went Poorly
- [e.g., 45 minutes to identify root cause due to missing logs]
- [e.g., no runbook for this failure mode]
## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [Specific action] | [name/team] | YYYY-MM-DD | P1/P2/P3 |
## Lessons Learned
[2-3 sentences. The key insight(s) that will change how things are done. If nothing changed, the postmortem wasn't honest enough.]
Blameless Root Cause
"Engineer ran the wrong command" is not a root cause. "We had no guard against running destructive commands on production" is.
Action items must address systems and processes, not individuals.
Good vs Bad Action Items
| ❌ Bad | ✅ Good |
|---|---|
| "Be more careful when deploying" | "Add --dry-run verification step to deploy runbook" |
| "Add more monitoring" | "Add Grafana alert for OpenBao seal status with 5-minute threshold" |
| "Improve communication" | "Add #incidents Slack channel to runbook as required notification step" |
Tone
- Past tense, factual, specific timestamps
- "The service was unavailable for 47 minutes" — not "there was some downtime"
- "We did not have alerting on X" — not "unfortunately alerting was missing"
- Action items use active voice: "Add X", "Update Y", "Remove Z"