ci/woodpecker/push/woodpecker Pipeline failed

Details

fix: use library/ Harbor project, add skills, fix pipeline secrets

- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/

2026-05-30 15:43:14 -07:00

4.2 KiB

Raw Permalink Blame History

name	description
writing-postmortem	Use when writing a postmortem, incident review, or after-action report for any outage, degradation, security event, or significant failure — whether homelab or professional context.

Skill: Writing a Postmortem

Overview

A postmortem is a blameless structured document that captures what happened, why it happened, what the impact was, and how to prevent recurrence. It is a learning tool and a commitment to improvement — not a blame assignment document.

Core principle: Humans make mistakes. The question is: what system conditions made the mistake harmful?

Workflow

Gather the timeline — check logs, chat history, alerts, kubectl events, monitoring dashboards
Apply 5-whys to find root cause — keep asking "why?" until you hit a system or process gap, not a person
Write draft — start with timeline, work backwards to root cause
Review — have someone else read it in work contexts; fresh eyes catch gaps
Finalize action items — share with team before locking
Save to BookStack:
- Homelab: Create page in Ansiblestack book (ID 79), named Postmortem YYYY-MM-DD: <title>
- Work: Create page in relevant project book or a dedicated "Postmortems" chapter

Severity Guide

Severity	Criteria	Response
P1	Complete outage, data loss, security breach	All hands, immediate
P2	Major degradation, SLA at risk, significant user impact	Urgent
P3	Partial degradation, workaround available, limited impact	Fix within hours
P4	Minor issue, no user impact, caught in monitoring	Fix within days

Document Template

# Postmortem: [Incident Title]

**Date:** YYYY-MM-DD
**Duration:** HH:MM (detection to resolution)
**Severity:** P1 / P2 / P3 / P4
**Status:** Resolved / Ongoing
**Author(s):** [names]
**Reviewers:** [names — work context]

## Executive Summary

[2-3 sentences. What broke, how long, what was the user/business impact. Written for non-technical stakeholders.]

## Timeline

| Time (UTC) | Event |
|------------|-------|
| HH:MM | First symptom / alert fired |
| HH:MM | Detection / someone noticed |
| HH:MM | Investigation started |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |
| HH:MM | Full resolution / monitoring confirmed stable |

## Impact

- **Services affected:** [list]
- **Users affected:** [number, "all users", or "internal only"]
- **Data loss:** [yes/no — if yes, describe]
- **Revenue/SLA impact:** [if applicable]

## Root Cause

[1-2 paragraphs. The actual technical cause. Not "human error" — go deeper: what condition made the error possible?]

## Contributing Factors

- [e.g., no alerting on X metric]
- [e.g., deploy process lacked verification step]
- [e.g., documentation was out of date]

## What Went Well

- [e.g., monitoring caught it within 5 minutes]
- [e.g., rollback procedure worked as expected]

## What Went Poorly

- [e.g., 45 minutes to identify root cause due to missing logs]
- [e.g., no runbook for this failure mode]

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [Specific action] | [name/team] | YYYY-MM-DD | P1/P2/P3 |

## Lessons Learned

[2-3 sentences. The key insight(s) that will change how things are done. If nothing changed, the postmortem wasn't honest enough.]

Blameless Root Cause

"Engineer ran the wrong command" is not a root cause. "We had no guard against running destructive commands on production" is.

Action items must address systems and processes, not individuals.

Good vs Bad Action Items

❌ Bad	✅ Good
"Be more careful when deploying"	"Add `--dry-run` verification step to deploy runbook"
"Add more monitoring"	"Add Grafana alert for OpenBao seal status with 5-minute threshold"
"Improve communication"	"Add #incidents Slack channel to runbook as required notification step"

Tone

Past tense, factual, specific timestamps
"The service was unavailable for 47 minutes" — not "there was some downtime"
"We did not have alerting on X" — not "unfortunately alerting was missing"
Action items use active voice: "Add X", "Update Y", "Remove Z"

4.2 KiB Raw Permalink Blame History