fix: use library/ Harbor project, add skills, fix pipeline secrets

- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/

2026-05-30 15:43:14 -07:00

name	description
incident-response	Use when responding to production outages, data loss events, security incidents, or major service degradations in homelab (k3s/ansiblestack) or professional (AWS/EKS) environments. Applies at any severity — P1 complete outages to P4 minor issues.

Incident Response

Overview

Structured response for production incidents. Severity scales the rigor: a homelab P3 does not need the process of a work P1.

Core principle: Stabilize user impact FIRST. Understand why SECOND. Never diagnose in silence.

Severity

Severity	Definition	Response SLA	Examples
P1	Complete outage OR data loss OR security breach	Immediate (minutes)	Prod DB down, credentials leaked, all users blocked
P2	Major degradation, SLA at risk, significant user impact	Urgent (< 30 min)	50%+ error rate, primary feature broken
P3	Partial degradation, workaround exists	Same day	One region/service slow, single feature broken
P4	Minor issue, no user impact	Within days	Monitoring gap, cosmetic issue
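The table above can be read as a top-down decision order, where the first matching row wins. The following is a minimal sketch of that reading, not a definitive policy: the `Signals` fields are hypothetical names, and the 0.5 cutoff simply reuses the "50%+ error rate" example from the P2 row as a threshold.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical triage inputs; the field names are illustrative only."""
    complete_outage: bool = False
    data_loss: bool = False
    security_breach: bool = False
    error_rate: float = 0.0  # fraction of failing requests, 0.0 to 1.0
    primary_feature_broken: bool = False
    user_impact: bool = False


def classify(s: Signals) -> str:
    """Map triage signals to P1..P4, checking the most severe row first."""
    if s.complete_outage or s.data_loss or s.security_breach:
        return "P1"
    # Assumption: the table's "50%+ error rate" example is used as the cutoff.
    if s.error_rate >= 0.5 or s.primary_feature_broken:
        return "P2"
    if s.user_impact:
        return "P3"
    return "P4"
```

When signals are ambiguous, rounding up to the more severe level fits the "declare immediately" guidance in Phase 1.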

Phase 1: Triage (first 5-10 minutes)

Goal: confirm the incident, assess severity, start communication.

1. CONFIRM — is this actually broken?
   - Check from multiple locations/devices
   - Check AWS Status / DigitalOcean Status / upstream providers
   - Ask: is anyone else seeing this?

2. SCOPE — who/what is affected?
   - Which services? Which regions? Which users?
   - Is data being lost RIGHT NOW?
   - Stable or getting worse?

3. DECLARE — P1/P2: declare immediately, don't wait to diagnose
   - Work: post in incident channel, page on-call, open incident ticket
   - Homelab: create Vikunja task, start BookStack incident page

4. ASSIGN ROLES (work P1/P2)
   - Incident Commander: coordinates, communicates, makes calls
   - Tech Lead: root cause investigation
   - Comms Lead: stakeholder updates
   - (Homelab: you're all three)

Phase 2: Stabilize (before root cause)

Fix user impact first. Common actions:

# Check rollout history to see which revision you would return to
kubectl rollout history deployment/<name> -n <ns>

# Roll back to the previous revision (add --to-revision=<n> to pick a specific one)
kubectl rollout undo deployment/<name> -n <ns>

# Scale up healthy replicas
kubectl scale deploy/<name> --replicas=5 -n <ns>

Other mitigations:

- Route traffic away from broken region/AZ
- Disable the broken feature flag
- Restore from backup (data loss)
- Rotate credentials (security incident)

A rollback that takes 5 minutes beats a fix that takes 2 hours.

Phase 3: Investigate (root cause)

Now that users are unblocked:

# Recent events
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -30

# Logs (kubectl)
kubectl logs -n <ns> deploy/<name> --since=1h

# Logs (Grafana Loki)
{namespace="<ns>"}

# Describe node for resource pressure
kubectl describe node <name>

For AWS: CloudTrail, CloudWatch Logs, ALB access logs, X-Ray traces.

In Grafana, query the Mimir metrics to find when the anomaly started. That inflection point tells you which deploys, config changes, or events to examine.
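If the metric has been exported as a time series, locating the inflection point can be roughed out as a baseline comparison. This sketch assumes the samples are already in hand as `(timestamp, value)` pairs. It makes no Mimir or Grafana API call, and `find_inflection`, `baseline_len`, and `k` are hypothetical names and defaults.

```python
from statistics import mean, stdev


def find_inflection(samples, baseline_len=5, k=3.0):
    """Return the timestamp of the first sample that exceeds the baseline
    mean by more than k standard deviations, or None if none does.

    samples: list of (timestamp, value) pairs in time order.
    The first baseline_len samples are assumed to be healthy.
    """
    if baseline_len < 2 or len(samples) <= baseline_len:
        return None
    baseline = [value for _, value in samples[:baseline_len]]
    threshold = mean(baseline) + k * stdev(baseline)
    for ts, value in samples[baseline_len:]:
        if value > threshold:
            return ts
    return None
```

The returned timestamp is only a starting point; compare it against deployment times and the `kubectl get events` output above.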

Phase 4: Resolve

1. Deploy actual fix (not just the stabilization mitigation)
2. Verify service is healthy — not just "pods are running":
   - Check error rates in Grafana
   - Check latency is normal
   - Spot-check actual user flows
3. Monitor 15-30 minutes before declaring resolved
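The "monitor before declaring resolved" rule can be expressed as a simple gate over post-fix samples. This is a sketch under stated assumptions: the sample tuple layout, the function name, and both thresholds (1% error rate, 500 ms p95 latency) are hypothetical and should be replaced with the service's own normal values.

```python
def ready_to_resolve(samples, window_minutes=15,
                     max_error_rate=0.01, max_p95_latency_ms=500.0):
    """Decide whether the monitoring window supports declaring resolved.

    samples: list of (minute_offset, error_rate, p95_latency_ms) tuples,
    oldest first, where minute_offset counts minutes since the fix shipped.
    Returns True only if the samples span at least window_minutes and
    every sample is within both thresholds.
    """
    if not samples:
        return False
    observed_minutes = samples[-1][0] - samples[0][0]
    if observed_minutes < window_minutes:
        return False
    return all(
        err <= max_error_rate and latency <= max_p95_latency_ms
        for _, err, latency in samples
    )
```

A passing gate covers only the first two checks in the list; spot-checking real user flows still has to be done by hand.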

Phase 5: Communicate

During incident (P1/P2 — every 15-30 minutes):

[14:32 UTC] INCIDENT UPDATE — <service> degradation
Status: Investigating
Impact: <X users/services affected>
Last action: Rolled back deployment v1.2.3
Next update: 14:47 UTC
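Filling in the template by hand under pressure makes it easy to forget the next-update time. A small formatter, sketched here with hypothetical function and parameter names, computes it from the current time and the chosen interval. It assumes a timezone-aware `datetime` and leaves the headline wording to the caller.

```python
from datetime import datetime, timedelta, timezone


def format_update(now, headline, status, impact, last_action,
                  interval_minutes=15):
    """Render an incident update in the layout shown above.

    now must be timezone-aware; it is converted to UTC before formatting.
    """
    now = now.astimezone(timezone.utc)
    next_update = now + timedelta(minutes=interval_minutes)
    return "\n".join([
        f"[{now:%H:%M} UTC] {headline}",
        f"Status: {status}",
        f"Impact: {impact}",
        f"Last action: {last_action}",
        f"Next update: {next_update:%H:%M} UTC",
    ])
```

Committing to a concrete next-update time in every message is what keeps the 15-30 minute cadence honest.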

On resolution:

[15:10 UTC] RESOLVED — <service> is operational
Duration: 38 minutes (14:32–15:10 UTC)
Root cause: <brief description>
Fix applied: <what was done>
Postmortem: <link or "to follow within 48h">
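The duration line is easy to get wrong when an incident crosses an hour or midnight boundary. A minimal sketch, assuming timezone-aware UTC start and end times and a hypothetical helper name, derives it from the two timestamps:

```python
from datetime import datetime, timezone


def duration_line(start, end):
    """Build the 'Duration:' line from aware UTC start and end datetimes."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    minutes = int((end - start).total_seconds() // 60)
    return f"Duration: {minutes} minutes ({start:%H:%M}–{end:%H:%M} UTC)"
```

For the example above, a 14:32 start and a 15:10 end give 38 minutes.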

Work P1: never go silent for > 15 minutes. Communicate first, diagnose second.

Phase 6: Post-Incident

- Within 24-48h: write postmortem (use writing-postmortem skill if available)
- Update runbooks with anything that was missing
- Create Vikunja tasks for action items
- Save incident timeline to BookStack

Security Incidents: Extra Steps

Order matters — don't skip ahead:

1. ISOLATE — kill or network-isolate the compromised resource before investigating
2. PRESERVE — snapshot, export logs before destroying anything
3. ROTATE — all potentially exposed credentials immediately
4. NOTIFY — security team, CISO, legal as appropriate
5. SCOPE before disclosing — do not announce publicly until you understand blast radius

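GDPR: a personal data breach must be reported to the supervisory authority within 72 hours of becoming aware of it, unless it is unlikely to put individuals' rights and freedoms at risk.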
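Recording the notification deadline when a breach is confirmed helps keep it from slipping during the response. This sketch assumes the 72-hour clock starts at the moment the breach is discovered, passed in as a timezone-aware `datetime`. The function names are hypothetical, and this is an arithmetic aid, not legal guidance.

```python
from datetime import datetime, timedelta, timezone

NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at):
    """Return the latest time the regulator can be notified."""
    return aware_at + NOTIFICATION_WINDOW


def hours_remaining(aware_at, now):
    """Hours left before the deadline; negative once it has passed."""
    return (notification_deadline(aware_at) - now).total_seconds() / 3600
```

Adding the deadline to the incident ticket next to the discovery time gives the Comms Lead a fixed target.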

Homelab Specifics

- Create Vikunja task in relevant project when declaring
- Document timeline in BookStack: Ansiblestack book → new page Incident YYYY-MM-DD: <title>
- No stakeholder comms needed, but still write the postmortem — future-you will thank you

Common Homelab Incidents

Incident	Quick fix
OpenBao sealed	`kubectl exec -n openbao openbao-0 -- bao status` — should auto-unseal via OCI KMS; check OCI KMS key status if not
ArgoCD all apps OutOfSync	Check Forgejo is reachable; check ArgoCD repo credentials
cert-manager not issuing	Check DNS propagation; check DigitalOcean token; check cert-manager pod logs
NFS storage unavailable	Check NFS server at 10.0.6.2; check pods in `nfs-provisioner` namespace
All pods evicted	Node disk pressure — `kubectl describe node <name>`, check disk usage

Common Mistakes

Mistake	Reality
Diagnosing in silence for 30+ minutes	Communicate first, even with "investigating"
Fixing before declaring	Declaration triggers backup/support; don't skip it
Declaring resolved before monitoring	Check error rates and latency, not just pod status
Investigating before stabilizing	Users are down while you read logs. Roll back first.
Skipping postmortem on homelab	You will hit this again. Write it down.
