autojanet/skills/cnpg-database/SKILL.md
Zoë cc74ad0bd0
2026-05-30 15:43:14 -07:00


---
name: cnpg-database
description: Use when deploying, configuring, or troubleshooting CloudNativePG PostgreSQL clusters on Zoe's k3s homelab, including bootstrapping, secrets, S3 backups, migrations, and common failure modes.
---
# CloudNativePG (CNPG) on k3s Homelab
## Overview
Deploy and operate CNPG PostgreSQL clusters on the production k3s cluster at `10.0.6.10`, which runs CNPG operator v1.28.1. Always use ArgoCD sync-waves to enforce creation order.
## Environment
| Setting | Value |
|---------|-------|
| CNPG operator | 1.28.1 |
| PostgreSQL image | `ghcr.io/cloudnative-pg/postgresql:18.1-system-trixie` (includes pgvector as `vector.so`) |
| Fast storage | `nvme` (NFS-NVMe) |
| Standard storage | `ssd` (NFS-SSD) |
| S3 endpoint | `https://s3.ctz.fyi` |
| S3 bucket | `cnpg-backups` |
| Secrets backend | External Secrets Operator → ClusterSecretStore `openbao` |
| OpenBao path | `secret/production/<namespace>/<cluster-name>` |
## Sync-Wave Order (Critical)
| Wave | Resource |
|------|----------|
| `-2` | CNPG `Cluster` |
| `-1` | `ExternalSecret` for DB credentials |
| `0` | App `Deployment` |
## Step 1 — Write Secrets to OpenBao
Do this **before** deploying anything:
```bash
bao kv put secret/production/<namespace>/<app>-db \
  username=<app> \
  password="$(openssl rand -base64 32 | tr -d '/=+' | head -c 32)"
```
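The password pipeline above strips `/`, `=`, and `+` from base64 output and then truncates to 32 characters. As a minimal sketch of the same idea, the hypothetical helper below (`gen_db_password` is not part of the original workflow) reads from `/dev/urandom` instead of `openssl` and keeps only alphanumerics, so the result needs no URL-encoding in a connection string.

```bash
# Hypothetical helper, not part of the original workflow: print an
# alphanumeric password of the requested length (default 32).
gen_db_password() {
  local length="${1:-32}"
  # Read a bounded chunk of random bytes, drop everything that is not
  # alphanumeric, then trim to the requested length. 1024 bytes leaves
  # roughly 250 usable characters, which is ample for short passwords.
  head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | head -c "$length"
}
```

It could then be used as `password="$(gen_db_password)"` in the `bao kv put` call, assuming the alphanumeric-only character set is acceptable for the application.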
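Also write the S3 backup credentials once per namespace. The Cluster manifest in Step 3 reads them from a Kubernetes Secret named `cnpg-backup-s3-credentials`, so they must also be synced into the namespace (for example with a second ExternalSecret):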
```bash
bao kv put secret/production/<namespace>/cnpg-backup-s3-credentials \
  ACCESS_KEY_ID=<key> \
  ACCESS_SECRET_KEY=<secret>
```
## Step 2 — ExternalSecret (sync-wave -1)
```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: <app>-db-credentials
  namespace: <app>
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: openbao
    kind: ClusterSecretStore
  target:
    name: <app>-db-credentials
    creationPolicy: Owner
  data:
    - secretKey: username
      remoteRef:
        key: secret/production/<namespace>/<app>-db
        property: username
    - secretKey: password
      remoteRef:
        key: secret/production/<namespace>/<app>-db
        property: password
```
## Step 3 — CNPG Cluster (sync-wave -2)
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <app>-db
  namespace: <app>
  annotations:
    argocd.argoproj.io/sync-wave: "-2"
spec:
  instances: 3 # Use 1 for dev/small workloads
  imageName: ghcr.io/cloudnative-pg/postgresql:18.1-system-trixie
  storage:
    size: 10Gi
    storageClass: nvme # or ssd
  bootstrap:
    initdb:
      database: <app>
      owner: <app>
      secret:
        name: <app>-db-credentials # MUST have keys 'username' and 'password' exactly
  backup:
    barmanObjectStore:
      destinationPath: s3://cnpg-backups/<app>
      endpointURL: https://s3.ctz.fyi
      s3Credentials:
        accessKeyId:
          name: cnpg-backup-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-backup-s3-credentials
          key: ACCESS_SECRET_KEY
    retentionPolicy: "30d"
```
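Because the manifest above depends on the bootstrap secret exposing exactly the keys `username` and `password`, it can help to verify the synced Secret before troubleshooting anything else. The sketch below is an assumption-laden helper, not part of the original skill: `check_bootstrap_keys` is a hypothetical function that reads key names on stdin (one per line) and fails if either required key is missing.

```bash
# Hypothetical helper: read Secret key names on stdin, one per line, and
# return non-zero unless both 'username' and 'password' are present.
check_bootstrap_keys() {
  local keys required missing=0
  keys="$(cat)"
  for required in username password; do
    # -x matches whole lines only, so 'user' does not satisfy 'username'.
    if ! grep -qx -- "$required" <<< "$keys"; then
      echo "missing key: $required" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage against a live cluster (placeholders as elsewhere in this document):
#   kubectl get secret <app>-db-credentials -n <namespace> \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | check_bootstrap_keys
```

The `go-template` output lists only key names, so no secret values are printed while checking.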
## CRITICAL: Secret Key Names
> **The bootstrap secret MUST have keys named exactly `username` and `password`.**
> CNPG will appear healthy but the app cannot connect if keys are wrong (e.g., `user`, `pass`, `POSTGRES_USER`).
> CNPG does NOT create a separate `-app` secret when `bootstrap.initdb.secret` is provided.
## Connecting from the App
CNPG auto-creates these services:
| Service | Use |
|---------|-----|
| `<cluster>-rw` | Read-write (primary) — **use this for app writes** |
| `<cluster>-ro` | Read-only (replicas) — use for read-heavy queries |
| `<cluster>-r` | Any instance |
```
postgresql://<username>:<password>@<app>-db-rw.<namespace>.svc.cluster.local:5432/<database>
```
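To make the connection string template above concrete, here is a small sketch that assembles it from its parts. The function name `cnpg_rw_uri` and its argument order are illustrative assumptions, and it does no percent-encoding, so it is only safe for credentials without URI-reserved characters (such as a purely alphanumeric password).

```bash
# Hypothetical helper: print the read-write connection URI for a cluster.
# Arguments: username, password, cluster name, namespace, database.
# No percent-encoding is applied; avoid reserved characters in credentials.
cnpg_rw_uri() {
  local user="$1" password="$2" cluster="$3" namespace="$4" database="$5"
  printf 'postgresql://%s:%s@%s-rw.%s.svc.cluster.local:5432/%s\n' \
    "$user" "$password" "$cluster" "$namespace" "$database"
}
```

For example, `cnpg_rw_uri app s3cret app-db app app` prints `postgresql://app:s3cret@app-db-rw.app.svc.cluster.local:5432/app`.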
## Manual Database Access
```bash
# psql on the primary (assumes <cluster>-1 is still the primary;
# confirm with: kubectl cnpg status <cluster> -n <namespace>)
kubectl exec -n <namespace> -it <cluster>-1 -- psql -U <username> <database>
# via cnpg plugin
kubectl cnpg psql <cluster> -n <namespace>
# pg_dump
kubectl exec -n <namespace> <cluster>-1 -- \
  pg_dump -U <username> <database> > dump.sql
# restore
kubectl exec -n <namespace> -i <cluster>-1 -- \
  psql -U <username> <database> < dump.sql
```
## Migrating from Docker/External Postgres
```bash
# 1. Dump from source
pg_dump -h <old-host> -U <user> <database> > dump.sql
# 2. Copy into pod
kubectl cp dump.sql <namespace>/<pod>:/tmp/dump.sql
# 3. Restore
kubectl exec -n <namespace> -it <pod> -- \
  psql -U <username> <database> -f /tmp/dump.sql
```
## Scheduled Backups (Optional)
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: <app>-db-backup
  namespace: <app>
spec:
  schedule: "0 0 2 * * *" # 02:00 daily; CNPG uses six cron fields, the first is seconds
  backupOwnerReference: self
  cluster:
    name: <app>-db
```
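CNPG's `ScheduledBackup` schedule uses a six-field cron expression with a leading seconds field, unlike the five-field Unix crontab format, so a five-field value will not fire at the time it appears to describe. The following is a rough sketch of a pre-commit style check; `cron_field_count` and `is_cnpg_schedule` are hypothetical helper names, and the check only counts fields without validating their contents.

```bash
# Hypothetical helper: count whitespace-separated fields in a cron expression.
cron_field_count() {
  local -a fields
  # read -ra splits on whitespace without glob-expanding '*'.
  read -ra fields <<< "$1"
  echo "${#fields[@]}"
}

# Succeeds only when the expression has the six fields CNPG expects.
is_cnpg_schedule() {
  [ "$(cron_field_count "$1")" -eq 6 ]
}
```

For example, `is_cnpg_schedule "0 0 2 * * *"` succeeds, while `is_cnpg_schedule "0 2 * * *"` fails because it has only five fields.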
## Common Issues
| Symptom | Cause | Fix |
|---------|-------|-----|
| Cluster stuck at "Setting up primary" | Secret missing or wrong key names | Check `<app>-db-credentials` exists and has `username`/`password` keys |
| Pod in `Pending` | PVC can't provision | Check `nvme`/`ssd` NFS provisioner is healthy |
| App can't connect | Using pod IP or wrong service | Use `<cluster>-rw` service, not pod IP |
| 2/3 instances after node failure | Normal self-healing | Wait — CNPG will recover automatically |
| Stale data after cluster recreation | Old PVCs still present | Delete PVCs manually before clean redeploy |