---
name: cnpg-database
description: Use when deploying, configuring, or troubleshooting CloudNativePG PostgreSQL clusters on Zoe's k3s homelab, including bootstrapping, secrets, S3 backups, migrations, and common failure modes.
---

# CloudNativePG (CNPG) on k3s Homelab

## Overview

Deploy and operate CNPG PostgreSQL clusters on the production k3s cluster at `10.0.6.10`. The CNPG operator version is v1.28.1. Always use ArgoCD sync-waves to enforce creation order.

## Environment

| Setting | Value |
|---------|-------|
| CNPG operator | 1.28.1 |
| PostgreSQL image | `ghcr.io/cloudnative-pg/postgresql:18.1-system-trixie` (includes pgvector as `vector.so`) |
| Fast storage | `nvme` (NFS-NVMe) |
| Standard storage | `ssd` (NFS-SSD) |
| S3 endpoint | `https://s3.ctz.fyi` |
| S3 bucket | `cnpg-backups` |
| Secrets backend | External Secrets Operator → ClusterSecretStore `openbao` |
| OpenBao path | `secret/production/<namespace>/<cluster-name>` |

## Sync-Wave Order (Critical)

| Wave | Resource |
|------|----------|
| `-2` | CNPG `Cluster` |
| `-1` | `ExternalSecret` for DB credentials |
| `0` | App `Deployment` |

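ArgoCD applies sync-waves in ascending numeric order, so the most negative wave is applied first. As a quick sanity check of an intended ordering, the sketch below sorts `wave resource` pairs the same way. The `wave_order` helper name is hypothetical, and the input is simply the table above typed out by hand.

```bash
# Hypothetical helper: read "wave resource" pairs on stdin and print them
# in ascending wave order (lowest wave first), the order ArgoCD applies them.
wave_order() {
  sort -n -k1,1
}

# Example:
#   printf '%s\n' '0 Deployment' '-2 Cluster' '-1 ExternalSecret' | wave_order
```
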
## Step 1: Write Secrets to OpenBao

Do this **before** deploying anything:

```bash
bao kv put secret/production/<namespace>/<app>-db \
  username=<app> \
  password="$(openssl rand -base64 32 | tr -d '/=+' | head -c 32)"
```

Also create the backup credentials secret once per namespace:

```bash
bao kv put secret/production/<namespace>/cnpg-backup-s3-credentials \
  ACCESS_KEY_ID=<key> \
  ACCESS_SECRET_KEY=<secret>
```

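The two parts of this step that are easiest to get wrong are the path layout and the password alphabet. The sketch below builds the OpenBao path from a namespace and app name and generates a 32-character alphanumeric password. `bao_db_path` and `gen_db_password` are hypothetical helper names, and reading from `/dev/urandom` is a swapped-in alternative to the `openssl` pipeline, not the method used above.

```bash
# Hypothetical helpers for writing DB credentials; names are illustrative only.

# Build the OpenBao KV path for an app's database credentials,
# following the secret/production/<namespace>/<app>-db layout.
# Arguments: namespace app
bao_db_path() {
  printf 'secret/production/%s/%s-db\n' "$1" "$2"
}

# Generate a 32-character password containing only letters and digits.
# Assumption: /dev/urandom exists (Linux, macOS). 1024 random bytes leave
# roughly 250 alphanumeric characters on average, far more than the 32 kept.
gen_db_password() {
  head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-32
}

# Example (not run here):
#   bao kv put "$(bao_db_path myapp myapp)" username=myapp password="$(gen_db_password)"
```

Keeping the password alphanumeric also means it can be placed in a connection URI without percent-encoding.
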
## Step 2: ExternalSecret (sync-wave -1)

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: <app>-db-credentials
  namespace: <app>
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: openbao
    kind: ClusterSecretStore
  target:
    name: <app>-db-credentials
    creationPolicy: Owner
  data:
    - secretKey: username
      remoteRef:
        key: secret/production/<namespace>/<app>-db
        property: username
    - secretKey: password
      remoteRef:
        key: secret/production/<namespace>/<app>-db
        property: password
```

## Step 3: CNPG Cluster (sync-wave -2)

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <app>-db
  namespace: <app>
  annotations:
    argocd.argoproj.io/sync-wave: "-2"
spec:
  instances: 3  # Use 1 for dev/small workloads
  imageName: ghcr.io/cloudnative-pg/postgresql:18.1-system-trixie

  storage:
    size: 10Gi
    storageClass: nvme  # or ssd

  bootstrap:
    initdb:
      database: <app>
      owner: <app>
      secret:
        name: <app>-db-credentials  # MUST have keys 'username' and 'password' exactly

  backup:
    barmanObjectStore:
      destinationPath: s3://cnpg-backups/<app>
      endpointURL: https://s3.ctz.fyi
      s3Credentials:
        accessKeyId:
          name: cnpg-backup-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-backup-s3-credentials
          key: ACCESS_SECRET_KEY
    retentionPolicy: "30d"
```

## CRITICAL: Secret Key Names

> **The bootstrap secret MUST have keys named exactly `username` and `password`.**
> CNPG will appear healthy but the app cannot connect if keys are wrong (e.g., `user`, `pass`, `POSTGRES_USER`).
> CNPG does NOT create a separate `-app` secret when `bootstrap.initdb.secret` is provided.

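Because a wrong key name fails quietly, it can help to check the synced Secret before the app is deployed. The sketch below reads key names on stdin and fails unless both `username` and `password` are present. `has_cnpg_keys` is a hypothetical helper name, and the `kubectl` invocation in the comment is one possible way to list the keys.

```bash
# Hypothetical check: read secret key names on stdin (one per line) and
# fail unless both 'username' and 'password' are present as exact matches.
has_cnpg_keys() {
  keys=$(cat)
  for want in username password; do
    printf '%s\n' "$keys" | grep -qx "$want" || {
      echo "missing key: $want" >&2
      return 1
    }
  done
}

# Example (not run here): list the key names of the synced Secret.
#   kubectl get secret <app>-db-credentials -n <namespace> \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' | has_cnpg_keys
```
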
## Connecting from the App

CNPG auto-creates these services:

| Service | Use |
|---------|-----|
| `<cluster>-rw` | Read-write (primary); **use this for app writes** |
| `<cluster>-ro` | Read-only (replicas); use for read-heavy queries |
| `<cluster>-r` | Any instance |

```
postgresql://<username>:<password>@<app>-db-rw.<namespace>.svc.cluster.local:5432/<database>
```

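The connection string can be assembled from its parts, which makes it harder to point at the wrong service by accident. This is a minimal sketch; `cnpg_rw_uri` is a hypothetical helper name, and it assumes the password needs no percent-encoding (for example, an alphanumeric one).

```bash
# Hypothetical helper: build the read-write connection URI for a cluster.
# Arguments: username password cluster namespace database
# Assumption: the password is URL-safe; other characters would need
# percent-encoding before being placed in the URI.
cnpg_rw_uri() {
  printf 'postgresql://%s:%s@%s-rw.%s.svc.cluster.local:5432/%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Example:
#   cnpg_rw_uri myapp "$DB_PASSWORD" myapp-db myapp myapp
```
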
## Manual Database Access

```bash
# psql on instance 1 (the initial primary; the primary can change after a failover)
kubectl exec -n <namespace> -it <cluster>-1 -- psql -U <username> <database>

# via cnpg plugin
kubectl cnpg psql <cluster> -n <namespace>

# pg_dump
kubectl exec -n <namespace> <cluster>-1 -- \
  pg_dump -U <username> <database> > dump.sql

# restore
kubectl exec -n <namespace> -i <cluster>-1 -- \
  psql -U <username> <database> < dump.sql
```

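The `<cluster>-1` pod is only the primary until a failover or switchover happens. One way to avoid guessing is to read the current primary from the `Cluster` status, as sketched below. `cnpg_primary` is a hypothetical helper name; it relies on the `.status.currentPrimary` field of the CNPG `Cluster` resource.

```bash
# Hypothetical helper: print the pod name of the current primary.
# Arguments: namespace cluster
# Reads .status.currentPrimary from the CNPG Cluster resource.
cnpg_primary() {
  kubectl get clusters.postgresql.cnpg.io "$2" -n "$1" \
    -o jsonpath='{.status.currentPrimary}'
}

# Example (not run here):
#   kubectl exec -n <namespace> -it "$(cnpg_primary <namespace> <cluster>)" -- \
#     psql -U <username> <database>
```
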
## Migrating from Docker/External Postgres

```bash
# 1. Dump from source
pg_dump -h <old-host> -U <user> <database> > dump.sql

# 2. Copy into pod
kubectl cp dump.sql <namespace>/<pod>:/tmp/dump.sql

# 3. Restore
kubectl exec -n <namespace> -it <pod> -- \
  psql -U <username> <database> -f /tmp/dump.sql
```

## Scheduled Backups (Optional)

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: <app>-db-backup
  namespace: <app>
spec:
  schedule: "0 0 2 * * *"  # 2am daily (six fields: seconds come first)
  backupOwnerReference: self
  cluster:
    name: <app>-db
```

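CNPG's `schedule` field takes a six-field cron expression with a leading seconds field, unlike the five-field Unix crontab format. The sketch below converts a familiar five-field expression by prepending `0` seconds. `to_cnpg_schedule` is a hypothetical helper name, and it only counts fields; it does not validate the values.

```bash
# Hypothetical helper: convert a cron expression to the six-field form
# (seconds first). Five-field input gets "0" seconds prepended; six-field
# input is passed through unchanged. Field values are not validated.
to_cnpg_schedule() {
  fields=$(printf '%s\n' "$1" | awk '{ print NF }')
  case "$fields" in
    5) printf '0 %s\n' "$1" ;;
    6) printf '%s\n' "$1" ;;
    *) echo "expected 5 or 6 cron fields, got $fields" >&2; return 1 ;;
  esac
}

# Example:
#   to_cnpg_schedule '0 2 * * *'
```

Always quote the expression when calling the helper so the shell does not expand the `*` characters.
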
## Common Issues

| Symptom | Cause | Fix |
|---------|-------|-----|
| Cluster stuck at "Setting up primary" | Secret missing or wrong key names | Check `<app>-db-credentials` exists and has `username`/`password` keys |
| Pod in `Pending` | PVC can't provision | Check `nvme`/`ssd` NFS provisioner is healthy |
| App can't connect | Using pod IP or wrong service | Use `<cluster>-rw` service, not pod IP |
| 2/3 instances after node failure | Normal self-healing | Wait; CNPG will recover automatically |
| Stale data after cluster recreation | Old PVCs still present | Delete PVCs manually before clean redeploy |

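For the stale-data case, it helps to list exactly which PVCs belong to a cluster before deleting anything. The sketch below is one way to do that. `cnpg_pvcs` is a hypothetical helper name, and it assumes the PVCs carry the `cnpg.io/cluster=<cluster>` label.

```bash
# Hypothetical helper: list the PVCs that belong to a CNPG cluster.
# Arguments: namespace cluster
# Assumption: the PVCs carry the cnpg.io/cluster=<cluster> label.
cnpg_pvcs() {
  kubectl get pvc -n "$1" -l "cnpg.io/cluster=$2" -o name
}

# Example (not run here): review the list first, then delete.
#   cnpg_pvcs <namespace> <cluster>
#   cnpg_pvcs <namespace> <cluster> | xargs kubectl delete -n <namespace>
```
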