2026-05-30 15:43:14 -07:00


name	description
investigating-cluster-issue	Use when debugging Kubernetes issues on Zoe's homelab k3s cluster (k3s v1.35, Cilium, Traefik, ArgoCD, OpenBao, Grafana stack) or on AWS EKS clusters — pod failures, sync errors, networking problems, storage issues, node failures, or any unexpected cluster behavior.

Investigating Cluster Issues

Overview

Systematic triage for Kubernetes problems. Always run Level 1 first to establish ground truth before narrowing down. Resist the urge to jump straight to logs — node and pod status often reveals the real problem faster.

Environment Reference

k3s homelab:

Nodes: master-01/02/03, worker-01/02, gpu-node
CNI: Cilium | Ingress: Traefik | GitOps: ArgoCD (argocd.ctz.fyi)
Secrets: External Secrets Operator + OpenBao (bao.ctz.fyi)
Monitoring: Grafana (grafana.monitoring.ctz.fyi) — Mimir, Loki, Tempo
Storage: ssd (NFS), local-path
Registry: Harbor (registry.ctz.fyi)
Key namespaces: argocd, monitoring, keycloak, external-secrets, cert-manager, traefik, openbao

EKS:

Addons: aws-load-balancer-controller, external-dns, cluster-autoscaler, kube-prometheus-stack
Storage: EBS CSI (gp3 preferred), EFS for shared
Auth: IRSA for pod AWS access
Networking: aws-vpc-cni or Cilium + Calico network policies

Quick Reference: Symptom → First Command

Symptom	First command
Pod stuck `Pending`	`kubectl describe pod <pod> -n <ns>` → check Events
`CrashLoopBackOff`	`kubectl logs <pod> -n <ns> --previous`
`ImagePullBackOff`	`kubectl describe pod <pod> -n <ns>` → check image + secret
Secret not available	`kubectl get externalsecret -n <ns>`
ArgoCD sync failing	`kubectl get application <name> -n argocd -o yaml` → `.status.conditions`
TLS cert not issuing	`kubectl get certificate -n <ns>`
Node not Ready	`kubectl describe node <name>` → Events + Conditions
EKS ALB not creating	`kubectl describe ingress <name> -n <ns>` → check controller logs
Cluster-wide chaos	`kubectl get events -A --sort-by='.lastTimestamp' | tail -30`
Not sure where to start	Run all three Level 1 commands

Level 1 — Immediate Triage (always run first)

kubectl get nodes -o wide
kubectl get pods -A | grep -Ev '(Running|Completed)'
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

Read the events output carefully — it frequently names the exact problem.
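
The `grep -Ev` filter above drops every pod whose status is `Running`, so a pod that is failing readiness (`0/1 Running`) is hidden. A minimal sketch of a stricter filter, assuming the default `kubectl get pods -A` column layout (READY in column 3, STATUS in column 4). The function name `unhealthy_pods` is hypothetical:

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and keeps
# the header plus any pod that is not fully ready.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    NR == 1           { print; next }   # keep the header row
    $4 == "Completed" { next }          # finished Jobs are expected to show 0/1
    {
      split($3, ready, "/")             # READY looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}

# Usage: kubectl get pods -A | unhealthy_pods
```

If custom output columns are in use, the column numbers need adjusting.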

Level 2 — Narrow to Failing Resource

kubectl describe pod <name> -n <ns>        # Events section is the most useful part
kubectl logs <pod> -n <ns> --previous      # If pod restarted
kubectl logs <pod> -n <ns> -c <container>  # Multi-container pods

Level 3 — Root Causes by Symptom

Pod stuck `Pending`

Check describe events for FailedScheduling — resource constraints, taints/tolerations, affinity rules
Check PVCs: kubectl get pvc -n <ns>
- k3s: If PVC Pending, check NFS provisioner: kubectl get pods -n nfs-provisioner
- EKS: Check EBS CSI driver: kubectl get pods -n kube-system -l app=ebs-csi-controller; verify IRSA annotation on ServiceAccount
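
In a namespace with many claims, the PVC check above is quicker with the healthy ones filtered out. A small sketch, assuming the default `kubectl get pvc -n <ns>` layout where STATUS is column 2. The name `unbound_pvcs` is hypothetical:

```shell
# Hypothetical helper: reads `kubectl get pvc -n <ns>` output on stdin and
# keeps the header plus every claim whose STATUS (column 2) is not Bound.
unbound_pvcs() {
  awk 'NR == 1 || $2 != "Bound"'
}

# Usage: kubectl get pvc -n <ns> | unbound_pvcs
```

With `-A` the namespace becomes column 1 and STATUS moves to column 3, so this version is for single-namespace output only.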

`CrashLoopBackOff`

kubectl logs <pod> -n <ns> --previous: look for panic, missing env var, missing file, bad config
Check ExternalSecret synced: kubectl get externalsecret -n <ns> — SecretSyncedError is common
Check dependent services (DB, cache, upstream API)
k3s ArgoCD: Check sync-wave ordering — ExternalSecret must have lower wave number than Deployment
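
The sync-wave rule in the last item can be checked mechanically. A sketch under these assumptions: the input is three whitespace-separated columns (kind, name, wave), a missing annotation prints as `<none>` and counts as wave 0, and only `ExternalSecret` and `Deployment` rows matter. The name `check_wave_order` is hypothetical:

```shell
# Hypothetical helper: stdin is "KIND NAME WAVE" rows. Exits non-zero if any
# ExternalSecret wave is not strictly below every Deployment wave.
check_wave_order() {
  awk '
    $1 == "KIND" || NF == 0 { next }            # skip header rows and blank lines
    { wave = ($3 == "<none>") ? 0 : $3 + 0 }    # no annotation is treated as wave 0
    $1 == "ExternalSecret" && (!es  || wave > max_es)  { max_es  = wave; es  = 1 }
    $1 == "Deployment"     && (!dep || wave < min_dep) { min_dep = wave; dep = 1 }
    END {
      if (es && dep && max_es >= min_dep) {
        printf "wave order problem: ExternalSecret wave %d is not below Deployment wave %d\n", max_es, min_dep
        exit 1
      }
      print "wave order ok"
    }
  '
}

# Usage:
#   kubectl get externalsecret,deployment -n <ns> \
#     -o custom-columns='KIND:.kind,NAME:.metadata.name,WAVE:.metadata.annotations.argocd\.argoproj\.io/sync-wave' \
#     | check_wave_order
```

This reads the live objects, so it only reflects what has already been applied. The manifests in Git remain the source of truth.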

ArgoCD sync failing (k3s)

kubectl get application <name> -n argocd -o yaml   # .status.conditions
kubectl get application <name> -n argocd -o jsonpath='{.status.operationState.message}'

OutOfSync on immutable field → manually delete the resource, then re-sync
ExternalSecret missing → check OpenBao (see below)

Force refresh without sync: ArgoCD UI → hard refresh, or:

kubectl annotate application <name> -n argocd argocd.argoproj.io/refresh=hard

External Secrets not syncing

kubectl describe externalsecret <name> -n <ns>     # .status.conditions
kubectl get clustersecretstore openbao -o yaml     # check Ready condition
kubectl exec -n openbao openbao-0 -- bao status    # check sealed/unsealed

OpenBao sealed: Normally auto-unseals via OCI KMS. If stuck:

kubectl exec -it -n openbao openbao-0 -- bao operator unseal   # -it: the command prompts for the key

ClusterSecretStore not Ready: Check the ESO controller logs:

kubectl logs -n external-secrets deploy/external-secrets -f
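
To see every ExternalSecret that is not currently synced, the Ready condition can be pulled out with JSONPath. A minimal sketch, assuming the standard `Ready` condition type on ExternalSecret status. The function name `unsynced_externalsecrets` is hypothetical:

```shell
# Hypothetical helper: lists ExternalSecrets cluster-wide whose Ready condition
# is anything other than "True", with the reported reason.
unsynced_externalsecrets() {
  kubectl get externalsecret -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{end}' |
    awk -F '\t' '$3 != "True" { print $1 "/" $2 ": " ($4 == "" ? "no Ready condition yet" : $4) }'
}

# Usage: unsynced_externalsecrets
```

An empty result means every ExternalSecret reports Ready.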

`ImagePullBackOff`

kubectl describe pod <name> -n <ns>   # look for "401 Unauthorized" or "not found"

Wrong image tag → fix in manifest/values
Missing imagePullSecret → verify secret exists: kubectl get secret -n <ns>
k3s Harbor auth: Ensure secret references registry.ctz.fyi and is attached to ServiceAccount or pod spec
Registry unreachable → check Harbor pod health: kubectl get pods -n harbor
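
To confirm that a pull secret actually targets Harbor, decode it and look for the registry host. A rough sketch, assuming a `kubernetes.io/dockerconfigjson` secret. It is a plain substring check, not a JSON parse, and the name `secret_mentions_registry` is hypothetical:

```shell
# Hypothetical helper: stdin is a decoded .dockerconfigjson, $1 is the registry
# host. Succeeds if the host string appears anywhere in the input.
secret_mentions_registry() {
  grep -qF -- "$1"
}

# Usage:
#   kubectl get secret <secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' \
#     | base64 -d | secret_mentions_registry registry.ctz.fyi \
#     && echo "registry present" || echo "registry missing"
```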

IngressRoute / TLS not working (k3s)

kubectl get certificate -n <ns>                   # Ready=False = problem
kubectl describe certificate <name> -n <ns>       # check Events
kubectl get ingressroute -n <ns>
kubectl get ingress -n <ns>                       # cert-manager needs a standard Ingress to issue

cert-manager needs a standard Ingress resource alongside IngressRoute — if missing, cert won't issue
Check Traefik pods: kubectl get pods -n traefik

EKS — Node not joining

kubectl get configmap aws-auth -n kube-system -o yaml    # verify node IAM role mapped
# On the node:
journalctl -u kubelet -n 100

Check security groups: nodes need port 443 outbound to control plane endpoint
Check node IAM role has AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly

EKS — ALB/NLB not creating

kubectl describe ingress <name> -n <ns>
kubectl logs -n kube-system deploy/aws-load-balancer-controller | tail -50

Verify the ingress class: spec.ingressClassName: alb, or the older kubernetes.io/ingress.class: alb annotation
Check IRSA: ServiceAccount must have eks.amazonaws.com/role-arn annotation
Check controller has correct IAM permissions (policy document)
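
The IRSA check above comes down to one annotation lookup. A sketch, assuming the controller's ServiceAccount is named `aws-load-balancer-controller` in `kube-system` (a common default, not confirmed for this cluster). The function name `check_irsa` is hypothetical:

```shell
# Hypothetical helper: prints the IRSA role ARN for a ServiceAccount, or
# returns non-zero if the annotation is missing.
# $1 = ServiceAccount name, $2 = namespace
check_irsa() {
  arn=$(kubectl get serviceaccount "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
  if [ -z "$arn" ]; then
    echo "no role-arn annotation on $2/$1"
    return 1
  fi
  echo "$2/$1 -> $arn"
}

# Usage: check_irsa aws-load-balancer-controller kube-system
```

The same lookup applies to any other ServiceAccount that relies on IRSA, such as the EBS CSI controller's.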

Level 4 — System-Level Checks

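# k3s control plane (componentstatuses is deprecated since Kubernetes v1.19; /readyz is the supported health check)
kubectl get --raw='/readyz?verbose'
kubectl get componentstatuses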
# On master nodes:
systemctl status k3s

# Cilium (k3s)
kubectl -n kube-system exec ds/cilium -- cilium status
kubectl -n kube-system get pods -l k8s-app=cilium

# Resource pressure (both environments)
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20

# EKS cluster info
aws eks describe-cluster --name <cluster> --region <region>
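
For the resource-pressure check, a threshold filter saves scanning the table by eye. A sketch, assuming the usual `kubectl top nodes` layout with memory percentage in column 5. The 85% default and the name `hot_nodes` are arbitrary choices for illustration:

```shell
# Hypothetical helper: stdin is `kubectl top nodes` output; prints nodes whose
# memory usage is at or above the threshold.
# $1 = threshold in percent (default 85)
hot_nodes() {
  awk -v limit="${1:-85}" '
    NR == 1 { next }                      # skip the header row
    { pct = $5; sub(/%/, "", pct) }       # "91%" becomes "91"
    pct + 0 >= limit + 0 { print $1, $5 }
  '
}

# Usage: kubectl top nodes | hot_nodes 85
```

Nodes reporting `<unknown>` (metrics not yet available) evaluate to 0 and are not listed, so they need a separate look.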

Level 5 — Logs via Grafana (k3s)

Grafana: grafana.monitoring.ctz.fyi

Loki log queries:

{namespace="<ns>"}
{namespace="<ns>", app="<name>"} |= "error"
{namespace="<ns>"} | logfmt | level="error"

Mimir (metrics): Check CPU/memory graphs around the time of failure — spikes often correlate with OOMKills or throttling that don't appear in kubectl describe.

Live Debugging Inside a Container

kubectl exec -it <pod> -n <ns> -- /bin/sh
# or if bash available:
kubectl exec -it <pod> -n <ns> -- bash
# multi-container:
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/sh

Use for: verifying env vars, testing connectivity (curl, wget, nslookup), checking mounted files.
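
If the image has no shell at all (distroless or scratch), `kubectl exec` fails. An ephemeral debug container is one way around that. A sketch, assuming ephemeral containers are enabled on the cluster. The `busybox:1.36` image and the name `debug_shell` are illustrative choices:

```shell
# Hypothetical helper: attaches a throwaway busybox container to a running pod,
# sharing the process namespace of the target container.
# $1 = pod, $2 = namespace, $3 = target container name
debug_shell() {
  kubectl debug -it "$1" -n "$2" --image=busybox:1.36 --target="$3" -- sh
}

# Usage: debug_shell <pod> <ns> <container>
```

The debug container shares the pod's network, so connectivity tests behave as they would from the application container. Mounted files are not shared automatically.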

Restart vs Dig Deeper

Restart first when:

Pod is in unknown/evicted state with no clear cause
You've already identified the root cause and fixed it
OOMKilled and you're about to bump memory limits

Dig deeper first when:

CrashLoopBackOff with no obvious cause (logs will be lost on restart)
Data loss risk
Same pod keeps restarting after restart → there's a real problem, not a transient one
Multiple pods affected → likely systemic, not pod-specific
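
Because logs are lost once a pod is replaced, it helps to capture them before any restart. A minimal sketch that saves the describe output and both current and previous logs to files. The file naming and the function name `save_pod_evidence` are hypothetical:

```shell
# Hypothetical helper: snapshots a pod's state to files before a restart.
# $1 = pod, $2 = namespace, $3 = output directory (default: current directory)
save_pod_evidence() {
  dir="${3:-.}"
  mkdir -p "$dir" || return 1
  kubectl describe pod "$1" -n "$2" > "$dir/$1.describe.txt"
  kubectl logs "$1" -n "$2" --all-containers > "$dir/$1.current.log"
  # --previous fails when no container has restarted yet; that is fine here.
  kubectl logs "$1" -n "$2" --all-containers --previous > "$dir/$1.previous.log" 2>/dev/null || true
}

# Usage: save_pod_evidence <pod> <ns> ./evidence
```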

Never restart ArgoCD-managed resources directly — ArgoCD will re-sync to desired state. Fix the underlying cause (secret, config, image) and let ArgoCD reconcile, or trigger a manual sync.
