| name | description |
|---|---|
| investigating-cluster-issue | Use when debugging Kubernetes issues on Zoe's homelab k3s cluster (k3s v1.35, Cilium, Traefik, ArgoCD, OpenBao, Grafana stack) or on AWS EKS clusters — pod failures, sync errors, networking problems, storage issues, node failures, or any unexpected cluster behavior. |
Investigating Cluster Issues
Overview
Systematic triage for Kubernetes problems. Always run Level 1 first to establish ground truth before narrowing down. Resist the urge to jump straight to logs — node and pod status often reveals the real problem faster.
Environment Reference
k3s homelab:
- Nodes: master-01/02/03, worker-01/02, gpu-node
- CNI: Cilium | Ingress: Traefik | GitOps: ArgoCD (argocd.ctz.fyi)
- Secrets: External Secrets Operator + OpenBao (bao.ctz.fyi)
- Monitoring: Grafana (grafana.monitoring.ctz.fyi) with Mimir, Loki, Tempo
- Storage: ssd (NFS), local-path
- Registry: Harbor (registry.ctz.fyi)
- Key namespaces: argocd, monitoring, keycloak, external-secrets, cert-manager, traefik, openbao
EKS:
- Addons: aws-load-balancer-controller, external-dns, cluster-autoscaler, kube-prometheus-stack
- Storage: EBS CSI (gp3 preferred), EFS for shared
- Auth: IRSA for pod AWS access
- Networking: aws-vpc-cni or Cilium + Calico network policies
Quick Reference: Symptom → First Command
| Symptom | First command |
|---|---|
| Pod stuck Pending | kubectl describe pod <pod> -n <ns> → check Events |
| CrashLoopBackOff | kubectl logs <pod> -n <ns> --previous |
| ImagePullBackOff | kubectl describe pod <pod> -n <ns> → check image + secret |
| Secret not available | kubectl get externalsecret -n <ns> |
| ArgoCD sync failing | kubectl get application <name> -n argocd -o yaml → .status.conditions |
| TLS cert not issuing | kubectl get certificate -n <ns> |
| Node not Ready | kubectl describe node <name> → Events + Conditions |
| EKS ALB not creating | kubectl describe ingress <name> -n <ns> → check controller logs |
| Cluster-wide chaos | kubectl get events -A --sort-by='.lastTimestamp' \| tail -30 |
| Not sure where to start | Run all three Level 1 commands |
Level 1 — Immediate Triage (always run first)
kubectl get nodes -o wide
kubectl get pods -A | grep -Ev '(Running|Completed)'
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
Read the events output carefully — it frequently names the exact problem.
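The three Level 1 commands can be wrapped in one function so they always run together. This is a minimal sketch, assuming kubectl is on PATH and already pointed at the right context; the function name `triage_level1` is hypothetical, and `--no-headers` is added so the header row does not show up as a false "unhealthy" line.

```shell
# Level 1 triage in one call (hypothetical helper, not part of any tool).
triage_level1() {
  echo "== Nodes =="
  kubectl get nodes -o wide

  echo "== Pods not Running/Completed =="
  # grep exits non-zero when every pod is healthy; print a marker instead.
  kubectl get pods -A --no-headers | grep -Ev '(Running|Completed)' || echo "(none)"

  echo "== Last 30 events =="
  kubectl get events -A --sort-by='.lastTimestamp' | tail -30
}

# usage: triage_level1 | tee triage-$(date +%s).txt
```

Saving the output to a file keeps a snapshot of cluster state from before any fix is attempted.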
Level 2 — Narrow to Failing Resource
kubectl describe pod <name> -n <ns> # Events section is the most useful part
kubectl logs <pod> -n <ns> --previous # If pod restarted
kubectl logs <pod> -n <ns> -c <container> # Multi-container pods
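To decide which pods are worth a `--previous` log pull, it can help to list only those that have actually restarted. The sketch below filters `kubectl get pods -A --no-headers` output read from stdin; it assumes the default column order (NAMESPACE NAME READY STATUS RESTARTS AGE), and `restarted_pods` is a hypothetical name.

```shell
# Print "<namespace>/<pod> restarts=<n>" for pods with at least <min> restarts.
# Reads the table from stdin, so it also works on saved output.
restarted_pods() {
  min="${1:-1}"
  # RESTARTS may render as "5 (2m ago)"; the count is still field 5.
  awk -v min="$min" '$5 + 0 >= min { print $1 "/" $2 " restarts=" $5 }'
}

# usage: kubectl get pods -A --no-headers | restarted_pods 3
```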
Level 3 — Root Causes by Symptom
Pod stuck Pending
- Check describe events for FailedScheduling: resource constraints, taints/tolerations, affinity rules
- Check PVCs: kubectl get pvc -n <ns>
  - k3s: If PVC Pending, check NFS provisioner: kubectl get pods -n nfs-provisioner
  - EKS: Check EBS CSI driver: kubectl get pods -n kube-system -l app=ebs-csi-controller; verify IRSA annotation on ServiceAccount
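The PVC check above can be narrowed to just the claims that are stuck. A minimal sketch, assuming the default `kubectl get pvc` column order (NAME STATUS VOLUME ...); `pending_pvcs` is a hypothetical helper name.

```shell
# Print the names of PVCs in Pending state in one namespace.
pending_pvcs() {
  ns="${1:?usage: pending_pvcs <namespace>}"
  kubectl get pvc -n "$ns" --no-headers | awk '$2 == "Pending" { print $1 }'
}

# usage: pending_pvcs monitoring
```

An empty result means storage is probably not the reason the pod is unschedulable, so the FailedScheduling events are the next place to look.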
CrashLoopBackOff
- kubectl logs <pod> --previous: look for panic, missing env var, missing file, bad config
- Check ExternalSecret synced: kubectl get externalsecret -n <ns> (SecretSyncedError is common)
- Check dependent services (DB, cache, upstream API)
- k3s ArgoCD: Check sync-wave ordering — ExternalSecret must have lower wave number than Deployment
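The sync-wave ordering rule can be checked against live resources. The sketch below reads the `argocd.argoproj.io/sync-wave` annotation with kubectl's jsonpath output and treats a missing annotation as wave 0 (ArgoCD's default). The function names and the example resource names are hypothetical.

```shell
# Print the sync-wave of a resource; an unset annotation counts as wave 0.
sync_wave() {
  kind="$1"; name="$2"; ns="$3"
  wave="$(kubectl get "$kind" "$name" -n "$ns" \
    -o jsonpath='{.metadata.annotations.argocd\.argoproj\.io/sync-wave}')"
  echo "${wave:-0}"
}

# Succeeds only when the ExternalSecret syncs in an earlier wave than the Deployment.
# Arguments: <externalsecret-name> <deployment-name> <namespace>
secret_before_deploy() {
  es_wave="$(sync_wave externalsecret "$1" "$3")"
  dep_wave="$(sync_wave deployment "$2" "$3")"
  [ "$es_wave" -lt "$dep_wave" ]
}

# usage: secret_before_deploy app-secrets app demo || echo "fix sync-wave ordering"
```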
ArgoCD sync failing (k3s)
kubectl get application <name> -n argocd -o yaml # .status.conditions
kubectl get application <name> -n argocd -o jsonpath='{.status.operationState.message}'
- OutOfSync on immutable field → manually delete the resource, then re-sync
- ExternalSecret missing → check OpenBao (see below)
- Force refresh without sync: ArgoCD UI → hard refresh, or:
kubectl annotate application <name> -n argocd argocd.argoproj.io/refresh=hard
External Secrets not syncing
kubectl describe externalsecret <name> -n <ns> # .status.conditions
kubectl get clustersecretstore openbao -o yaml # check Ready condition
kubectl exec -n openbao openbao-0 -- bao status # check sealed/unsealed
- OpenBao sealed: Normally auto-unseals via OCI KMS. If stuck: kubectl exec -n openbao openbao-0 -- bao operator unseal
- ClusterSecretStore not Ready: Check the ESO controller logs: kubectl logs -n external-secrets deploy/external-secrets -f
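Instead of reading the full describe output, the Ready condition of an ExternalSecret can be pulled out directly. A minimal sketch using kubectl's jsonpath filter syntax; both function names are hypothetical.

```shell
# Print "<status> <reason>" of the Ready condition, e.g. "False SecretSyncedError".
externalsecret_ready() {
  name="${1:?usage: externalsecret_ready <name> <namespace>}"
  ns="${2:?usage: externalsecret_ready <name> <namespace>}"
  kubectl get externalsecret "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{" "}{.status.conditions[?(@.type=="Ready")].reason}'
}

# Succeeds only when the Ready condition is True.
externalsecret_is_ready() {
  case "$(externalsecret_ready "$1" "$2")" in
    True*) return 0 ;;
    *) return 1 ;;
  esac
}

# usage: externalsecret_is_ready app-secrets demo || externalsecret_ready app-secrets demo
```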
ImagePullBackOff
kubectl describe pod <name> -n <ns> # look for "401 Unauthorized" or "not found"
- Wrong image tag → fix in manifest/values
- Missing imagePullSecret → verify secret exists: kubectl get secret -n <ns>
- k3s Harbor auth: Ensure secret references registry.ctz.fyi and is attached to ServiceAccount or pod spec
- Registry unreachable → check Harbor pod health: kubectl get pods -n harbor
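For multi-container pods it can be quicker to list only the containers whose image pull is failing, together with the image they requested. The following is a sketch under the assumption that only regular containers matter (init containers live under `.status.initContainerStatuses` and are not covered); `pull_failures` is a hypothetical name.

```shell
# Print "<container>: <image>" for containers waiting on a failed image pull.
pull_failures() {
  pod="${1:?usage: pull_failures <pod> <namespace>}"
  ns="${2:?usage: pull_failures <pod> <namespace>}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.image}{"\t"}{.state.waiting.reason}{"\n"}{end}' |
    awk -F'\t' '$3 == "ImagePullBackOff" || $3 == "ErrImagePull" { print $1 ": " $2 }'
}

# usage: pull_failures <pod> <ns>
```

The printed image reference is the thing to compare against what actually exists in the registry.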
IngressRoute / TLS not working (k3s)
kubectl get certificate -n <ns> # Ready=False = problem
kubectl describe certificate <name> -n <ns> # check Events
kubectl get ingressroute -n <ns>
kubectl get ingress -n <ns> # cert-manager needs a standard Ingress to issue
- cert-manager needs a standard Ingress resource alongside IngressRoute; if missing, cert won't issue
- Check Traefik pods: kubectl get pods -n traefik
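When a namespace holds many certificates, a small filter can reduce the listing to the ones that need attention. This sketch assumes the default cert-manager column order (NAME READY SECRET AGE) for a single-namespace listing; with `-A` the columns shift by one. `not_ready_certs` is a hypothetical name.

```shell
# Read `kubectl get certificate -n <ns>` output on stdin and print the names
# of certificates whose READY column is anything other than True.
not_ready_certs() {
  awk 'NR > 1 && $2 != "True" { print $1 }'
}

# usage: kubectl get certificate -n <ns> | not_ready_certs
```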
EKS — Node not joining
kubectl get configmap aws-auth -n kube-system -o yaml # verify node IAM role mapped
# On the node:
journalctl -u kubelet -n 100
- Check security groups: nodes need port 443 outbound to control plane endpoint
- Check node IAM role has AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly
EKS — ALB/NLB not creating
kubectl describe ingress <name> -n <ns>
kubectl logs -n kube-system deploy/aws-load-balancer-controller | tail -50
- Verify annotations: kubernetes.io/ingress.class: alb
- Check IRSA: ServiceAccount must have eks.amazonaws.com/role-arn annotation
- Check controller has correct IAM permissions (policy document)
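The IRSA check can be scripted by reading the `eks.amazonaws.com/role-arn` annotation from the ServiceAccount. A minimal sketch; the function names are hypothetical, and the ServiceAccount name in the usage line assumes the controller was installed under its conventional name in kube-system.

```shell
# Print the IAM role ARN bound to a ServiceAccount via IRSA (empty if none).
irsa_role_arn() {
  sa="${1:?usage: irsa_role_arn <serviceaccount> <namespace>}"
  ns="${2:?usage: irsa_role_arn <serviceaccount> <namespace>}"
  kubectl get serviceaccount "$sa" -n "$ns" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}

# Succeeds only when the annotation is present and non-empty.
has_irsa() {
  [ -n "$(irsa_role_arn "$1" "$2")" ]
}

# usage (ServiceAccount name is an assumption):
#   has_irsa aws-load-balancer-controller kube-system || echo "IRSA annotation missing"
```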
Level 4 — System-Level Checks
# k3s control plane
kubectl get componentstatuses # deprecated since Kubernetes v1.19; treat output as a rough hint
# On master nodes:
systemctl status k3s
# Cilium (k3s)
kubectl -n kube-system exec ds/cilium -- cilium status
kubectl -n kube-system get pods -l k8s-app=cilium
# Resource pressure (both environments)
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20
# EKS cluster info
aws eks describe-cluster --name <cluster> --region <region>
Level 5 — Logs via Grafana (k3s)
Grafana: grafana.monitoring.ctz.fyi
Loki log queries:
{namespace="<ns>"}
{namespace="<ns>", app="<name>"} |= "error"
{namespace="<ns>"} | logfmt | level="error"
Mimir (metrics): Check CPU/memory graphs around the time of failure — spikes often correlate with OOMKills or throttling that don't appear in kubectl describe.
Live Debugging Inside a Container
kubectl exec -it <pod> -n <ns> -- /bin/sh
# or if bash available:
kubectl exec -it <pod> -n <ns> -- bash
# multi-container:
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/sh
Use for: verifying env vars, testing connectivity (curl, wget, nslookup), checking mounted files.
Restart vs Dig Deeper
Restart first when:
- Pod is in unknown/evicted state with no clear cause
- You've already identified the root cause and fixed it
- OOMKilled and you're about to bump memory limits
Dig deeper first when:
- CrashLoopBackOff with no obvious cause (logs will be lost on restart)
- Data loss risk
- Same pod keeps restarting after restart → there's a real problem, not a transient one
- Multiple pods affected → likely systemic, not pod-specific
Never restart ArgoCD-managed resources directly — ArgoCD will re-sync to desired state. Fix the underlying cause (secret, config, image) and let ArgoCD reconcile, or trigger a manual sync.
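Because a restart can discard the logs needed for diagnosis, one option is to save them first. The sketch below writes the current and previous logs of a pod to files; `save_pod_logs` and the file naming are hypothetical, and for multi-container pods a `-c <container>` argument would need to be added.

```shell
# Save current and previous logs of a pod into <dir> (default: current directory).
save_pod_logs() {
  pod="${1:?usage: save_pod_logs <pod> <namespace> [dir]}"
  ns="${2:?usage: save_pod_logs <pod> <namespace> [dir]}"
  dir="${3:-.}"
  kubectl logs "$pod" -n "$ns" > "$dir/$pod.current.log" 2>&1
  # --previous fails if the container never restarted; that is not an error here.
  kubectl logs "$pod" -n "$ns" --previous > "$dir/$pod.previous.log" 2>&1 || true
  echo "$dir/$pod.current.log $dir/$pod.previous.log"
}

# usage: save_pod_logs <pod> <ns> /tmp
```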