---
name: investigating-cluster-issue
description: Use when debugging Kubernetes issues on Zoe's homelab k3s cluster (k3s v1.35, Cilium, Traefik, ArgoCD, OpenBao, Grafana stack) or on AWS EKS clusters, including pod failures, sync errors, networking problems, storage issues, node failures, or any unexpected cluster behavior.
---

# Investigating Cluster Issues

## Overview

Systematic triage for Kubernetes problems. Always run Level 1 first to establish ground truth before narrowing down. Resist the urge to jump straight to logs; node and pod status often reveals the real problem faster.

## Environment Reference

**k3s homelab:**
- Nodes: master-01/02/03, worker-01/02, gpu-node
- CNI: Cilium | Ingress: Traefik | GitOps: ArgoCD (`argocd.ctz.fyi`)
- Secrets: External Secrets Operator + OpenBao (`bao.ctz.fyi`)
- Monitoring: Grafana (`grafana.monitoring.ctz.fyi`) with Mimir, Loki, Tempo
- Storage: `ssd` (NFS), `local-path`
- Registry: Harbor (`registry.ctz.fyi`)
- Key namespaces: `argocd`, `monitoring`, `keycloak`, `external-secrets`, `cert-manager`, `traefik`, `openbao`

**EKS:**
- Addons: aws-load-balancer-controller, external-dns, cluster-autoscaler, kube-prometheus-stack
- Storage: EBS CSI (`gp3` preferred), EFS for shared
- Auth: IRSA for pod AWS access
- Networking: aws-vpc-cni or Cilium + Calico network policies

---

## Quick Reference: Symptom → First Command

| Symptom | First command |
|---------|--------------|
| Pod stuck `Pending` | `kubectl describe pod <pod> -n <namespace>` → check Events |
| `CrashLoopBackOff` | `kubectl logs <pod> -n <namespace> --previous` |
| `ImagePullBackOff` | `kubectl describe pod <pod> -n <namespace>` → check image + secret |
| Secret not available | `kubectl get externalsecret -n <namespace>` |
| ArgoCD sync failing | `kubectl get application <app> -n argocd -o yaml` → `.status.conditions` |
| TLS cert not issuing | `kubectl get certificate -n <namespace>` |
| Node not Ready | `kubectl describe node <node>` → Events + Conditions |
| EKS ALB not creating | `kubectl describe ingress -n <namespace>` → check controller logs |
| Cluster-wide chaos | `kubectl get events -A --sort-by='.lastTimestamp' \| tail -30` |
| Not sure where to start | Run all three Level 1 commands |

---

## Level 1 — Immediate Triage (always run first)

```bash
kubectl get nodes -o wide
kubectl get pods -A | grep -Ev '(Running|Completed)'
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
```

Read the events output carefully; it frequently names the exact problem.

---

## Level 2 — Narrow to Failing Resource

```bash
kubectl describe pod <pod> -n <namespace>          # Events section is the most useful part
kubectl logs <pod> -n <namespace> --previous       # If pod restarted
kubectl logs <pod> -n <namespace> -c <container>   # Multi-container pods
```

---

## Level 3 — Root Causes by Symptom

### Pod stuck `Pending`

1. Check the describe events for `FailedScheduling`: resource constraints, taints/tolerations, affinity rules
2. Check PVCs: `kubectl get pvc -n <namespace>`
   - **k3s:** If PVC Pending, check NFS provisioner: `kubectl get pods -n nfs-provisioner`
   - **EKS:** Check EBS CSI driver: `kubectl get pods -n kube-system -l app=ebs-csi-controller`; verify IRSA annotation on ServiceAccount

### `CrashLoopBackOff`

1. `kubectl logs <pod> -n <namespace> --previous` and look for a panic, missing env var, missing file, or bad config
2. Check ExternalSecret synced: `kubectl get externalsecret -n <namespace>`; `SecretSyncedError` is common
3. Check dependent services (DB, cache, upstream API)
4. **k3s ArgoCD:** Check sync-wave ordering: the ExternalSecret must have a lower wave number than the Deployment

### ArgoCD sync failing (k3s)

```bash
kubectl get application <app> -n argocd -o yaml   # .status.conditions
kubectl get application <app> -n argocd -o jsonpath='{.status.operationState.message}'
```

- **OutOfSync on immutable field** → manually delete the resource, then re-sync
- **ExternalSecret missing** → check OpenBao (see below)
- Force refresh without sync: ArgoCD UI → hard refresh, or:
  ```bash
  kubectl annotate application <app> -n argocd argocd.argoproj.io/refresh=hard
  ```

### External Secrets not syncing

```bash
kubectl describe externalsecret <name> -n <namespace>   # .status.conditions
kubectl get clustersecretstore openbao -o yaml          # check Ready condition
kubectl exec -n openbao openbao-0 -- bao status         # check sealed/unsealed
```

- **OpenBao sealed:** Normally auto-unseals via OCI KMS. If stuck:
  ```bash
  # -it is needed because the command prompts for the key
  kubectl exec -it -n openbao openbao-0 -- bao operator unseal
  ```
- **ClusterSecretStore not Ready:** Check the ESO controller logs:
  ```bash
  kubectl logs -n external-secrets deploy/external-secrets -f
  ```

### `ImagePullBackOff`

```bash
kubectl describe pod <pod> -n <namespace>   # look for "401 Unauthorized" or "not found"
```

- Wrong image tag → fix in manifest/values
- Missing `imagePullSecret` → verify secret exists: `kubectl get secret <secret> -n <namespace>`
- **k3s Harbor auth:** Ensure secret references `registry.ctz.fyi` and is attached to ServiceAccount or pod spec
- Registry unreachable → check Harbor pod health: `kubectl get pods -n harbor`

### IngressRoute / TLS not working (k3s)

```bash
kubectl get certificate -n <namespace>               # Ready=False = problem
kubectl describe certificate <name> -n <namespace>   # check Events
kubectl get ingressroute -n <namespace>
kubectl get ingress -n <namespace>                   # cert-manager needs a standard Ingress to issue
```

- cert-manager needs a standard `Ingress` resource alongside the `IngressRoute`; if it is missing, the cert won't issue
- Check Traefik pods: `kubectl get pods -n traefik`

### EKS — Node not joining

```bash
kubectl get configmap aws-auth -n kube-system -o yaml   # verify node IAM role mapped
# On the node:
journalctl -u kubelet -n 100
```

- Check security groups: nodes need port 443 outbound to control plane endpoint
- Check node IAM role has `AmazonEKSWorkerNodePolicy`, `AmazonEKS_CNI_Policy`, `AmazonEC2ContainerRegistryReadOnly`

### EKS — ALB/NLB not creating

```bash
kubectl describe ingress -n <namespace>
kubectl logs -n kube-system deploy/aws-load-balancer-controller | tail -50
```

- Verify annotations: `kubernetes.io/ingress.class: alb` (or the newer `spec.ingressClassName: alb`)
- Check IRSA: ServiceAccount must have `eks.amazonaws.com/role-arn` annotation
- Check controller has correct IAM permissions (policy document)

---

## Level 4 — System-Level Checks

```bash
# k3s control plane
# componentstatuses is deprecated since Kubernetes v1.19; /readyz is the more reliable check
kubectl get componentstatuses
kubectl get --raw='/readyz?verbose'
# On master nodes:
systemctl status k3s

# Cilium (k3s)
kubectl -n kube-system exec ds/cilium -- cilium status
kubectl -n kube-system get pods -l k8s-app=cilium

# Resource pressure (both environments)
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20

# EKS cluster info
aws eks describe-cluster --name <cluster> --region <region>
```

---

## Level 5 — Logs via Grafana (k3s)

Grafana: `grafana.monitoring.ctz.fyi`

**Loki log queries:**
```
{namespace="<namespace>"}
{namespace="<namespace>", app="<app>"} |= "error"
{namespace="<namespace>"} | logfmt | level="error"
```

**Mimir (metrics):** Check CPU/memory graphs around the time of failure. Spikes often correlate with OOMKills or CPU throttling; throttling does not appear in `kubectl describe`, and an OOMKill shows there only as the container's last terminated state.

---

## Live Debugging Inside a Container

```bash
kubectl exec -it <pod> -n <namespace> -- /bin/sh
# or if bash available:
kubectl exec -it <pod> -n <namespace> -- bash
# multi-container:
kubectl exec -it <pod> -n <namespace> -c <container> -- /bin/sh
```

Use for: verifying env vars, testing connectivity (`curl`, `wget`, `nslookup`), checking mounted files.
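
The env var and mounted-file checks can be bundled into small helpers to paste into the `kubectl exec` shell. This is a minimal sketch, assuming the image has a POSIX `/bin/sh` and `printenv`; the function names `check_env` and `check_files`, and the variable and path in the usage note, are hypothetical.

```bash
# check_env NAME...: print each named environment variable that is unset or empty.
# Returns 1 if any are missing, 0 otherwise.
check_env() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "MISSING: $name"
      missing=1
    fi
  done
  return "$missing"
}

# check_files FILE...: print each file that does not exist or is not readable.
# Returns 1 if any are unreadable, 0 otherwise.
check_files() {
  unreadable=0
  for file in "$@"; do
    if [ ! -r "$file" ]; then
      echo "UNREADABLE: $file"
      unreadable=1
    fi
  done
  return "$unreadable"
}
```

For example, `check_env DATABASE_URL && check_files /etc/app/config.yaml` would confirm that a (hypothetical) secret-backed variable and a mounted config file are both present before digging into application logs.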

---

## Restart vs Dig Deeper

**Restart first when:**
- Pod is in unknown/evicted state with no clear cause
- You've already identified the root cause and fixed it
- OOMKilled and you're about to bump memory limits

**Dig deeper first when:**
- CrashLoopBackOff with no obvious cause (logs will be lost on restart)
- Data loss risk
- Same pod keeps restarting after restart → there's a real problem, not a transient one
- Multiple pods affected → likely systemic, not pod-specific

**Never restart ArgoCD-managed resources directly.** ArgoCD will re-sync to the desired state. Fix the underlying cause (secret, config, image) and let ArgoCD reconcile, or trigger a manual sync.
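
Because a restart discards the crashed container's logs, it can help to save evidence first. The following is a minimal sketch, assuming `kubectl` is on the `PATH` and pointed at the right cluster; the function name `capture_pod_evidence` and the default `./evidence` output directory are hypothetical.

```bash
# capture_pod_evidence POD NAMESPACE [OUT_DIR]
# Saves describe output plus current and previous logs before a restart.
capture_pod_evidence() {
  pod=$1
  ns=$2
  out_dir=${3:-./evidence}
  mkdir -p "$out_dir" || return 1

  kubectl describe pod "$pod" -n "$ns" > "$out_dir/$pod.describe.txt"
  kubectl logs "$pod" -n "$ns" --all-containers > "$out_dir/$pod.current.log"
  # --previous fails if no container has restarted yet; tolerate that.
  kubectl logs "$pod" -n "$ns" --all-containers --previous \
    > "$out_dir/$pod.previous.log" 2>/dev/null || true
}
```

Running `capture_pod_evidence <pod> <namespace>` before deleting or restarting a pod keeps the describe output and both log streams on disk, so the investigation can continue after the pod is gone.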