autojanet/skills/investigating-cluster-issue/SKILL.md
Zoë cc74ad0bd0
fix: use library/ Harbor project, add skills, fix pipeline secrets
- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/
2026-05-30 15:43:14 -07:00


name: investigating-cluster-issue
description: Use when debugging Kubernetes issues on Zoe's homelab k3s cluster (k3s v1.35, Cilium, Traefik, ArgoCD, OpenBao, Grafana stack) or on AWS EKS clusters: pod failures, sync errors, networking problems, storage issues, node failures, or any unexpected cluster behavior.

Investigating Cluster Issues

Overview

Systematic triage for Kubernetes problems. Always run Level 1 first to establish ground truth before narrowing down. Resist the urge to jump straight to logs — node and pod status often reveals the real problem faster.

Environment Reference

k3s homelab:

  • Nodes: master-01/02/03, worker-01/02, gpu-node
  • CNI: Cilium | Ingress: Traefik | GitOps: ArgoCD (argocd.ctz.fyi)
  • Secrets: External Secrets Operator + OpenBao (bao.ctz.fyi)
  • Monitoring: Grafana (grafana.monitoring.ctz.fyi) — Mimir, Loki, Tempo
  • Storage: ssd (NFS), local-path
  • Registry: Harbor (registry.ctz.fyi)
  • Key namespaces: argocd, monitoring, keycloak, external-secrets, cert-manager, traefik, openbao

EKS:

  • Addons: aws-load-balancer-controller, external-dns, cluster-autoscaler, kube-prometheus-stack
  • Storage: EBS CSI (gp3 preferred), EFS for shared
  • Auth: IRSA for pod AWS access
  • Networking: aws-vpc-cni or Cilium + Calico network policies

Quick Reference: Symptom → First Command

Symptom                  First command
Pod stuck Pending        kubectl describe pod <pod> -n <ns> → check Events
CrashLoopBackOff         kubectl logs <pod> -n <ns> --previous
ImagePullBackOff         kubectl describe pod <pod> -n <ns> → check image + secret
Secret not available     kubectl get externalsecret -n <ns>
ArgoCD sync failing      kubectl get application <name> -n argocd -o yaml → check .status.conditions
TLS cert not issuing     kubectl get certificate -n <ns>
Node not Ready           kubectl describe node <name> → Events + Conditions
EKS ALB not creating     kubectl describe ingress <name> -n <ns> → check controller logs
Cluster-wide chaos       kubectl get events -A --sort-by='.lastTimestamp' | tail -30
Not sure where to start  Run all three Level 1 commands

Level 1 — Immediate Triage (always run first)

kubectl get nodes -o wide
kubectl get pods -A | grep -Ev '(Running|Completed)'
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

Read the events output carefully — it frequently names the exact problem.
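
The grep filter above hides pods that report Running while their containers are not ready (for example 0/1 Running during a failing readiness probe). A minimal sketch of a stricter filter follows. `unhealthy_pods` is a hypothetical helper name, and the sketch assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# the header plus every pod that is neither Completed nor fully ready.
# Assumes default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }            # keep the header row
    $4 == "Completed" { next }         # finished Jobs are expected
    {
      split($3, ready, "/")            # READY looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}
```

Usage: `kubectl get pods -A | unhealthy_pods`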


Level 2 — Narrow to Failing Resource

kubectl describe pod <name> -n <ns>        # Events section is the most useful part
kubectl logs <pod> -n <ns> --previous      # If pod restarted
kubectl logs <pod> -n <ns> -c <container>  # Multi-container pods

Level 3 — Root Causes by Symptom

Pod stuck Pending

  1. Check describe events for FailedScheduling — resource constraints, taints/tolerations, affinity rules
  2. Check PVCs: kubectl get pvc -n <ns>
    • k3s: If PVC Pending, check NFS provisioner: kubectl get pods -n nfs-provisioner
    • EKS: Check EBS CSI driver: kubectl get pods -n kube-system -l app=ebs-csi-controller; verify IRSA annotation on ServiceAccount
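
When the describe output is long, the scheduling events for a single pod can be pulled on their own. This is a sketch only: `pod_events` is a hypothetical wrapper name, and it relies on the `involvedObject.*` field selectors that `kubectl get events` accepts.

```shell
# Hypothetical wrapper: list events for one pod, oldest first, so that
# FailedScheduling messages are easy to spot.
pod_events() {  # usage: pod_events <namespace> <pod>
  local ns="$1" pod="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

Usage: `pod_events <ns> <pod> | grep -i failedscheduling`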

CrashLoopBackOff

  1. kubectl logs <pod> --previous — look for panic, missing env var, missing file, bad config
  2. Check ExternalSecret synced: kubectl get externalsecret -n <ns> (SecretSyncedError is common)
  3. Check dependent services (DB, cache, upstream API)
  4. k3s ArgoCD: Check sync-wave ordering — ExternalSecret must have lower wave number than Deployment
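
The sync-wave check in step 4 can be done against the live objects. The sketch below assumes the waves are set through the `argocd.argoproj.io/sync-wave` annotation and treats a missing annotation as wave 0. `sync_wave` and `wave_before` are hypothetical helper names.

```shell
# Hypothetical helper: print the argocd.argoproj.io/sync-wave annotation of a
# live object, or 0 when the annotation is unset.
sync_wave() {  # usage: sync_wave <kind> <name> <namespace>
  local wave
  wave="$(kubectl get "$1" "$2" -n "$3" \
    -o jsonpath='{.metadata.annotations.argocd\.argoproj\.io/sync-wave}')"
  printf '%s\n' "${wave:-0}"
}

# Succeeds only when the first wave number is strictly lower than the second.
wave_before() {  # usage: wave_before <wave-a> <wave-b>
  [ "$1" -lt "$2" ]
}
```

Usage: `wave_before "$(sync_wave externalsecret <name> <ns>)" "$(sync_wave deployment <name> <ns>)" || echo "ExternalSecret is not ordered before the Deployment"`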

ArgoCD sync failing (k3s)

kubectl get application <name> -n argocd -o yaml   # .status.conditions
kubectl get application <name> -n argocd -o jsonpath='{.status.operationState.message}'
  • OutOfSync on immutable field → manually delete the resource, then re-sync
  • ExternalSecret missing → check OpenBao (see below)
  • Force refresh without sync: ArgoCD UI → hard refresh, or:
    kubectl annotate application <name> -n argocd argocd.argoproj.io/refresh=hard
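
Reading `.status.conditions` out of the full YAML gets tedious. A possible shortcut is a jsonpath range that prints one condition per line; `app_conditions` is a hypothetical helper name, and the `argocd` namespace is taken from the commands above.

```shell
# Hypothetical helper: print "<type><TAB><message>" for every condition on an
# Argo CD Application in the argocd namespace.
app_conditions() {  # usage: app_conditions <application>
  kubectl get application "$1" -n argocd \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.message}{"\n"}{end}'
}
```

An empty result means the Application currently reports no conditions.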
    

External Secrets not syncing

kubectl describe externalsecret <name> -n <ns>     # .status.conditions
kubectl get clustersecretstore openbao -o yaml     # check Ready condition
kubectl exec -n openbao openbao-0 -- bao status    # check sealed/unsealed
  • OpenBao sealed: Normally auto-unseals via OCI KMS. If stuck:
    kubectl exec -it -n openbao openbao-0 -- bao operator unseal    # -it: the command prompts for a key
    
  • ClusterSecretStore not Ready: Check the ESO controller logs:
    kubectl logs -n external-secrets deploy/external-secrets -f
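
For scripted checks, the seal state can be read from the `bao status` output. The sketch assumes the default table output, which contains a `Sealed` row with `true` or `false`; `seal_state` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `bao status` table output on stdin and prints
# "sealed", "unsealed", or "unknown" (when no Sealed row is present).
seal_state() {
  awk '
    $1 == "Sealed" { print ($2 == "true" ? "sealed" : "unsealed"); found = 1; exit }
    END { if (!found) print "unknown" }
  '
}
```

Usage: `kubectl exec -n openbao openbao-0 -- bao status | seal_state`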
    

ImagePullBackOff

kubectl describe pod <name> -n <ns>   # look for "401 Unauthorized" or "not found"
  • Wrong image tag → fix in manifest/values
  • Missing imagePullSecret → verify secret exists: kubectl get secret -n <ns>
  • k3s Harbor auth: Ensure secret references registry.ctz.fyi and is attached to ServiceAccount or pod spec
  • Registry unreachable → check Harbor pod health: kubectl get pods -n harbor
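
To confirm that a pull secret actually covers the registry, its payload can be decoded and searched. This is a sketch assuming a secret of type `kubernetes.io/dockerconfigjson`; `pull_secret_has_registry` is a hypothetical helper name, and it does a plain substring match on the decoded JSON.

```shell
# Hypothetical helper: stdin is the base64-encoded .dockerconfigjson value,
# $1 is the registry host to look for. Succeeds if the host appears.
pull_secret_has_registry() {  # usage: ... | pull_secret_has_registry <host>
  base64 -d | grep -Fq "$1"
}
```

Usage: `kubectl get secret <secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | pull_secret_has_registry registry.ctz.fyi && echo "registry present"`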

IngressRoute / TLS not working (k3s)

kubectl get certificate -n <ns>                   # Ready=False = problem
kubectl describe certificate <name> -n <ns>       # check Events
kubectl get ingressroute -n <ns>
kubectl get ingress -n <ns>                       # cert-manager needs a standard Ingress to issue
  • cert-manager needs a standard Ingress resource alongside IngressRoute — if missing, cert won't issue
  • Check Traefik pods: kubectl get pods -n traefik
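
With many certificates, only the failing ones matter. The sketch below filters `kubectl get certificate` output by its READY column, locating that column from the header so it works with or without `-A`. `unready_certs` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get certificate` output on stdin and
# prints the header plus every row whose READY column is not "True".
unready_certs() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "READY") col = i; print; next }
    col && $col != "True" { print }
  '
}
```

Usage: `kubectl get certificate -A | unready_certs`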

EKS — Node not joining

kubectl get configmap aws-auth -n kube-system -o yaml    # verify node IAM role mapped
# On the node:
journalctl -u kubelet -n 100
  • Check security groups: nodes need port 443 outbound to control plane endpoint
  • Check node IAM role has AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly
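
The policy check can be scripted by comparing the role's attached policy names against the three listed above. A minimal sketch; `missing_node_policies` is a hypothetical helper name and expects one policy name per line on stdin.

```shell
# Hypothetical helper: reads attached policy names (one per line) on stdin and
# prints each required EKS node policy that is missing.
missing_node_policies() {
  local attached policy
  attached="$(cat)"
  for policy in AmazonEKSWorkerNodePolicy AmazonEKS_CNI_Policy AmazonEC2ContainerRegistryReadOnly; do
    printf '%s\n' "$attached" | grep -Fxq "$policy" || printf '%s\n' "$policy"
  done
}
```

Usage: `aws iam list-attached-role-policies --role-name <node-role> --query 'AttachedPolicies[].PolicyName' --output text | tr '\t' '\n' | missing_node_policies`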

EKS — ALB/NLB not creating

kubectl describe ingress <name> -n <ns>
kubectl logs -n kube-system deploy/aws-load-balancer-controller | tail -50
  • Verify the ingress class: spec.ingressClassName: alb, or the older kubernetes.io/ingress.class: alb annotation
  • Check IRSA: ServiceAccount must have eks.amazonaws.com/role-arn annotation
  • Check controller has correct IAM permissions (policy document)
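
The IRSA check can be reduced to reading one annotation. In this sketch `irsa_role_arn` and `has_irsa` are hypothetical helper names, and the ServiceAccount name `aws-load-balancer-controller` in the usage line is an assumption that matches the Deployment name above.

```shell
# Hypothetical helper: print the IAM role ARN bound to a ServiceAccount via
# the eks.amazonaws.com/role-arn annotation (empty output if it is not set).
irsa_role_arn() {  # usage: irsa_role_arn <serviceaccount> <namespace>
  kubectl get serviceaccount "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}

# Succeeds only when the annotation is present and non-empty.
has_irsa() {  # usage: has_irsa <serviceaccount> <namespace>
  [ -n "$(irsa_role_arn "$1" "$2")" ]
}
```

Usage: `has_irsa aws-load-balancer-controller kube-system || echo "IRSA annotation missing"`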

Level 4 — System-Level Checks

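# k3s control plane
kubectl get componentstatuses          # deprecated since Kubernetes v1.19; output may be incomplete
kubectl get --raw='/readyz?verbose'    # per-check API server readiness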
# On master nodes:
systemctl status k3s

# Cilium (k3s)
kubectl -n kube-system exec ds/cilium -- cilium status
kubectl -n kube-system get pods -l k8s-app=cilium

# Resource pressure (both environments)
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20

# EKS cluster info
aws eks describe-cluster --name <cluster> --region <region>
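
To turn the resource-pressure check into a quick yes/no, the `kubectl top nodes` output can be filtered by memory percentage. A sketch assuming the default columns (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); `nodes_over_memory` is a hypothetical helper name and the 80% default is arbitrary.

```shell
# Hypothetical helper: reads `kubectl top nodes` output on stdin and prints
# "<node> <memory%>" for nodes above the given percentage (default 80).
nodes_over_memory() {  # usage: kubectl top nodes | nodes_over_memory [limit]
  awk -v limit="${1:-80}" '
    NR == 1 { next }                   # skip the header row
    {
      pct = $5
      sub(/%$/, "", pct)               # "91%" -> "91"
      if (pct + 0 > limit + 0) print $1, $5
    }
  '
}
```

Usage: `kubectl top nodes | nodes_over_memory 85`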

Level 5 — Logs via Grafana (k3s)

Grafana: grafana.monitoring.ctz.fyi

Loki log queries:

{namespace="<ns>"}
{namespace="<ns>", app="<name>"} |= "error"
{namespace="<ns>"} | logfmt | level="error"
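
The same queries can be run from a terminal instead of the Grafana UI. The sketch below only builds the LogQL string; `logql_errors` is a hypothetical helper name, and the usage line assumes `logcli` is installed and `LOKI_ADDR` points at the Loki endpoint.

```shell
# Hypothetical helper: build a LogQL query for lines containing "error" in a
# namespace, optionally narrowed to one app label.
logql_errors() {  # usage: logql_errors <namespace> [app]
  if [ -n "${2:-}" ]; then
    printf '{namespace="%s", app="%s"} |= "error"\n' "$1" "$2"
  else
    printf '{namespace="%s"} |= "error"\n' "$1"
  fi
}
```

Usage: `logcli query --since=1h "$(logql_errors <ns> <name>)"`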

Mimir (metrics): Check CPU/memory graphs around the time of failure. Spikes often correlate with OOMKills or CPU throttling; throttling in particular never shows up in kubectl describe.


Live Debugging Inside a Container

kubectl exec -it <pod> -n <ns> -- /bin/sh
# or if bash available:
kubectl exec -it <pod> -n <ns> -- bash
# multi-container:
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/sh

Use for: verifying env vars, testing connectivity (curl, wget, nslookup), checking mounted files.


Restart vs Dig Deeper

Restart first when:

  • Pod is in unknown/evicted state with no clear cause
  • You've already identified the root cause and fixed it
  • OOMKilled and you're about to bump memory limits

Dig deeper first when:

  • CrashLoopBackOff with no obvious cause (logs will be lost on restart)
  • Data loss risk
  • Same pod keeps restarting after restart → there's a real problem, not a transient one
  • Multiple pods affected → likely systemic, not pod-specific
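
Because deleting a pod discards its logs, it can help to snapshot the evidence first. A minimal sketch; `snapshot_pod` and the output file names are hypothetical, and the target directory must already exist.

```shell
# Hypothetical helper: save describe output plus current and previous logs
# for a pod before restarting it.
snapshot_pod() {  # usage: snapshot_pod <pod> <namespace> [dir]
  local pod="$1" ns="$2" dir="${3:-.}"
  kubectl describe pod "$pod" -n "$ns" > "$dir/$pod.describe.txt"
  kubectl logs "$pod" -n "$ns" --all-containers > "$dir/$pod.logs.txt"
  # --previous fails when no container has restarted yet; ignore that case.
  kubectl logs "$pod" -n "$ns" --all-containers --previous \
    > "$dir/$pod.previous.txt" 2>/dev/null || true
}
```

Usage: `snapshot_pod <pod> <ns> /tmp && kubectl delete pod <pod> -n <ns>`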

Never restart ArgoCD-managed resources directly — ArgoCD will re-sync to desired state. Fix the underlying cause (secret, config, image) and let ArgoCD reconcile, or trigger a manual sync.