autojanet/skills/investigating-cluster-issue/SKILL.md

---
name: investigating-cluster-issue
description: Use when debugging Kubernetes issues on Zoe's homelab k3s cluster (k3s v1.35, Cilium, Traefik, ArgoCD, OpenBao, Grafana stack) or on AWS EKS clusters — pod failures, sync errors, networking problems, storage issues, node failures, or any unexpected cluster behavior.
---

# Investigating Cluster Issues

## Overview

Systematic triage for Kubernetes problems. Always run Level 1 first to establish ground truth before narrowing down. Resist the urge to jump straight to logs — node and pod status often reveals the real problem faster.

## Environment Reference

**k3s homelab:**
- Nodes: master-01/02/03, worker-01/02, gpu-node
- CNI: Cilium | Ingress: Traefik | GitOps: ArgoCD (`argocd.ctz.fyi`)
- Secrets: External Secrets Operator + OpenBao (`bao.ctz.fyi`)
- Monitoring: Grafana (`grafana.monitoring.ctz.fyi`) — Mimir, Loki, Tempo
- Storage: `ssd` (NFS), `local-path`
- Registry: Harbor (`registry.ctz.fyi`)
- Key namespaces: `argocd`, `monitoring`, `keycloak`, `external-secrets`, `cert-manager`, `traefik`, `openbao`

**EKS:**
- Addons: aws-load-balancer-controller, external-dns, cluster-autoscaler, kube-prometheus-stack
- Storage: EBS CSI (`gp3` preferred), EFS for shared
- Auth: IRSA for pod AWS access
- Networking: aws-vpc-cni or Cilium + Calico network policies

---

## Quick Reference: Symptom → First Command

| Symptom | First command |
|---------|--------------|
| Pod stuck `Pending` | `kubectl describe pod <pod> -n <ns>` → check Events |
| `CrashLoopBackOff` | `kubectl logs <pod> -n <ns> --previous` |
| `ImagePullBackOff` | `kubectl describe pod <pod> -n <ns>` → check image + secret |
| Secret not available | `kubectl get externalsecret -n <ns>` |
| ArgoCD sync failing | `kubectl get application <name> -n argocd -o yaml` → `.status.conditions` |
| TLS cert not issuing | `kubectl get certificate -n <ns>` |
| Node not Ready | `kubectl describe node <name>` → Events + Conditions |
| EKS ALB not creating | `kubectl describe ingress <name> -n <ns>` → check controller logs |
| Cluster-wide chaos | `kubectl get events -A --sort-by='.lastTimestamp' \| tail -30` |
| Not sure where to start | Run all three Level 1 commands |

---

## Level 1 — Immediate Triage (always run first)

```bash
kubectl get nodes -o wide
kubectl get pods -A | grep -Ev '(Running|Completed)'
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
```

Read the events output carefully — it frequently names the exact problem.
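
The `grep` filter above drops every pod whose STATUS is `Running`, including pods whose containers are not all ready. A minimal sketch of a stricter filter, assuming the default column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status); the function name `not_ready_pods` is made up for illustration:

```bash
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and prints pods whose READY count (e.g. 0/1) is incomplete.
# Completed pods are skipped because 0/1 is expected for finished Jobs.
not_ready_pods() {
  awk '{
    split($3, ready, "/")
    if (ready[1] != ready[2] && $4 != "Completed") {
      print $1 "/" $2, $3, $4
    }
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | not_ready_pods
```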

---

## Level 2 — Narrow to Failing Resource

```bash
kubectl describe pod <name> -n <ns>        # Events section is the most useful part
kubectl logs <pod> -n <ns> --previous      # If pod restarted
kubectl logs <pod> -n <ns> -c <container>  # Multi-container pods
```

---

## Level 3 — Root Causes by Symptom

### Pod stuck `Pending`

1. Check describe events for `FailedScheduling` — resource constraints, taints/tolerations, affinity rules
2. Check PVCs: `kubectl get pvc -n <ns>`
   - **k3s:** If PVC Pending, check NFS provisioner: `kubectl get pods -n nfs-provisioner`
   - **EKS:** Check EBS CSI driver: `kubectl get pods -n kube-system -l app=ebs-csi-controller`; verify IRSA annotation on ServiceAccount
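
The `FailedScheduling` causes in step 1 can usually be told apart from the event message alone. A rough sketch, assuming common kube-scheduler wording (the exact strings may differ between Kubernetes versions); `classify_scheduling_failure` is a hypothetical name, and a message that lists several causes is reported under the first pattern that matches:

```bash
# Hypothetical helper: maps a FailedScheduling event message (argument 1)
# to one of the cause categories from the checklist above.
# The match patterns are assumptions, not a stable Kubernetes API.
classify_scheduling_failure() {
  case "$1" in
    *"unbound immediate PersistentVolumeClaims"*) echo "pvc" ;;
    *Insufficient*)                               echo "resources" ;;
    *"untolerated taint"*)                        echo "taint" ;;
    *"node affinity"*|*"node selector"*)          echo "affinity" ;;
    *)                                            echo "unknown" ;;
  esac
}

# Usage: paste the message from the Events section of `kubectl describe pod`:
#   classify_scheduling_failure "0/6 nodes are available: 6 Insufficient memory."
```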

### `CrashLoopBackOff`

1. `kubectl logs <pod> -n <ns> --previous`: look for a panic, missing env var, missing file, or bad config
2. Check ExternalSecret synced: `kubectl get externalsecret -n <ns>` — `SecretSyncedError` is common
3. Check dependent services (DB, cache, upstream API)
4. **k3s ArgoCD:** Check sync-wave ordering — ExternalSecret must have lower wave number than Deployment

### ArgoCD sync failing (k3s)

```bash
kubectl get application <name> -n argocd -o yaml   # .status.conditions
kubectl get application <name> -n argocd -o jsonpath='{.status.operationState.message}'
```

- **OutOfSync on immutable field** → manually delete the resource, then re-sync
- **ExternalSecret missing** → check OpenBao (see below)
- Force refresh without sync: ArgoCD UI → hard refresh, or:
  ```bash
  kubectl annotate application <name> -n argocd argocd.argoproj.io/refresh=hard --overwrite
  ```

### External Secrets not syncing

```bash
kubectl describe externalsecret <name> -n <ns>     # .status.conditions
kubectl get clustersecretstore openbao -o yaml     # check Ready condition
kubectl exec -n openbao openbao-0 -- bao status    # check sealed/unsealed
```

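- **OpenBao sealed:** Normally auto-unseals via OCI KMS, so a sealed pod usually means the KMS call is failing; check `kubectl logs -n openbao openbao-0` first. If a manual unseal applies to your seal configuration (the command prompts for a key, hence `-it`):
  ```bash
  kubectl exec -it -n openbao openbao-0 -- bao operator unseal
  ```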
- **ClusterSecretStore not Ready:** Check the ESO controller logs:
  ```bash
  kubectl logs -n external-secrets deploy/external-secrets -f
  ```
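
The sealed/unsealed check on `bao status` can be scripted. A small sketch, assuming the default table output, where one row starts with `Sealed` followed by `true` or `false`; `bao_is_sealed` is a hypothetical name:

```bash
# Hypothetical helper: reads `bao status` table output on stdin and
# succeeds (exit 0) only when the "Sealed" row reads "true".
# The "Seal Type" row is ignored because its first field is "Seal".
bao_is_sealed() {
  awk '$1 == "Sealed" { sealed = $2 } END { exit (sealed == "true" ? 0 : 1) }'
}

# Usage (requires cluster access):
#   kubectl exec -n openbao openbao-0 -- bao status | bao_is_sealed \
#     && echo "OpenBao is sealed"
```

The function reads stdin rather than calling `kubectl` itself, so it can be tried against saved output.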

### `ImagePullBackOff`

```bash
kubectl describe pod <name> -n <ns>   # look for "401 Unauthorized" or "not found"
```

- Wrong image tag → fix in manifest/values
- Missing `imagePullSecrets` entry → verify the secret exists: `kubectl get secret -n <ns>`
- **k3s Harbor auth:** Ensure secret references `registry.ctz.fyi` and is attached to ServiceAccount or pod spec
- Registry unreachable → check Harbor pod health: `kubectl get pods -n harbor`
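
To confirm that a pull secret actually references the registry, its decoded `.dockerconfigjson` can be searched for the host. A quick sketch; `dockerconfig_has_registry` is a hypothetical name, and the check is a plain-text match rather than a JSON parse, so treat it as a sanity check only:

```bash
# Hypothetical helper: reads a decoded .dockerconfigjson on stdin and succeeds
# (exit 0) when the registry host given as argument 1 appears in it.
dockerconfig_has_registry() {
  grep -qF "$1"
}

# Usage (requires cluster access; <secret> and <ns> are placeholders):
#   kubectl get secret <secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' \
#     | base64 -d | dockerconfig_has_registry registry.ctz.fyi \
#     && echo "secret references Harbor"
```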

### IngressRoute / TLS not working (k3s)

```bash
kubectl get certificate -n <ns>                   # Ready=False = problem
kubectl describe certificate <name> -n <ns>       # check Events
kubectl get ingressroute -n <ns>
kubectl get ingress -n <ns>                       # cert-manager needs a standard Ingress to issue
```

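- cert-manager's ingress-shim only watches standard `Ingress` resources, not Traefik `IngressRoute`. Without a standard `Ingress` alongside the `IngressRoute` (or an explicit `Certificate` resource), the cert won't issue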
- Check Traefik pods: `kubectl get pods -n traefik`
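
When it is unclear which namespace holds the failing certificate, the `Ready` check can be run cluster-wide. A sketch, assuming cert-manager's default columns for `kubectl get certificate -A --no-headers` (namespace, name, ready, secret, age); `unready_certificates` is a hypothetical name:

```bash
# Hypothetical helper: reads `kubectl get certificate -A --no-headers` on stdin
# and prints namespace/name for every certificate whose READY column is not "True".
unready_certificates() {
  awk '$3 != "True" { print $1 "/" $2 }'
}

# Usage (requires cluster access):
#   kubectl get certificate -A --no-headers | unready_certificates
```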

### EKS — Node not joining

```bash
kubectl get configmap aws-auth -n kube-system -o yaml    # verify node IAM role mapped
# On the node:
journalctl -u kubelet -n 100
```

- Check security groups: nodes need port 443 outbound to control plane endpoint
- Check node IAM role has `AmazonEKSWorkerNodePolicy`, `AmazonEKS_CNI_Policy`, `AmazonEC2ContainerRegistryReadOnly`

### EKS — ALB/NLB not creating

```bash
kubectl describe ingress <name> -n <ns>
kubectl logs -n kube-system deploy/aws-load-balancer-controller | tail -50
```

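- Verify the ingress class: `spec.ingressClassName: alb` (the older `kubernetes.io/ingress.class: alb` annotation is deprecated but still recognized by the controller)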
- Check IRSA: ServiceAccount must have `eks.amazonaws.com/role-arn` annotation
- Check controller has correct IAM permissions (policy document)

---

## Level 4 — System-Level Checks

```bash
# k3s control plane
kubectl get --raw='/readyz?verbose'
kubectl get componentstatuses          # deprecated since Kubernetes v1.19; output may be incomplete
# On master nodes:
systemctl status k3s

# Cilium (k3s)
kubectl -n kube-system exec ds/cilium -- cilium status   # on newer Cilium releases the in-pod CLI is named cilium-dbg
kubectl -n kube-system get pods -l k8s-app=cilium

# Resource pressure (both environments)
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20

# EKS cluster info
aws eks describe-cluster --name <cluster> --region <region>
```
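
The resource-pressure check can be narrowed to the nodes that matter. A sketch, assuming the default columns of `kubectl top nodes --no-headers` (name, CPU cores, CPU%, memory bytes, memory%); `nodes_under_memory_pressure` is a hypothetical name and the default threshold of 85 is an arbitrary choice:

```bash
# Hypothetical helper: reads `kubectl top nodes --no-headers` on stdin and prints
# nodes whose MEMORY% is at or above the threshold in argument 1 (default: 85).
nodes_under_memory_pressure() {
  awk -v limit="${1:-85}" '{
    pct = $5
    sub(/%/, "", pct)                          # "91%" -> "91"
    if (pct + 0 >= limit + 0) { print $1, $5 }
  }'
}

# Usage (requires metrics-server and cluster access):
#   kubectl top nodes --no-headers | nodes_under_memory_pressure 80
```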

---

## Level 5 — Logs via Grafana (k3s)

Grafana: `grafana.monitoring.ctz.fyi`

**Loki log queries:**
```
{namespace="<ns>"}
{namespace="<ns>", app="<name>"} |= "error"
{namespace="<ns>"} | logfmt | level="error"
```

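**Mimir (metrics):** Check CPU/memory graphs around the time of failure. Spikes often correlate with OOMKills or CPU throttling: throttling never shows up in `kubectl describe`, and an OOMKill shows only as the container's last terminated state (`Reason: OOMKilled`).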

---

## Live Debugging Inside a Container

```bash
kubectl exec -it <pod> -n <ns> -- /bin/sh
# or if bash available:
kubectl exec -it <pod> -n <ns> -- bash
# multi-container:
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/sh
```

Use for: verifying env vars, testing connectivity (`curl`, `wget`, `nslookup`), checking mounted files.

---

## Restart vs Dig Deeper

**Restart first when:**
- Pod is in unknown/evicted state with no clear cause
- You've already identified the root cause and fixed it
- OOMKilled and you're about to bump memory limits

**Dig deeper first when:**
- CrashLoopBackOff with no obvious cause (logs will be lost on restart)
- Data loss risk
- Pod keeps crashing after a restart → there's a real problem, not a transient one
- Multiple pods affected → likely systemic, not pod-specific

**Never patch or restart ArgoCD-managed resources directly.** ArgoCD re-syncs them to the desired state, so manual changes do not stick. Fix the underlying cause (secret, config, image) and let ArgoCD reconcile, or trigger a manual sync.