---
name: investigating-cluster-issue
description: Use when debugging Kubernetes issues on Zoe's homelab k3s cluster (k3s v1.35, Cilium, Traefik, ArgoCD, OpenBao, Grafana stack) or on AWS EKS clusters, including pod failures, sync errors, networking problems, storage issues, node failures, or any unexpected cluster behavior.
---

# Investigating Cluster Issues

## Overview

Systematic triage for Kubernetes problems. Always run Level 1 first to establish ground truth before narrowing down. Resist the urge to jump straight to logs: node and pod status often reveals the real problem faster.

## Environment Reference

**k3s homelab:**
- Nodes: master-01/02/03, worker-01/02, gpu-node
- CNI: Cilium | Ingress: Traefik | GitOps: ArgoCD (`argocd.ctz.fyi`)
- Secrets: External Secrets Operator + OpenBao (`bao.ctz.fyi`)
- Monitoring: Grafana (`grafana.monitoring.ctz.fyi`) with Mimir, Loki, Tempo
- Storage: `ssd` (NFS), `local-path`
- Registry: Harbor (`registry.ctz.fyi`)
- Key namespaces: `argocd`, `monitoring`, `keycloak`, `external-secrets`, `cert-manager`, `traefik`, `openbao`

**EKS:**
- Addons: aws-load-balancer-controller, external-dns, cluster-autoscaler, kube-prometheus-stack
- Storage: EBS CSI (`gp3` preferred), EFS for shared
- Auth: IRSA for pod AWS access
- Networking: aws-vpc-cni or Cilium + Calico network policies

---

## Quick Reference: Symptom → First Command

| Symptom | First command |
|---------|--------------|
| Pod stuck `Pending` | `kubectl describe pod <pod> -n <ns>` → check Events |
| `CrashLoopBackOff` | `kubectl logs <pod> -n <ns> --previous` |
| `ImagePullBackOff` | `kubectl describe pod <pod> -n <ns>` → check image + secret |
| Secret not available | `kubectl get externalsecret -n <ns>` |
| ArgoCD sync failing | `kubectl get application <name> -n argocd -o yaml` → `.status.conditions` |
| TLS cert not issuing | `kubectl get certificate -n <ns>` |
| Node not Ready | `kubectl describe node <name>` → Events + Conditions |
| EKS ALB not creating | `kubectl describe ingress <name> -n <ns>` → check controller logs |
| Cluster-wide chaos | `kubectl get events -A --sort-by='.lastTimestamp' \| tail -30` |
| Not sure where to start | Run all three Level 1 commands |

---

## Level 1: Immediate Triage (always run first)

```bash
kubectl get nodes -o wide
kubectl get pods -A | grep -Ev '(Running|Completed)'
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
```
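
The `grep` filter in Level 1 skips every pod whose status is `Running`, including pods that are running but have containers failing readiness (for example `1/2`). A minimal sketch of a stricter filter follows. It assumes the default `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS), and `unhealthy_pods` is a hypothetical helper name, not part of kubectl. Pipe into it with `kubectl get pods -A | unhealthy_pods`.

```bash
# Reads `kubectl get pods -A` output on stdin and prints the header plus
# every pod that is not healthy. Assumed columns: NAMESPACE NAME READY STATUS.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }               # keep the header row
    $4 == "Completed" { next }            # finished Jobs are fine
    {
      split($3, ready, "/")               # READY looks like "1/2"
      # Print anything not Running, or Running with fewer ready containers
      # than the pod declares.
      if ($4 != "Running" || ready[1] != ready[2]) print
    }
  '
}
```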

Read the events output carefully, since it frequently names the exact problem.

---

## Level 2: Narrow to Failing Resource

```bash
kubectl describe pod <name> -n <ns>            # Events section is the most useful part
kubectl logs <pod> -n <ns> --previous          # If pod restarted
kubectl logs <pod> -n <ns> -c <container>      # Multi-container pods
```

---

## Level 3: Root Causes by Symptom

### Pod stuck `Pending`

1. Check describe events for `FailedScheduling`: resource constraints, taints/tolerations, affinity rules
2. Check PVCs: `kubectl get pvc -n <ns>`
   - **k3s:** If PVC Pending, check NFS provisioner: `kubectl get pods -n nfs-provisioner`
   - **EKS:** Check EBS CSI driver: `kubectl get pods -n kube-system -l app=ebs-csi-controller`; verify IRSA annotation on ServiceAccount
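
Step 1 can be narrowed with an event field selector so that only scheduling failures for one pod are listed. This is a sketch under assumptions: `pending_pod_events` is a hypothetical wrapper name, and it relies on the `reason` and `involvedObject.*` field selectors that the Events API accepts.

```bash
# Lists only FailedScheduling events for a single pod, oldest first.
# $1 = namespace, $2 = pod name
pending_pod_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2,reason=FailedScheduling" \
    --sort-by=.lastTimestamp
}
```

Call it as `pending_pod_events <ns> <pod>`; the message column usually states which constraint (resources, taint, affinity, unbound PVC) blocked scheduling.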

### `CrashLoopBackOff`

1. `kubectl logs <pod> -n <ns> --previous`: look for panic, missing env var, missing file, bad config
2. Check ExternalSecret synced: `kubectl get externalsecret -n <ns>` (`SecretSyncedError` is common)
3. Check dependent services (DB, cache, upstream API)
4. **k3s ArgoCD:** Check sync-wave ordering: the ExternalSecret must have a lower wave number than the Deployment
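
When the previous logs are empty, the container's last exit code still narrows things down. The sketch below is illustrative only: `last_termination` and `explain_exit_code` are hypothetical helper names, and the code-to-meaning table covers just the common cases (codes above 128 follow the usual 128 + signal number convention).

```bash
# Prints name, termination reason, and exit code of each container's
# previous run. $1 = pod, $2 = namespace
last_termination() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Maps an exit code to a likely meaning. $1 = exit code
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; the process is not long-running" ;;
    1)   echo "application error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (often OOMKilled)" ;;
    139) echo "SIGSEGV" ;;
    143) echo "SIGTERM" ;;
    *)   echo "unrecognized; check the application's documentation" ;;
  esac
}
```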

### ArgoCD sync failing (k3s)

```bash
kubectl get application <name> -n argocd -o yaml   # .status.conditions
kubectl get application <name> -n argocd -o jsonpath='{.status.operationState.message}'
```

- **OutOfSync on immutable field** → manually delete the resource, then re-sync
- **ExternalSecret missing** → check OpenBao (see "External Secrets not syncing")
- Force refresh without sync: ArgoCD UI → hard refresh, or:
  ```bash
  kubectl annotate application <name> -n argocd argocd.argoproj.io/refresh=hard
  ```
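
Reading the full Application YAML is slow when only the status matters. A possible shortcut, assuming the standard Application status fields (`.status.sync.status`, `.status.health.status`, `.status.conditions`); `argocd_app_summary` is a hypothetical helper name:

```bash
# Prints "<sync status> / <health status>" followed by one line per
# condition. $1 = Application name (namespace argocd, as used above)
argocd_app_summary() {
  kubectl get application "$1" -n argocd \
    -o jsonpath='{.status.sync.status}{" / "}{.status.health.status}{"\n"}{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
}
```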

### External Secrets not syncing

```bash
kubectl describe externalsecret <name> -n <ns>      # .status.conditions
kubectl get clustersecretstore openbao -o yaml      # check Ready condition
kubectl exec -n openbao openbao-0 -- bao status     # check sealed/unsealed
```

- **OpenBao sealed:** Normally auto-unseals via OCI KMS. If stuck:
  ```bash
  kubectl exec -it -n openbao openbao-0 -- bao operator unseal   # -it so the key prompt works
  ```
- **ClusterSecretStore not Ready:** Check the ESO controller logs:
  ```bash
  kubectl logs -n external-secrets deploy/external-secrets -f
  ```
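
The checks above can be scripted. This sketch assumes OpenBao keeps the Vault convention for `status` exit codes (0 unsealed, 2 sealed, anything else an error) and that `kubectl exec` passes the exit code through; `es_ready` and `seal_state` are hypothetical helper names.

```bash
# Prints the Ready condition status ("True"/"False") of an ExternalSecret.
# $1 = ExternalSecret name, $2 = namespace
es_ready() {
  kubectl get externalsecret "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Translates the exit code of `bao status` into a word.
# $1 = exit code
seal_state() {
  case "$1" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Usage:
#   kubectl exec -n openbao openbao-0 -- bao status >/dev/null 2>&1
#   seal_state "$?"
```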

### `ImagePullBackOff`

```bash
kubectl describe pod <name> -n <ns>   # look for "401 Unauthorized" or "not found"
```

- Wrong image tag → fix in manifest/values
- Missing `imagePullSecrets` entry → verify the secret exists: `kubectl get secret -n <ns>`
- **k3s Harbor auth:** Ensure the secret references `registry.ctz.fyi` and is attached to the ServiceAccount or pod spec
- Registry unreachable → check Harbor pod health: `kubectl get pods -n harbor`
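
To confirm that a pull secret really covers the registry, decode it and look for the host. A minimal sketch, assuming the secret is of type `kubernetes.io/dockerconfigjson`; `pull_secret_json` and `has_registry` are hypothetical helper names.

```bash
# Prints the decoded .dockerconfigjson of a pull secret.
# $1 = secret name, $2 = namespace
pull_secret_json() {
  kubectl get secret "$1" -n "$2" -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
}

# Succeeds if the decoded JSON on stdin mentions the registry host.
# $1 = registry host, for example registry.ctz.fyi
has_registry() {
  grep -Fq -- "$1"
}

# Usage:
#   pull_secret_json <secret> <ns> | has_registry registry.ctz.fyi && echo "registry present"
```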

### IngressRoute / TLS not working (k3s)

```bash
kubectl get certificate -n <ns>               # Ready=False = problem
kubectl describe certificate <name> -n <ns>   # check Events
kubectl get ingressroute -n <ns>
kubectl get ingress -n <ns>                   # cert-manager only watches standard Ingress
```

- cert-manager does not watch Traefik `IngressRoute` resources. Without a standard `Ingress` alongside the `IngressRoute` (or an explicit `Certificate`), the cert won't issue
- Check Traefik pods: `kubectl get pods -n traefik`

### EKS: Node not joining

```bash
kubectl get configmap aws-auth -n kube-system -o yaml   # verify node IAM role mapped
# On the node:
journalctl -u kubelet -n 100
```

- Check security groups: nodes need port 443 outbound to control plane endpoint
- Check node IAM role has `AmazonEKSWorkerNodePolicy`, `AmazonEKS_CNI_Policy`, `AmazonEC2ContainerRegistryReadOnly`

### EKS: ALB/NLB not creating

```bash
kubectl describe ingress <name> -n <ns>
kubectl logs -n kube-system deploy/aws-load-balancer-controller | tail -50
```

- Verify the ingress class: `spec.ingressClassName: alb`, or the older `kubernetes.io/ingress.class: alb` annotation
- Check IRSA: ServiceAccount must have `eks.amazonaws.com/role-arn` annotation
- Check controller has correct IAM permissions (policy document)

---

## Level 4: System-Level Checks

```bash
# k3s control plane
kubectl get componentstatuses          # deprecated since Kubernetes v1.19
kubectl get --raw='/readyz?verbose'    # per-check API server readiness
# On master nodes:
systemctl status k3s

# Cilium (k3s)
kubectl -n kube-system exec ds/cilium -- cilium status
kubectl -n kube-system get pods -l k8s-app=cilium

# Resource pressure (both environments)
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -20

# EKS cluster info
aws eks describe-cluster --name <cluster> --region <region>
```
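
On a cluster with more than a few nodes, the resource pressure check is easier to read when only the busy nodes are shown. A sketch, assuming the usual `kubectl top nodes` column order (NAME, CPU cores, CPU%, memory bytes, MEMORY%); `hot_nodes` is a hypothetical helper name, used as `kubectl top nodes | hot_nodes 85`.

```bash
# Reads `kubectl top nodes` output on stdin and prints name, CPU%, and
# MEMORY% for nodes at or above the threshold in either column.
# $1 = threshold percent (default 80)
hot_nodes() {
  awk -v limit="${1:-80}" '
    NR == 1 { next }                       # skip the header row
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem) # "97%" -> "97"
      # "<unknown>" values evaluate to 0 and are therefore never reported.
      if (cpu + 0 >= limit || mem + 0 >= limit) print $1, $3, $5
    }
  '
}
```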

---

## Level 5: Logs via Grafana (k3s)

Grafana: `grafana.monitoring.ctz.fyi`

**Loki log queries:**
```
{namespace="<ns>"}
{namespace="<ns>", app="<name>"} |= "error"
{namespace="<ns>"} | logfmt | level="error"
```

**Mimir (metrics):** Check CPU/memory graphs around the time of failure. Spikes often correlate with OOMKills or with CPU throttling, and throttling does not show up in `kubectl describe`.

---

## Live Debugging Inside a Container

```bash
kubectl exec -it <pod> -n <ns> -- /bin/sh
# or if bash available:
kubectl exec -it <pod> -n <ns> -- bash
# multi-container:
kubectl exec -it <pod> -n <ns> -c <container> -- /bin/sh
```
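
If the image ships no shell at all (distroless or scratch-based), `kubectl exec` has nothing to run. One option is an ephemeral debug container via `kubectl debug`. This is a sketch, not part of the original runbook: `debug_shell` is a hypothetical wrapper name, `busybox` is an assumed default image, and `--target` needs a container runtime that supports sharing the target's process namespace.

```bash
# Attaches an ephemeral container with a shell to a running pod.
# $1 = pod, $2 = namespace, $3 = container to target,
# $4 = debug image (optional, defaults to busybox)
debug_shell() {
  kubectl debug -it "$1" -n "$2" --target="$3" --image="${4:-busybox}" -- sh
}
```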

Use for: verifying env vars, testing connectivity (`curl`, `wget`, `nslookup`), checking mounted files.

---

## Restart vs Dig Deeper

**Restart first when:**
- Pod is in unknown/evicted state with no clear cause
- You've already identified the root cause and fixed it
- OOMKilled and you're about to bump memory limits

**Dig deeper first when:**
- CrashLoopBackOff with no obvious cause (deleting the pod discards the `--previous` logs)
- Data loss risk
- Same pod keeps restarting after restart → there's a real problem, not a transient one
- Multiple pods affected → likely systemic, not pod-specific

**Never restart ArgoCD-managed resources directly.** ArgoCD will re-sync to desired state. Fix the underlying cause (secret, config, image) and let ArgoCD reconcile, or trigger a manual sync.