autojanet/skills/network-debugging/SKILL.md
Zoë cc74ad0bd0
fix: use library/ Harbor project, add skills, fix pipeline secrets
- .woodpecker.yaml: image paths -> library/autojanet-{agent,dispatcher}
- .woodpecker.yaml: secret names RS_HARBOR_USER / RS_HARBOR_PASS (global)
- container/Dockerfile: restore COPY skills/, skills/ populated from opencode config
- skills/: 84 opencode skills bundled into image
- k8s/manifests: update image refs to library/
2026-05-30 15:43:14 -07:00


---
name: network-debugging
description: Use when diagnosing network connectivity issues in Zoe's homelab or work environments — DNS not resolving, TLS cert stuck, service unreachable, ingress not routing, Cilium dropping packets, or Pangolin tunnel not working.
---
# Network Debugging
## Overview
Systematic outside-in debugging for Zoe's homelab stack: DigitalOcean DNS + BIND9 split-horizon, cert-manager DNS-01, Traefik IngressRoute, Cilium CNI, and Pangolin tunnels.
**Rule:** Always work from outside in. DNS → TLS → Ingress → Pod → Cilium → Pangolin.
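The outside-in rule can be sketched as a small runner that executes one check per level and stops at the first failure, so debugging starts at the outermost broken layer. This is a minimal sketch, not part of the original skill: `run_outside_in`, `check_dns`, `check_endpoints`, and the `HOST`, `SVC`, and `NS` variables are hypothetical names introduced for illustration.

```bash
# Run the given check functions in order and stop at the first failure.
# Each argument is the name of a function or command that returns 0 on success.
run_outside_in() {
  local check
  for check in "$@"; do
    if "$check"; then
      echo "ok:   $check"
    else
      echo "FAIL: $check (start debugging at this level)"
      return 1
    fi
  done
}

# Hypothetical example checks; HOST, SVC and NS are assumed to be set by the caller.
check_dns() {
  # Passes when the public resolver returns at least one record.
  [ -n "$(dig +short "$HOST" @8.8.8.8)" ]
}

check_endpoints() {
  # Passes when the Service has at least one ready endpoint address.
  [ -n "$(kubectl get endpoints "$SVC" -n "$NS" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')" ]
}
```

A call such as `HOST=<hostname> SVC=<svc> NS=<ns> run_outside_in check_dns check_endpoints` would then report the first level that fails; checks for the remaining levels can be added in the same order as the rule.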
## Quick Symptom → First Command
| Symptom | First command |
|---------|---------------|
| Can't reach service from browser | `dig <hostname> @8.8.8.8` |
| Certificate expired / not trusted | `kubectl get certificate -n <ns>` |
| cert-manager stuck in Pending | `kubectl get challenge -A` |
| Service resolves but connection refused | `kubectl get endpoints <svc> -n <ns>` |
| Works internally, not externally | Check Pangolin annotations + external-dns target |
| Works externally, not from cluster | `kubectl run -it --rm nettest --image=nicolaka/netshoot --restart=Never -- bash` |
| Pod can't reach external internet | `kubectl get ciliumnetworkpolicy -n <ns>` (check egress rules) |
| DNS resolves wrong IP | Compare `dig @8.8.8.8` vs `dig @10.0.6.6` (split-horizon issue) |
## Level 1: DNS
```bash
# Public DNS
dig <hostname> @8.8.8.8
dig <hostname> @ns1.digitalocean.com
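# Internal DNS (from within cluster); busybox is pinned to 1.28 because
# nslookup in later busybox images is known to give misleading results
kubectl run -it --rm dnsutils --image=busybox:1.28 --restart=Never -- nslookup <hostname>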
# ACME challenge record (cert-manager DNS-01)
dig TXT _acme-challenge.<hostname> @ns1.digitalocean.com
# ExternalDNS registration
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns | tail -20
```
- **Stack:** DigitalOcean (ctz.fyi public) + BIND9 (10.0.6.6, split-horizon internal)
- **Public NS:** ns1/ns2/ns3.digitalocean.com
- **Domains:** `*.ctz.fyi` (public), `*.i.ctz.fyi` (internal only)
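The split-horizon comparison can be scripted so both answers are always seen side by side. The sketch below assumes the two resolvers named above (`8.8.8.8` and `10.0.6.6`); the function names `classify_answers` and `compare_split_horizon` are hypothetical.

```bash
# Classify a pair of answers: public resolver first, internal resolver second.
classify_answers() {
  local public="$1" internal="$2"
  if [ -z "$public" ] && [ -z "$internal" ]; then
    echo "unresolved"
  elif [ -z "$public" ]; then
    echo "internal-only"
  elif [ -z "$internal" ]; then
    echo "public-only"
  elif [ "$public" = "$internal" ]; then
    echo "same"
  else
    echo "split"
  fi
}

# Query both resolvers for a hostname and print the answers plus a verdict.
compare_split_horizon() {
  local host="$1" public internal
  # Sort and flatten so record order does not affect the comparison.
  public=$(dig +short "$host" @8.8.8.8 | sort | tr '\n' ' ')
  internal=$(dig +short "$host" @10.0.6.6 | sort | tr '\n' ' ')
  echo "public:   ${public:-<none>}"
  echo "internal: ${internal:-<none>}"
  classify_answers "$public" "$internal"
}
```

For a name under `*.i.ctz.fyi`, `internal-only` would be the expected verdict given the domains listed above. Whether `split` or `same` is correct for a public name depends on how the BIND9 zones are configured, which this document does not specify.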
## Level 2: TLS / cert-manager
```bash
# Certificate status
kubectl get certificate -n <namespace>
kubectl describe certificate <name> -n <namespace>
# Active ACME challenge
kubectl get challenge -A
kubectl describe challenge <name> -n <namespace>
# cert-manager errors
kubectl logs -n cert-manager deploy/cert-manager | grep -i error | tail -20
# Verify cert in secret
kubectl get secret <name>-tls -n <namespace> \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
```
**Common issue:** cert-manager can't create DNS TXT record
- Check DigitalOcean token: `kubectl get secret digitalocean-dns -n cert-manager`
- Check cert-manager egress: a Cilium NetworkPolicy may block outbound DNS (UDP 53) or HTTPS (TCP 443) to the DigitalOcean API and the ACME server
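To find stuck certificates across all namespaces in one pass, the certificate listing can be filtered on its READY column. This sketch assumes the default cert-manager column layout for `kubectl get certificate -A` (NAMESPACE, NAME, READY, SECRET, AGE); `not_ready_certs` is a hypothetical helper name.

```bash
# Read `kubectl get certificate -A --no-headers` on stdin and print
# namespace/name for every certificate whose READY column is not "True".
not_ready_certs() {
  awk '$3 != "True" { print $1 "/" $2 }'
}
```

Usage would be `kubectl get certificate -A --no-headers | not_ready_certs`; each name it prints is a candidate for `kubectl describe certificate` and `kubectl get challenge -A`.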
## Level 3: Ingress / Traefik
```bash
# Check IngressRoute
kubectl get ingressroute -n <namespace> -o yaml
# Traefik logs for hostname
kubectl logs -n traefik deploy/traefik | grep <hostname>
```
**Critical gotcha:** cert-manager's ingress-shim watches `Ingress` objects, not Traefik `IngressRoute` CRDs.
In this stack you **must** have both:
- `IngressRoute`: actual routing
- `Ingress`: cert-manager TLS issuance + external-dns registration

Missing the companion `Ingress` = cert never issued, hostname never registered.
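A companion `Ingress` could look roughly like the manifest printed by the sketch below. Several details are assumptions rather than facts from this stack: the function name `render_companion_ingress`, the issuer name `letsencrypt-dns01`, the `<name>-tls` secret naming (borrowed from the Level 2 command), and the omission of `ingressClassName`. Only the `cert-manager.io/cluster-issuer` annotation key and the `networking.k8s.io/v1` schema are standard.

```bash
# Print a companion Ingress manifest for an existing IngressRoute.
# Arguments: name, namespace, hostname, backend service name, backend port.
render_companion_ingress() {
  local name="$1" namespace="$2" host="$3" service="$4" port="$5"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${name}
  namespace: ${namespace}
  annotations:
    # Hypothetical issuer name; use the ClusterIssuer that exists in the cluster.
    cert-manager.io/cluster-issuer: letsencrypt-dns01
spec:
  tls:
    - hosts:
        - ${host}
      secretName: ${name}-tls
  rules:
    - host: ${host}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${service}
                port:
                  number: ${port}
EOF
}
```

The output can be inspected before applying it, for example with `render_companion_ingress <name> <namespace> <hostname> <service> 80 | kubectl apply --dry-run=client -f -`.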
## Level 4: Pod Connectivity
```bash
# Open a shell inside the cluster
kubectl run -it --rm nettest --image=nicolaka/netshoot --restart=Never -- bash
# Then, inside the netshoot shell:
#   curl http://<service>.<namespace>.svc.cluster.local
#   nslookup <service>.<namespace>.svc.cluster.local
#   curl -v https://<external-hostname>
# Check service has endpoints (pod actually behind service?)
kubectl get endpoints <service> -n <namespace>
```
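An empty endpoints list is the usual reason a service resolves but refuses connections, so it can help to turn the check into a pass/fail result. A minimal sketch, with hypothetical function names `describe_endpoints` and `check_service_endpoints`:

```bash
# Interpret a space-separated list of ready endpoint IPs for a Service.
describe_endpoints() {
  # Intentional word splitting: $1 is a space-separated list of IPs.
  set -- $1
  if [ "$#" -eq 0 ]; then
    echo "no ready endpoints: check the Service selector and pod readiness"
    return 1
  fi
  echo "$# ready endpoint(s): $*"
}

# Fetch the ready addresses behind a Service and describe them.
check_service_endpoints() {
  local service="$1" namespace="$2" ips
  ips=$(kubectl get endpoints "$service" -n "$namespace" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  describe_endpoints "$ips"
}
```

`check_service_endpoints <service> <namespace>` returns non-zero when nothing is behind the service, which usually points at a label selector mismatch or pods failing their readiness probe.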
## Level 5: Cilium
```bash
# Cilium status
kubectl exec -n kube-system ds/cilium -- cilium status
# Dropped flows
kubectl exec -n kube-system ds/cilium -- \
  hubble observe --namespace <ns> --verdict DROPPED
# Active policies
kubectl get networkpolicy -n <namespace>
kubectl get ciliumnetworkpolicy -n <namespace>
# Pod identity
kubectl exec -n kube-system ds/cilium -- cilium endpoint list | grep <pod-ip>
```
## Level 6: Pangolin Tunnel
```bash
# Check annotations on IngressRoute
kubectl get ingressroute <name> -n <namespace> -o yaml | grep pangolin
# Pangolin/Newt pod health
kubectl get pods -n pangolin
kubectl logs -n pangolin <newt-pod>
```
**Required annotations for Pangolin-routed services:**
```yaml
annotations:
  pangolin.fossorial.io/enabled: "true"
  external-dns.alpha.kubernetes.io/target: "external"
```
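Checking for both annotations can be done in one step by piping the object's YAML through a small filter. The sketch below only checks that the two annotation keys listed above are present, not their values, because `kubectl` may print the values with or without quotes; `check_pangolin_annotations` is a hypothetical helper name.

```bash
# Read an object's YAML on stdin and report which required
# Pangolin-related annotation keys are missing.
check_pangolin_annotations() {
  local yaml missing=0 key
  yaml=$(cat)
  for key in \
    'pangolin.fossorial.io/enabled' \
    'external-dns.alpha.kubernetes.io/target'
  do
    # Match "key:" literally; values are not inspected.
    if ! printf '%s\n' "$yaml" | grep -qF "${key}:"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage would be `kubectl get ingressroute <name> -n <namespace> -o yaml | check_pangolin_annotations`, which prints nothing and returns 0 when both keys are present.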
## EKS / Cloud Extras
```bash
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Security group check
aws ec2 describe-security-groups --group-ids sg-xxxx
```
Also check: VPC flow logs, ALB access logs, inbound/outbound security group rules.
## Common Mistakes
| Mistake | Fix |
|---------|-----|
| Only created `IngressRoute`, no `Ingress` | Add companion `Ingress` for cert-manager + external-dns |
| cert-manager can't do DNS-01 | Check DigitalOcean API token secret exists in cert-manager ns |
| Split-horizon confusion | Always compare `@8.8.8.8` vs `@10.0.6.6` explicitly |
| Pangolin service not externally reachable | Verify both annotations are present |
| Cilium blocking cert-manager | Check egress NetworkPolicy for UDP 53 and TCP 443 |