autojanet/skills/network-debugging/SKILL.md
Zoë cc74ad0bd0
2026-05-30 15:43:14 -07:00

name: network-debugging
description: Use when diagnosing network connectivity issues in Zoe's homelab or work environments, such as DNS not resolving, TLS cert stuck, service unreachable, ingress not routing, Cilium dropping packets, or Pangolin tunnel not working.

Network Debugging

Overview

Systematic outside-in debugging for Zoe's homelab stack: DigitalOcean DNS + BIND9 split-horizon, cert-manager DNS-01, Traefik IngressRoute, Cilium CNI, and Pangolin tunnels.

Rule: Always work from outside in. DNS → TLS → Ingress → Pod → Cilium → Pangolin.

Quick Symptom → First Command

| Symptom | First command |
| --- | --- |
| Can't reach service from browser | `dig <hostname> @8.8.8.8` |
| Certificate expired / not trusted | `kubectl get certificate -n <ns>` |
| cert-manager stuck in Pending | `kubectl get challenge -A` |
| Service resolves but connection refused | `kubectl get endpoints <svc> -n <ns>` |
| Works internally, not externally | Check Pangolin annotations + external-dns target |
| Works externally, not from cluster | `kubectl run -it --rm nettest --image=nicolaka/netshoot --restart=Never -- bash` |
| Pod can't reach external internet | Check Cilium NetworkPolicy egress rules |
| DNS resolves wrong IP | Compare `dig <hostname> @8.8.8.8` vs `dig <hostname> @10.0.6.6` (split-horizon issue) |

Level 1: DNS

# Public DNS
dig <hostname> @8.8.8.8
dig <hostname> @ns1.digitalocean.com

# Internal DNS (from within cluster)
kubectl run -it --rm dnsutils --image=busybox --restart=Never -- nslookup <hostname>

# ACME challenge record (cert-manager DNS-01)
dig TXT _acme-challenge.<hostname> @ns1.digitalocean.com

# ExternalDNS registration
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns | tail -20

Stack: DigitalOcean (ctz.fyi public) + BIND9 (10.0.6.6, split-horizon internal)
Public NS: ns1/ns2/ns3.digitalocean.com
Domains: *.ctz.fyi (public), *.i.ctz.fyi (internal only)
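
Because the same name can legitimately resolve differently in the public and internal views, it helps to see both answers side by side. The following is a minimal sketch of that comparison, assuming `dig` is installed; the function names `resolve_a` and `compare_split_horizon` are illustrative, and the resolver addresses are the ones listed above.

```shell
# Resolver addresses come from this document; override them via the environment.
PUBLIC_RESOLVER="${PUBLIC_RESOLVER:-8.8.8.8}"
INTERNAL_RESOLVER="${INTERNAL_RESOLVER:-10.0.6.6}"

# Print the sorted A-record answers for a hostname as seen by one resolver.
resolve_a() {
  dig +short A "$1" "@$2" | sort
}

# Show both views and return non-zero when they differ.
# A difference is expected for split-horizon names and a problem otherwise.
compare_split_horizon() {
  host="$1"
  public="$(resolve_a "$host" "$PUBLIC_RESOLVER")"
  internal="$(resolve_a "$host" "$INTERNAL_RESOLVER")"
  echo "public   ($PUBLIC_RESOLVER): ${public:-none}"
  echo "internal ($INTERNAL_RESOLVER): ${internal:-none}"
  if [ "$public" = "$internal" ]; then
    echo "result: same answer from both views"
  else
    echo "result: answers differ"
    return 1
  fi
}
```

Run it as `compare_split_horizon <hostname>`. An empty internal answer for a `*.i.ctz.fyi` name points at BIND9, while an empty public answer for a `*.ctz.fyi` name points at DigitalOcean or ExternalDNS.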

Level 2: TLS / cert-manager

# Certificate status
kubectl get certificate -n <namespace>
kubectl describe certificate <name> -n <namespace>

# Active ACME challenge
kubectl get challenge -A
kubectl describe challenge <name> -n <namespace>

# cert-manager errors
kubectl logs -n cert-manager deploy/cert-manager | grep -i error | tail -20

# Verify cert in secret
kubectl get secret <name>-tls -n <namespace> \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
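
Reading raw dates is error-prone when deciding whether a renewal is overdue. As a rough sketch, the helper below turns the `notAfter=` line printed by `openssl x509 -noout -enddate` into whole days remaining. It assumes GNU `date` (for `date -d`), and the name `cert_days_left` is hypothetical.

```shell
# Convert an openssl "notAfter=" line into whole days until expiry.
# $1: a line such as "notAfter=Jun  1 12:00:00 2026 GMT"
# $2: optional "now" in epoch seconds (defaults to the current time)
# Assumes GNU date; BSD/macOS date does not accept -d in this form.
cert_days_left() {
  not_after="${1#notAfter=}"          # strip the "notAfter=" prefix
  now="${2:-$(date -u +%s)}"
  expiry="$(date -u -d "$not_after" +%s)" || return 2
  echo $(( (expiry - now) / 86400 ))  # negative once the cert has expired
}
```

To use it with the secret check above, swap `-dates` for `-enddate` and pass the output in: `cert_days_left "$(... | openssl x509 -noout -enddate)"`.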

Common issue: cert-manager can't create DNS TXT record

  • Check DigitalOcean token: kubectl get secret digitalocean-dns -n cert-manager
  • Check outbound DNS (UDP and TCP 53) and HTTPS to the DigitalOcean API (TCP 443): a Cilium NetworkPolicy may block cert-manager egress
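
If an egress policy turns out to be the cause, one possible fix is an explicit allow rule for the cert-manager pods. The sketch below prints a standard Kubernetes NetworkPolicy (which Cilium also enforces) rather than a CiliumNetworkPolicy. The policy name and the pod label `app.kubernetes.io/name: cert-manager` are assumptions; confirm the real labels with `kubectl get pods -n cert-manager --show-labels` before applying anything.

```shell
# Print a NetworkPolicy allowing DNS and HTTPS egress for cert-manager pods.
# $1: namespace (defaults to cert-manager)
# The policy name and the podSelector label are assumptions, not taken from the cluster.
cert_manager_egress_policy() {
  namespace="${1:-cert-manager}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-cert-manager-egress
  namespace: ${namespace}
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: cert-manager
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
        - protocol: TCP
          port: 443
EOF
}
```

Review it without changing the cluster using `cert_manager_egress_policy | kubectl apply --dry-run=client -f -`. The rule has no `to` clause, so it allows these ports to any destination; tighten it if that is too broad.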

Level 3: Ingress / Traefik

# Check IngressRoute
kubectl get ingressroute -n <namespace> -o yaml

# Traefik logs for hostname
kubectl logs -n traefik deploy/traefik | grep <hostname>

Critical gotcha: cert-manager reads Ingress objects, not IngressRoute CRDs. You must have both:

  • IngressRoute — actual routing
  • Ingress — cert-manager TLS issuance + external-dns registration

Missing the companion Ingress = cert never issued, hostname never registered.
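
For illustration, a companion Ingress might look like the manifest printed by the sketch below. Several details are assumptions rather than facts about this cluster: that a ClusterIssuer is used (hence the `cert-manager.io/cluster-issuer` annotation), that the ingress class is named `traefik`, and that the TLS secret follows the `<name>-tls` pattern used in Level 2. The function name `companion_ingress` is hypothetical.

```shell
# Print a companion Ingress for a host that an IngressRoute already routes.
# Args: name namespace hostname service port cluster-issuer
# Assumptions: ClusterIssuer-based issuance, ingress class "traefik",
# and a TLS secret named <name>-tls.
companion_ingress() {
  name="$1"; namespace="$2"; host="$3"; service="$4"; port="$5"; issuer="$6"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${name}
  namespace: ${namespace}
  annotations:
    cert-manager.io/cluster-issuer: ${issuer}
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - ${host}
      secretName: ${name}-tls
  rules:
    - host: ${host}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${service}
                port:
                  number: ${port}
EOF
}
```

Pipe the output to `kubectl apply --dry-run=client -f -` to validate it first. Any Pangolin or external-dns annotations the service needs would go in the same `annotations` map.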

Level 4: Pod Connectivity

# Test from inside the cluster (opens a shell in a throwaway pod)
kubectl run -it --rm nettest --image=nicolaka/netshoot --restart=Never -- bash
# Then run these inside that shell:
#   curl http://<service>.<namespace>.svc.cluster.local
#   nslookup <service>.<namespace>.svc.cluster.local
#   curl -v https://<external-hostname>

# Check service has endpoints (pod actually behind service?)
kubectl get endpoints <service> -n <namespace>

Level 5: Cilium

# Cilium status
kubectl exec -n kube-system ds/cilium -- cilium status

# Dropped flows
kubectl exec -n kube-system ds/cilium -- \
  hubble observe --namespace <ns> --verdict DROPPED

# Active policies
kubectl get networkpolicy -n <namespace>
kubectl get ciliumnetworkpolicy -n <namespace>

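# Pod identity (endpoints are listed per node; ds/cilium picks an arbitrary
# agent, so an empty result may just mean the pod lives on another node)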
kubectl exec -n kube-system ds/cilium -- cilium endpoint list | grep <pod-ip>
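
Since `kubectl exec ds/cilium` picks one agent pod and each agent only knows the endpoints on its own node, it can help to target the agent on the workload's node. A minimal sketch follows, assuming the agent pods run in kube-system with the label `k8s-app=cilium`; the function name `cilium_agent_for_pod` is hypothetical.

```shell
# Print the Cilium agent pod (as pod/<name>) on the same node as a workload pod.
# Args: pod namespace
# Assumes agents run in kube-system with the label k8s-app=cilium.
cilium_agent_for_pod() {
  pod="$1"; namespace="$2"
  # Look up which node the workload pod is scheduled on.
  node="$(kubectl get pod "$pod" -n "$namespace" -o jsonpath='{.spec.nodeName}')" || return 1
  [ -n "$node" ] || { echo "pod $pod is not scheduled on a node" >&2; return 1; }
  # Select the agent pod running on that node.
  kubectl get pods -n kube-system -l k8s-app=cilium \
    --field-selector "spec.nodeName=$node" -o name
}
```

Example: `kubectl exec -n kube-system "$(cilium_agent_for_pod <pod> <namespace>)" -- cilium endpoint list | grep <pod-ip>`.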

Level 6: Pangolin Tunnel

# Check annotations on IngressRoute
kubectl get ingressroute <name> -n <namespace> -o yaml | grep pangolin

# Pangolin/Newt pod health
kubectl get pods -n pangolin
kubectl logs -n pangolin <newt-pod>

Required annotations for Pangolin-routed services:

annotations:
  pangolin.fossorial.io/enabled: "true"
  external-dns.alpha.kubernetes.io/target: "external"
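
The `grep pangolin` check above only surfaces one of the two annotations. As a rough sketch, the helper below reads a manifest on stdin and reports each required key separately. It is a plain text match for the key names, not a YAML parse, and it does not validate the values; the function name `check_pangolin_annotations` is hypothetical.

```shell
# Read a manifest on stdin and report which required annotations are present.
# Returns non-zero if either key is missing. Text match only; values are not checked.
check_pangolin_annotations() {
  manifest="$(cat)"
  missing=0
  for key in 'pangolin.fossorial.io/enabled' 'external-dns.alpha.kubernetes.io/target'; do
    if printf '%s\n' "$manifest" | grep -qF "$key:"; then
      echo "ok:      $key"
    else
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

Example: `kubectl get ingressroute <name> -n <namespace> -o yaml | check_pangolin_annotations`.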

EKS / Cloud Extras

# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Security group check
aws ec2 describe-security-groups --group-ids sg-xxxx

Also check: VPC flow logs, ALB access logs, inbound/outbound security group rules.

Common Mistakes

| Mistake | Fix |
| --- | --- |
| Only created IngressRoute, no Ingress | Add companion Ingress for cert-manager + external-dns |
| cert-manager can't do DNS-01 | Check DigitalOcean API token secret exists in cert-manager ns |
| Split-horizon confusion | Always compare @8.8.8.8 vs @10.0.6.6 explicitly |
| Pangolin service not externally reachable | Verify both annotations are present |
| Cilium blocking cert-manager | Check egress NetworkPolicy for UDP 53 and TCP 443 |