traefik Runbook
Metadata
| Field | Value |
|---|---|
| Service | traefik |
| Criticality | Tier 0 |
| Owner | Platform / Networking owner |
| Namespace | traefik |
| Clusters | homelab (via the legacy prod overlay), local, jls |
| Last validated | 2026-04-17 |
| Related service page | ../services/traefik.md |
Trigger Conditions
- External services become unreachable or return TLS errors.
- Dashboard URL stops responding or enters an auth loop.
- LoadBalancer external IP is missing or wrong.
- Let's Encrypt renewals fail or certificates expire unexpectedly.
1. Health Checks
Use these commands first to establish scope.
kubectl -n traefik get deploy,svc,pod,pvc,ingressroute,middleware,tlsoption,tlsstore
kubectl -n traefik get pods -o wide
kubectl -n traefik describe pod -l app.kubernetes.io/name=traefik
kubectl -n traefik logs deploy/traefik --tail=200
kubectl -n traefik get svc traefik -o wide
Probe verification
Traefik uses /ping for readiness and liveness on port 80.
kubectl -n traefik describe pod <traefik-pod-name>
kubectl -n traefik port-forward deploy/traefik 9000:80
curl -I http://127.0.0.1:9000/ping
Record:
- whether the pod is failing readiness, liveness, or both
- whether /ping is healthy but the Service lacks a LoadBalancer IP
- whether a middleware or certificate issue is the actual user-facing problem
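The /ping check can be wrapped in a small helper so the result is unambiguous when pasted into an incident log. This is a minimal sketch rather than part of the deployed tooling: the function name `check_ping` is hypothetical, and it assumes the port-forward from the commands above is running in another terminal.

```shell
# check_ping URL: succeed only if the endpoint answers with HTTP 200.
# Hypothetical helper; not part of the Traefik deployment.
check_ping() {
  # curl prints 000 as the status code when it cannot connect at all.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || code=000
  if [ "$code" = "200" ]; then
    echo "ping OK ($1)"
  else
    echo "ping FAILED ($1): HTTP $code" >&2
    return 1
  fi
}
```

With the port-forward active, `check_ping http://127.0.0.1:9000/ping` returns zero on HTTP 200 and non-zero otherwise, which separates "pod is unhealthy" from "pod is healthy but unreachable from outside".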
2. Troubleshooting Workflows
LoadBalancer has no external IP or traffic never reaches Traefik
kubectl -n traefik get svc traefik -o yaml
kubectl -n metallb-system get ipaddresspools,l2advertisements
kubectl -n metallb-system logs deploy/controller --tail=200
kubectl -n metallb-system logs daemonset/speaker --tail=200
Check:
- Service type is still LoadBalancer in the active overlay
- the requested address is inside the active MetalLB pool
- MetalLB controller and speaker are healthy
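Whether the requested address falls inside the pool can be checked mechanically when the pool is written as a start-end range. The sketch below illustrates this under that assumption (CIDR-style pools are not handled). The helper names `ip_to_int` and `ip_in_range` and any addresses used with them are hypothetical.

```shell
# ip_to_int A.B.C.D: print the IPv4 address as a single integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

# ip_in_range IP START END: succeed if START <= IP <= END.
ip_in_range() {
  ip=$(ip_to_int "$1")
  lo=$(ip_to_int "$2")
  hi=$(ip_to_int "$3")
  [ "$ip" -ge "$lo" ] && [ "$ip" -le "$hi" ]
}
```

One way to use it: read the assigned address with `kubectl -n traefik get svc traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'`, read the pool with `kubectl -n metallb-system get ipaddresspools -o jsonpath='{.items[*].spec.addresses}'`, then pass the address and the two ends of the range to `ip_in_range` and check the exit status.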
Pod restarts or Traefik fails to boot
kubectl -n traefik logs deploy/traefik --previous
kubectl -n traefik describe pod <traefik-pod-name>
kubectl -n traefik get configmap traefik-config -o yaml
Check:
- invalid static or dynamic config rendered into traefik-config
- missing provider secrets such as cloudflare or do-auth-token
- PVC mount issues on traefik-pvc
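The secret check can be scripted so that every missing name is reported in one pass instead of stopping at the first error. This is a sketch only: `require_secrets` is a hypothetical helper, and the secret names passed to it should be whatever the active overlay actually defines.

```shell
# require_secrets NAMESPACE NAME...: print the status of each secret and
# return non-zero if at least one is missing. Hypothetical helper.
require_secrets() {
  ns=$1
  shift
  missing=0
  for name in "$@"; do
    if kubectl -n "$ns" get secret "$name" >/dev/null 2>&1; then
      echo "present: $name"
    else
      echo "MISSING: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_secrets traefik cloudflare do-auth-token` checks the two provider secrets named above. Note that a failed lookup can also mean missing RBAC permissions, so a "MISSING" line is a prompt to look closer rather than proof.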
Certificate issuance or renewal fails
kubectl -n traefik logs deploy/traefik --tail=400 | grep -i acme
kubectl -n traefik get secret cloudflare do-auth-token cloudflare-origin-cert
kubectl -n traefik get pvc traefik-pvc
Check:
- DNS provider token secret is present and correct
- ACME storage on traefik-pvc is writable
- certResolver names in IngressRoute or TLS config still match the active static configuration
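To compare certResolver names against the static configuration, it helps to list which resolver each IngressRoute references. The following is a rough sketch under two assumptions: the routes set `spec.tls.certResolver`, and the short resource name `ingressroute` resolves to a single CRD group in the cluster. Both function names are hypothetical.

```shell
# list_cert_resolvers: one line per IngressRoute, "namespace/name<TAB>resolver".
# The resolver column is empty when the route does not set one.
list_cert_resolvers() {
  kubectl get ingressroute --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.tls.certResolver}{"\n"}{end}'
}

# count_cert_resolvers: read the listing on stdin and count routes per resolver.
count_cert_resolvers() {
  awk -F'\t' '{ r = ($2 == "" ? "(none)" : $2); n[r]++ }
              END { for (r in n) print n[r], r }' | sort
}
```

Running `list_cert_resolvers | count_cert_resolvers` gives a short summary. Any resolver name that does not appear under `certificatesResolvers` in the rendered traefik-config is a candidate cause of failed issuance.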
Dashboard route fails or redirects endlessly through auth
kubectl -n traefik get ingressroute traefik-dashboard -o yaml
kubectl -n traefik get middleware
kubectl -n auth get pods
kubectl -n traefik logs deploy/traefik --tail=200 | grep -i auth
Check:
- dashboard host matches the overlay you deployed
- Authelia is reachable on the expected internal address
- middleware reference uses the correct provider namespace or suffix for the active overlay
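An endless auth redirect can be confirmed from outside the cluster by letting curl follow redirects with a hard limit. This is a hedged sketch: `follow_redirects` is a hypothetical helper, the dashboard URL must be supplied by the operator, and an unauthenticated request is expected to stop at the login portal instead of looping.

```shell
# follow_redirects URL: follow at most 10 redirects and report where the
# request ended. curl exits with code 47 when the redirect limit is hit.
follow_redirects() {
  curl -sS -o /dev/null -L --max-redirs 10 \
    -w '%{http_code} after %{num_redirects} redirect(s), final URL %{url_effective}\n' \
    "$1"
  rc=$?
  if [ "$rc" -eq 47 ]; then
    echo "redirect loop suspected for $1" >&2
  fi
  return "$rc"
}
```

A healthy route typically ends after one or two redirects at the auth portal. Exit code 47 points at the middleware chain or the Authelia address rather than at Traefik itself being down.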
3. Disaster Recovery
Preconditions
- Confirm whether the outage is caused by MetalLB reachability, Traefik config, or provider credential issues.
- Identify the active overlay for the affected cluster.
- Verify the required secret source files and PVC are available.
Stateful workload recovery
Traefik state is small but important because ACME data lives on a PVC.
- Preserve or snapshot traefik-pvc when possible before destructive actions.
- Restore the active overlay secrets and config files.
- Reapply the correct overlay with Kustomize.
- Restore the PVC if ACME state was lost and certificate continuity matters.
- Validate /ping, external IP assignment, and dashboard access.
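When a volume snapshot is not available, a file-level copy of the ACME state is a possible fallback before destructive actions. The sketch below rests on assumptions that must be verified against the active overlay: that the ACME file is mounted at `/data/acme.json` and that the container image provides `cat`. The function name `backup_acme` is hypothetical.

```shell
# backup_acme [DEST]: copy the ACME state out of the running pod.
# ASSUMPTION: the state file lives at /data/acme.json inside the container.
backup_acme() {
  dest=${1:-acme-$(date +%Y%m%d-%H%M%S).json}
  kubectl -n traefik exec deploy/traefik -- cat /data/acme.json > "$dest" || return 1
  # An empty file means no ACME state was captured.
  if [ ! -s "$dest" ]; then
    echo "backup is empty: $dest" >&2
    return 1
  fi
  # The file contains private keys, so restrict access to the owner.
  chmod 600 "$dest"
  echo "saved $dest"
}
```

The resulting file holds certificate private keys and should be stored like any other secret. This only works while the pod can still start; if it cannot, the PVC has to be mounted from a temporary pod instead.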
Cluster rebuild dependency order
- Cluster networking and MetalLB
- Namespace traefik and required secrets
- traefik-pvc or replacement certificate strategy
- Traefik deployment and CRDs
- Protected routes and dashboard validation
4. Scaling and Resource Management
Preferred path: adjust the overlay manifests and roll through GitOps.
Use these commands to size the problem before changing resources:
kubectl -n traefik top pod
kubectl -n traefik get deploy traefik -o yaml
kubectl -n traefik describe deploy traefik
Record:
- sustained CPU or memory pressure on the ingress pod
- request volume versus access log behavior
- whether scaling above one replica would conflict with the current ReadWriteOnce ACME storage model
Current guidance: do not increase replicas casually while the ACME state is stored on a single PVC.
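The replica and storage constraint can be read directly from the cluster before any scaling change. This is an illustrative sketch: `acme_storage_report` is a hypothetical helper, and it only reports the current state without judging whether a different storage model would make extra replicas safe.

```shell
# acme_storage_report: print the replica count and the PVC access modes,
# and return 1 if more than one replica shares ReadWriteOnce storage.
acme_storage_report() {
  replicas=$(kubectl -n traefik get deploy traefik -o jsonpath='{.spec.replicas}') || return 2
  modes=$(kubectl -n traefik get pvc traefik-pvc -o jsonpath='{.spec.accessModes[*]}') || return 2
  echo "replicas=$replicas accessModes=$modes"
  case " $modes " in
    *" ReadWriteOnce "*)
      if [ "$replicas" -gt 1 ]; then
        echo "WARNING: more than one replica with ReadWriteOnce ACME storage" >&2
        return 1
      fi
      ;;
  esac
}
```

A return code of 1 means the deployment already violates the guidance above, and a return code of 2 means one of the lookups failed.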
5. Maintenance Procedures
- Rotate DNS provider tokens.
- Rotate dashboard credentials and origin certificates.
- Review Traefik image and CRD version alignment before upgrades.
- Validate dashboard hostnames and cross-cluster ExternalName routes after overlay changes.
For each task, define:
- Preconditions: active overlay identified and backup of provider secrets available
- Impact window: possible certificate or routing disruption during rollout
- Rollback path: revert overlay revision and reapply previous secrets
- Validation steps: /ping, dashboard route, and one representative service route succeed
6. Rollback Strategy
Document the fastest safe rollback path:
- Revert the affected overlay revision.
- Restore the previous secretGenerator source files.
- If ACME state was corrupted, restore the previous traefik-pvc snapshot or delete the bad state and allow re-issuance when acceptable.
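The overlay revert can be sketched as a short sequence, assuming the overlay lives in a Git working tree and is applied manually with Kustomize. The function name `rollback_overlay`, the commit id, and the overlay path are all placeholders. If a GitOps controller reconciles the repository, pushing the revert may be enough and the manual apply step can be skipped.

```shell
# rollback_overlay COMMIT OVERLAY_DIR: revert one commit, reapply the
# overlay, and wait for the Traefik rollout. Hypothetical helper.
rollback_overlay() {
  commit=$1
  overlay_dir=$2
  git revert --no-edit "$commit" || return 1
  kubectl apply -k "$overlay_dir" || return 1
  # Block until the deployment reports a completed rollout or times out.
  kubectl -n traefik rollout status deploy/traefik --timeout=120s
}
```

The function stops at the first failing step, so a failed revert never leads to an apply of a half-reverted tree. Secrets generated from local source files are not restored by the Git revert and still need to be put back separately.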
7. Post-Incident Actions
After recovery, always:
- Update CHANGELOG.md if traffic was manually rerouted, certificates were restored, or secrets were rotated outside the normal workflow.
- Update the Traefik service page if hosts, middleware chains, provider strategy, or metrics wiring changed.
- Update this runbook if the incident revealed a missing auth, certificate, or MetalLB diagnostic step.
- Capture any follow-up work to remove legacy prod overlay assumptions.