traefik Runbook¶

Metadata¶

Field	Value
Service	traefik
Criticality	Tier 0
Owner	Platform / Networking owner
Namespace	traefik
Clusters	homelab via legacy prod overlay, local, jls
Last validated	2026-04-17
Related service page	../services/traefik.md

Trigger Conditions¶

External services become unreachable or return TLS errors.
Dashboard URL stops responding or enters an auth loop.
LoadBalancer external IP is missing or wrong.
Let's Encrypt renewals fail or certificates expire unexpectedly.

1. Health Checks¶

Use these commands first to establish scope.

kubectl -n traefik get deploy,svc,pod,pvc,ingressroute,middleware,tlsoption,tlsstore
kubectl -n traefik get pods -o wide
kubectl -n traefik describe pod -l app.kubernetes.io/app=traefik
kubectl -n traefik logs deploy/traefik --tail=200
kubectl -n traefik get svc traefik -o wide

Probe verification¶

Traefik uses /ping for readiness and liveness on port 80.

kubectl -n traefik describe pod <traefik-pod-name>
kubectl -n traefik port-forward deploy/traefik 9000:80
curl -I http://127.0.0.1:9000/ping

Record:

whether the pod is failing readiness, liveness, or both
whether /ping is healthy but the Service lacks a LoadBalancer IP
whether a middleware or certificate issue is the actual user-facing problem

2. Troubleshooting Workflows¶

LoadBalancer has no external IP or traffic never reaches Traefik¶

kubectl -n traefik get svc traefik -o yaml
kubectl -n metallb-system get ipaddresspools,l2advertisements
kubectl -n metallb-system logs deploy/controller --tail=200
kubectl -n metallb-system logs daemonset/speaker --tail=200

Check:

Service type is still LoadBalancer in the active overlay
the requested address is inside the active MetalLB pool
MetalLB controller and speaker are healthy

Pod restarts or Traefik fails to boot¶

kubectl -n traefik logs deploy/traefik --previous
kubectl -n traefik describe pod <traefik-pod-name>
kubectl -n traefik get configmap traefik-config -o yaml

Check:

invalid static or dynamic config rendered into traefik-config
missing provider secrets such as cloudflare or do-auth-token
PVC mount issues on traefik-pvc

Certificate issuance or renewal fails¶

kubectl -n traefik logs deploy/traefik --tail=400 | grep -i acme
kubectl -n traefik get secret cloudflare do-auth-token cloudflare-origin-cert
kubectl -n traefik get pvc traefik-pvc

Check:

DNS provider token secret is present and correct
ACME storage on traefik-pvc is writable
certResolver names in IngressRoute or TLS config still match the active static configuration

Dashboard route fails or redirects endlessly through auth¶

kubectl -n traefik get ingressroute traefik-dashboard -o yaml
kubectl -n traefik get middleware
kubectl -n auth get pods
kubectl -n traefik logs deploy/traefik --tail=200 | grep -i auth

Check:

dashboard host matches the overlay you deployed
Authelia is reachable on the expected internal address
middleware reference uses the correct provider namespace or suffix for the active overlay

3. Disaster Recovery¶

Preconditions¶

Confirm whether the outage is caused by MetalLB reachability, Traefik config, or provider credential issues.
Identify the active overlay for the affected cluster.
Verify the required secret source files and PVC are available.

Stateful workload recovery¶

Traefik state is small but important because ACME data lives on a PVC.

Preserve or snapshot traefik-pvc when possible before destructive actions.
Restore the active overlay secrets and config files.
Reapply the correct overlay with Kustomize.
Restore the PVC if ACME state was lost and certificate continuity matters.
Validate /ping, external IP assignment, and dashboard access.

Cluster rebuild dependency order¶

Cluster networking and MetalLB
Namespace traefik and required secrets
traefik-pvc or replacement certificate strategy
Traefik deployment and CRDs
Protected routes and dashboard validation

4. Scaling and Resource Management¶

Preferred path: adjust the overlay manifests and roll through GitOps.

Use these commands to size the problem before changing resources:

kubectl -n traefik top pod
kubectl -n traefik get deploy traefik -o yaml
kubectl -n traefik describe deploy traefik

Record:

sustained CPU or memory pressure on the ingress pod
request volume versus access log behavior
whether scaling above one replica would conflict with the current ReadWriteOnce ACME storage model

Current guidance: do not increase replicas casually while the ACME state is stored on a single PVC.

5. Maintenance Procedures¶

Rotate DNS provider tokens.
Rotate dashboard credentials and origin certificates.
Review Traefik image and CRD version alignment before upgrades.
Validate dashboard hostnames and cross-cluster ExternalName routes after overlay changes.

For each task, define:

Preconditions: active overlay identified and backup of provider secrets available
Impact window: possible certificate or routing disruption during rollout
Rollback path: revert overlay revision and reapply previous secrets
Validation steps: /ping, dashboard route, and one representative service route succeed

6. Rollback Strategy¶

Document the fastest safe rollback path:

Revert the affected overlay revision.
Restore the previous secretGenerator source files.
If ACME state was corrupted, restore the previous traefik-pvc snapshot or delete the bad state and allow re-issuance when acceptable.

7. Post-Incident Actions¶

After recovery, always:

Update CHANGELOG.md if traffic was manually rerouted, certificates were restored, or secrets were rotated outside the normal workflow.
Update the Traefik service page if hosts, middleware chains, provider strategy, or metrics wiring changed.
Update this runbook if the incident revealed a missing auth, certificate, or MetalLB diagnostic step.
Capture any follow-up work to remove legacy prod overlay assumptions.