traefik Runbook

Metadata

Field                  Value
Service                traefik
Criticality            Tier 0
Owner                  Platform / Networking owner
Namespace              traefik
Clusters               homelab (via legacy prod overlay), local, jls
Last validated         2026-04-17
Related service page   ../services/traefik.md

Trigger Conditions

  • External services become unreachable or return TLS errors.
  • Dashboard URL stops responding or enters an auth loop.
  • LoadBalancer external IP is missing or wrong.
  • Let's Encrypt renewals fail or certificates expire unexpectedly.

1. Health Checks

Use these commands first to establish scope.

kubectl -n traefik get deploy,svc,pod,pvc,ingressroute,middleware,tlsoption,tlsstore
kubectl -n traefik get pods -o wide
kubectl -n traefik describe pod -l app.kubernetes.io/name=traefik
kubectl -n traefik logs deploy/traefik --tail=200
kubectl -n traefik get svc traefik -o wide

Probe verification

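Traefik serves /ping for both readiness and liveness on container port 80. The port-forward below maps local port 9000 to that port; run it in one terminal and the curl in another.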

kubectl -n traefik describe pod <traefik-pod-name>
kubectl -n traefik port-forward deploy/traefik 9000:80
curl -I http://127.0.0.1:9000/ping

Record:

  • whether the pod is failing readiness, liveness, or both
  • whether /ping is healthy but the Service lacks a LoadBalancer IP
  • whether a middleware or certificate issue is the actual user-facing problem
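To make the first item on that list quicker to record, the readiness flag and restart count can be read straight from the pod status. This is a minimal sketch, not part of the deployed tooling: the function names are hypothetical, and it assumes the Traefik container is the first container in each pod.

```shell
# Print one line per pod in the traefik namespace: name, ready flag, restart count.
# Assumption: the Traefik container is containerStatuses[0].
traefik_pod_summary() {
  kubectl -n traefik get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].ready}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
}

# Classify one pod from its ready flag ($1) and restart count ($2).
# Not ready points at the readiness probe; restarts point at liveness kills or crashes.
classify_pod() {
  if [ "$1" != "true" ] && [ "${2:-0}" -gt 0 ]; then
    echo "readiness and liveness/crash suspected"
  elif [ "$1" != "true" ]; then
    echo "readiness suspected"
  elif [ "${2:-0}" -gt 0 ]; then
    echo "liveness/crash suspected (currently ready)"
  else
    echo "healthy"
  fi
}

# Combine both: one "pod: classification" line per pod.
traefik_pod_report() {
  traefik_pod_summary | while read -r name ready restarts; do
    echo "$name: $(classify_pod "$ready" "$restarts")"
  done
}
```

Running traefik_pod_report gives a starting point only; confirm which probe failed from the Events section of kubectl describe pod.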

2. Troubleshooting Workflows

LoadBalancer has no external IP or traffic never reaches Traefik

kubectl -n traefik get svc traefik -o yaml
kubectl -n metallb-system get ipaddresspools,l2advertisements
kubectl -n metallb-system logs deploy/controller --tail=200
kubectl -n metallb-system logs daemonset/speaker --tail=200

Check:

  • Service type is still LoadBalancer in the active overlay
  • the requested address is inside the active MetalLB pool
  • MetalLB controller and speaker are healthy
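The second check (address inside the pool) is easy to get wrong by eye. The sketch below compares an IPv4 address against a pool entry written as START-END. The helper names are hypothetical, and it assumes the pool uses the range form; entries written as CIDR blocks are not handled here.

```shell
# Convert a dotted-quad IPv4 address to an integer.
# The subshell body "( ... )" keeps the IFS change and positional
# parameters local to this function.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

# Succeed if the address ($1) lies inside the inclusive range "START-END" ($2).
ip_in_range() {
  [ "$(ip_to_int "$1")" -ge "$(ip_to_int "${2%-*}")" ] &&
    [ "$(ip_to_int "$1")" -le "$(ip_to_int "${2#*-}")" ]
}

# Example inputs (run manually):
#   kubectl -n traefik get svc traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
#   kubectl -n metallb-system get ipaddresspools -o jsonpath='{.items[*].spec.addresses[*]}'
```

For example, ip_in_range 192.168.1.50 192.168.1.40-192.168.1.60 succeeds, while an address of 192.168.1.61 fails. These addresses are illustrative, not the homelab's real pool.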

Pod restarts or Traefik fails to boot

kubectl -n traefik logs deploy/traefik --previous
kubectl -n traefik describe pod <traefik-pod-name>
kubectl -n traefik get configmap traefik-config -o yaml

Check:

  • invalid static or dynamic config rendered into traefik-config
  • missing provider secrets such as cloudflare or do-auth-token
  • PVC mount issues on traefik-pvc
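Missing provider secrets can be confirmed in one pass instead of one kubectl call per name. This is a small sketch; the function name is hypothetical, and the secret names passed to it should be whichever ones the active overlay actually expects.

```shell
# Print "missing: NAME" for each named secret that is absent from the
# traefik namespace. Returns non-zero if at least one is missing.
missing_secrets() {
  missing_rc=0
  for secret_name in "$@"; do
    if ! kubectl -n traefik get secret "$secret_name" >/dev/null 2>&1; then
      echo "missing: $secret_name"
      missing_rc=1
    fi
  done
  return "$missing_rc"
}
```

A call such as missing_secrets cloudflare do-auth-token prints nothing when both secrets exist. It only checks presence, not whether the token values are still valid.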

Certificate issuance or renewal fails

kubectl -n traefik logs deploy/traefik --tail=400 | grep -i acme
kubectl -n traefik get secret cloudflare do-auth-token cloudflare-origin-cert
kubectl -n traefik get pvc traefik-pvc

Check:

  • DNS provider token secret is present and correct
  • ACME storage on traefik-pvc is writable
  • certResolver names in IngressRoute or TLS config still match the active static configuration
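For the certResolver check, it can help to list every resolver name that IngressRoutes actually reference and compare that list with the static configuration. The sketch below is one way to do that; the function name is hypothetical, and it only looks at spec.tls.certResolver on IngressRoute objects, not at TLS stores or entry point defaults.

```shell
# List the distinct certResolver names referenced by IngressRoutes in all
# namespaces. Routes without spec.tls.certResolver yield an empty line,
# which is dropped before de-duplication.
used_cert_resolvers() {
  kubectl get ingressroute -A \
    -o jsonpath='{range .items[*]}{.spec.tls.certResolver}{"\n"}{end}' \
    | grep -v '^$' | sort -u
}

# Compare against the resolvers defined in the static config (run manually):
#   kubectl -n traefik get configmap traefik-config -o yaml | grep -i -A3 certificatesresolvers
```

Any name printed by used_cert_resolvers that does not appear under certificatesResolvers in traefik-config is a likely cause of failed issuance.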

Dashboard route fails or redirects endlessly through auth

kubectl -n traefik get ingressroute traefik-dashboard -o yaml
kubectl -n traefik get middleware
kubectl -n auth get pods
kubectl -n traefik logs deploy/traefik --tail=200 | grep -i auth

Check:

  • dashboard host matches the overlay you deployed
  • Authelia is reachable on the expected internal address
  • middleware reference uses the correct provider namespace or suffix for the active overlay
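An auth loop can be confirmed from outside the cluster by letting curl follow redirects with a cap. The sketch below relies on curl exiting with code 47 when the redirect limit is exceeded; the function name and the cap of 10 are illustrative choices, and the dashboard URL must come from the active overlay.

```shell
# Follow redirects for the URL in $1 and report whether curl gave up.
# curl exits with code 47 when --max-redirs is exceeded.
check_redirect_loop() {
  loop_rc=0
  curl -sS -o /dev/null -L --max-redirs 10 "$1" || loop_rc=$?
  case "$loop_rc" in
    0)  echo "ok: redirects resolved" ;;
    47) echo "loop: more than 10 redirects" ;;
    *)  echo "error: curl exit code $loop_rc" ;;
  esac
}
```

A "loop" result points at the middleware or Authelia configuration rather than at Traefik itself; an "error" result usually means DNS, TLS, or reachability should be checked first.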

3. Disaster Recovery

Preconditions

  • Confirm whether the outage is caused by MetalLB reachability, Traefik config, or provider credential issues.
  • Identify the active overlay for the affected cluster.
  • Verify the required secret source files and PVC are available.

Stateful workload recovery

Traefik state is small but important because ACME data lives on a PVC.

  1. Preserve or snapshot traefik-pvc when possible before destructive actions.
  2. Restore the active overlay secrets and config files.
  3. Reapply the correct overlay with Kustomize.
  4. Restore the PVC if ACME state was lost and certificate continuity matters.
  5. Validate /ping, external IP assignment, and dashboard access.
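Steps 3 and 5 above can be sketched as a single helper that reapplies an overlay, waits for the rollout, and confirms that an external IP was assigned. This is an outline under assumptions: the function name is hypothetical, the overlay path depends on the repository layout, and the 180-second timeout is an arbitrary starting value. It does not snapshot or restore the PVC.

```shell
# Reapply the overlay directory given in $1 and wait for Traefik to roll out.
# Prints the external IP on success; returns non-zero on any failure.
reapply_overlay() {
  kubectl apply -k "$1" || return 1
  kubectl -n traefik rollout status deploy/traefik --timeout=180s || return 1
  # An empty value means the LoadBalancer has no address assigned yet.
  lb_ip=$(kubectl -n traefik get svc traefik \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  if [ -z "$lb_ip" ]; then
    echo "no external IP assigned"
    return 1
  fi
  echo "external IP: $lb_ip"
}
```

The /ping and dashboard checks still need to be run afterwards, since a completed rollout only shows that the pod passed its probes.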

Cluster rebuild dependency order

  1. Cluster networking and MetalLB
  2. Namespace traefik and required secrets
  3. traefik-pvc or replacement certificate strategy
  4. Traefik deployment and CRDs
  5. Protected routes and dashboard validation

4. Scaling and Resource Management

Preferred path: adjust the overlay manifests and roll through GitOps.

Use these commands to size the problem before changing resources:

kubectl -n traefik top pod
kubectl -n traefik get deploy traefik -o yaml
kubectl -n traefik describe deploy traefik

Record:

  • sustained CPU or memory pressure on the ingress pod
  • request volume versus access log behavior
  • whether scaling above one replica would conflict with the current ReadWriteOnce ACME storage model

Current guidance: do not raise the replica count while ACME state is stored on a single ReadWriteOnce PVC; change the storage model first.
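A pre-change guard along these lines can catch a replica bump that conflicts with the storage model. It is a sketch with a hypothetical function name; it only inspects the access modes of traefik-pvc and does not know whether the deployment actually mounts it.

```shell
# Check a planned replica count ($1) against the access modes of traefik-pvc.
# Returns non-zero when more than one replica would share ReadWriteOnce storage.
check_replica_plan() {
  pvc_modes=$(kubectl -n traefik get pvc traefik-pvc \
    -o jsonpath='{.spec.accessModes[*]}')
  case "$pvc_modes" in
    *ReadWriteOnce*)
      if [ "$1" -gt 1 ]; then
        echo "unsafe: traefik-pvc is ReadWriteOnce; $1 replicas would contend for ACME storage"
        return 1
      fi ;;
  esac
  echo "ok: $1 replica(s) with access modes: $pvc_modes"
}
```

Calling check_replica_plan 2 before editing the overlay makes the conflict explicit instead of leaving it to be discovered as a stuck pod.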

5. Maintenance Procedures

  • Rotate DNS provider tokens.
  • Rotate dashboard credentials and origin certificates.
  • Review Traefik image and CRD version alignment before upgrades.
  • Validate dashboard hostnames and cross-cluster ExternalName routes after overlay changes.

For each task, define:

  • Preconditions: active overlay identified and backup of provider secrets available
  • Impact window: possible certificate or routing disruption during rollout
  • Rollback path: revert overlay revision and reapply previous secrets
  • Validation steps: /ping, dashboard route, and one representative service route succeed
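The validation step can be scripted so the same checks run after every maintenance task. The sketch below treats any 2xx or 3xx response as a pass, on the assumption that a redirect to the auth portal is expected for protected routes. The function name is hypothetical and the URLs must be supplied by the operator.

```shell
# Print PASS or FAIL per URL based on the HTTP status code.
# 2xx and 3xx count as success; curl reports 000 when it cannot connect.
# Returns non-zero if any URL failed.
validate_urls() {
  validate_rc=0
  for url in "$@"; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url") || true
    case "$code" in
      2??|3??) echo "PASS $code $url" ;;
      *)       echo "FAIL $code $url"; validate_rc=1 ;;
    esac
  done
  return "$validate_rc"
}
```

Pass it the /ping URL (through a port-forward), the dashboard URL, and one representative service URL for the cluster being changed.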

6. Rollback Strategy

Fastest safe rollback path:

  • Revert the affected overlay revision.
  • Restore the previous secretGenerator source files.
  • If ACME state was corrupted, restore the previous traefik-pvc snapshot or delete the bad state and allow re-issuance when acceptable.

7. Post-Incident Actions

After recovery, always:

  1. Update CHANGELOG.md if traffic was manually rerouted, certificates were restored, or secrets were rotated outside the normal workflow.
  2. Update the Traefik service page if hosts, middleware chains, provider strategy, or metrics wiring changed.
  3. Update this runbook if the incident revealed a missing auth, certificate, or MetalLB diagnostic step.
  4. Capture any follow-up work to remove legacy prod overlay assumptions.