authelia Runbook
Metadata
| Field | Value |
|---|---|
| Service | authelia |
| Criticality | Tier 1 |
| Owner | Platform / Identity owner |
| Namespace | auth |
| Clusters | homelab, local, jls |
| Last validated | 2026-06-29 |
| Related service page | ../services/authelia.md |
Trigger Conditions
- Protected apps return authentication errors or redirect loops.
- The Authelia portal is unavailable.
- SMTP or identity flows fail.
- MariaDB or session state becomes unhealthy.
1. Health Checks
kubectl -n auth get pods,svc,pvc,ingressroute
kubectl -n auth logs deploy/authelia --tail=200
kubectl -n auth get statefulset,deploy
kubectl -n auth exec deploy/authelia -- authelia validate-config --config /config/configuration.yaml
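The checks above can be chained so that the first failure is reported clearly. This is a minimal sketch, assuming the Deployment is named `authelia` as in the commands above; the function name `check_authelia` and the timeout value are illustrative, not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch: run the key health checks in order and report the first one that fails.
# NS and DEPLOY mirror the names used in this runbook; override them if yours differ.
NS="${NS:-auth}"
DEPLOY="${DEPLOY:-authelia}"

check_authelia() {
  # 1. The Deployment must finish rolling out within the timeout.
  kubectl -n "$NS" rollout status "deploy/$DEPLOY" --timeout=60s \
    || { echo "FAIL: rollout not complete"; return 1; }

  # 2. The configuration inside the running pod must validate.
  kubectl -n "$NS" exec "deploy/$DEPLOY" -- \
    authelia validate-config --config /config/configuration.yaml \
    || { echo "FAIL: configuration invalid"; return 1; }

  echo "OK: $DEPLOY is rolled out and its configuration validates"
}
```

Source the file and call `check_authelia`; a non-zero return code means one of the two checks failed and the message names which.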
2. Troubleshooting Workflows
Check app, DB, and ingress together:
kubectl -n auth describe deploy authelia
kubectl -n auth describe statefulset authelia-db
kubectl -n auth logs statefulset/authelia-db --tail=100
Look for:
- bad DB credentials or missing secrets
- broken Traefik middleware references
- SMTP endpoint or certificate errors
- missing TOTP enrollment after MFA policy changes
- Duo API hostname, integration key, or secret key problems
- stale sessions after Authelia secret rotation
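To triage a long log tail against the list above, a small filter can tag lines by likely cause. This is a sketch only: the keyword patterns are illustrative guesses, not Authelia's actual log messages, and the function name `classify_auth_logs` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: tag log lines with a likely failure category; unmatched lines are dropped.
# The patterns are assumptions for illustration and should be tuned to real log output.
classify_auth_logs() {
  awk '
    { line = tolower($0) }
    line ~ /access denied|sql|mysql|mariadb/ { print "db: " $0; next }
    line ~ /smtp|x509|certificate/           { print "smtp/tls: " $0; next }
    line ~ /duo/                             { print "duo: " $0; next }
    line ~ /totp|one-time/                   { print "totp: " $0; next }
    line ~ /session|cookie/                  { print "session: " $0; next }
  '
}
```

One way to use it is `kubectl -n auth logs deploy/authelia --tail=200 | classify_auth_logs`. Traefik middleware errors would show up in Traefik's own logs rather than here, so they are not covered by this filter.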
3. Disaster Recovery
- Restore secrets and DB credentials.
- Recover MariaDB state, or restore the PVC snapshot.
- Reconcile the active overlay.
- Validate forward-auth against a protected application.
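The final validation step can be approximated with an unauthenticated request to a protected application. This is a minimal sketch, assuming forward-auth answers such requests with a redirect to the portal or a 401; the function names are hypothetical, and the exact status depends on how the middleware is configured.

```shell
#!/usr/bin/env bash
# Sketch: interpret the HTTP status of an unauthenticated request to a protected app.
# Assumption: a working forward-auth setup challenges with a redirect or a 401.
forward_auth_verdict() {
  case "$1" in
    302|303|307|401) echo "protected" ;;           # challenge issued: forward-auth is in the path
    200)             echo "unprotected" ;;         # served without a challenge
    5??)             echo "auth-backend-error" ;;  # proxy or Authelia is failing
    *)               echo "unknown" ;;
  esac
}

# Fetch only the status code, without following redirects.
probe_forward_auth() {
  local url="$1" code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"
  echo "$url -> $code ($(forward_auth_verdict "$code"))"
}
```

For example, `probe_forward_auth https://app.example.com/` (a placeholder URL). A `200` is not necessarily a fault when the probe runs from a network covered by a trusted-network bypass rule, so run it from a client that should be challenged.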
4. Scaling and Resource Management
Increase app or DB resources in Git if logins or session checks become resource-bound.
Current Git resource profile:
- Authelia: 100m CPU request, 128Mi memory request, 512Mi memory limit.
- Authelia local overlay: 100m CPU request, 128Mi memory request, 1Gi memory limit.
- MariaDB: 50m CPU request, 128Mi memory request, 512Mi memory limit.
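Before changing values in Git, it can help to confirm what the cluster is actually running and how much burst headroom each profile allows. This is a sketch with hypothetical helper names; it handles only `Mi` and `Gi` quantities and assumes the app is the first container in the pod spec.

```shell
#!/usr/bin/env bash
# Sketch: convert a Kubernetes memory quantity (Mi or Gi only) to MiB.
mem_to_mi() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# Print how many times larger the memory limit is than the request (integer division).
mem_burst_ratio() {
  local req lim
  req="$(mem_to_mi "$1")" || return 1
  lim="$(mem_to_mi "$2")" || return 1
  echo $(( lim / req ))
}

# Read the live values to compare against Git (first container assumed to be the app).
live_resources() {
  kubectl -n auth get deploy authelia \
    -o jsonpath='{.spec.template.spec.containers[0].resources}'
}
```

With the profile listed above, `mem_burst_ratio 128Mi 512Mi` prints `4` for the default overlay and `mem_burst_ratio 128Mi 1Gi` prints `8` for the local overlay.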
5. Maintenance Procedures
- Rotate auth and DB secrets.
- Rotate Duo API material through the `authelia` Secret, then validate phone push.
- Confirm TOTP enrollment for users before tightening protected service routes.
- Test SMTP and reset flows after mail changes.
- Recheck middleware names after Traefik refactors.
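For the rotation steps above, the following sketch generates fresh random material and restarts the workload once the new value has been committed. The helper names are hypothetical, and it assumes the Deployment reads its secrets only at startup; how the value reaches the Secret depends on your secret workflow and is not shown.

```shell
#!/usr/bin/env bash
# Sketch: generate a random hex secret of N bytes (default 32, giving 64 hex characters).
new_secret_hex() {
  local bytes="${1:-32}"
  head -c "$bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# After the new value is in place, restart the Deployment and wait for it to settle.
# Assumption: secrets are read at startup, so a restart is needed to pick them up.
restart_authelia() {
  kubectl -n auth rollout restart deploy/authelia &&
    kubectl -n auth rollout status deploy/authelia --timeout=120s
}
```

Expect users to be asked to log in again after rotating Authelia's own secrets, since stale sessions are one of the failure signatures listed under troubleshooting.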
6. Rollback Strategy
- Revert the active overlay to the previous working revision.
- Restore the previous DB snapshot if schema or startup changes failed.
- If users are locked out by MFA policy, temporarily roll back the access-control rule change rather than editing secrets in place.
- Keep trusted-network bypass rules available for internal services that cannot complete a 2FA flow.
7. Post-Incident Actions
- Record manual recovery in a changelog fragment.
- Update the service page if auth flows or dependencies changed.
- Update this runbook with the exact failure signature and fix.