Runbook Strategy
Purpose
Runbooks turn static architecture knowledge into repeatable operator action. They should optimize for fast triage and safe recovery, not for exhaustive prose.
Standard response model
Every runbook should be organized around the same five operational stages:
- Detect the problem from alerts, symptoms, or user reports.
- Triage scope and blast radius with a short set of standard commands.
- Stabilize the service or reduce impact.
- Recover safely, including data restore if required.
- Review and reconcile, including changelog and documentation updates.
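The stage order above can also be checked mechanically. The following is a minimal sketch, assuming each runbook exposes one heading per stage and that the heading text contains the stage name; `Stage` and `stages_in_order` are hypothetical names for illustration, not part of any existing tooling.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five response stages, in the order a runbook should present them."""

    DETECT = 1
    TRIAGE = 2
    STABILIZE = 3
    RECOVER = 4
    REVIEW = 5


def stages_in_order(headings: list[str]) -> bool:
    """Return True if every stage appears in the headings, in stage order."""
    positions = []
    for stage in Stage:
        # Assumption: a heading mentions the stage name (case-insensitive).
        matches = [
            i for i, heading in enumerate(headings)
            if stage.name.lower() in heading.lower()
        ]
        if not matches:
            return False  # stage is missing entirely
        positions.append(matches[0])
    return positions == sorted(positions)


example = ["Detect", "Triage", "Stabilize", "Recover", "Review and reconcile"]
print(stages_in_order(example))  # True
```

A runbook that lists recovery before detection, or omits a stage, would return `False` under this convention.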
Runbook classes
| Class | Typical targets | Mandatory? |
|---|---|---|
| Platform runbooks | fleet, argocd, traefik, calico, metallb, longhorn, velero | Yes |
| Stateful service runbooks | minio, nextcloud, gitea, vaultwarden, gitlab, databases | Yes |
| Security and access runbooks | authelia, crowdsec-lapi, cloudflared | Yes |
| Utility service runbooks | Tier 2 or Tier 3 utilities | Recommended when shared or externally exposed |
Minimum content for every runbook
- Metadata and ownership
- Health checks and probe verification
- Common failure-mode diagnostics
- Rollback or stabilization actions
- Disaster recovery steps
- Scaling and resource-tuning guidance
- Post-incident update requirements
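As an illustration, the checklist above could be expressed as a heading check over a runbook's Markdown source. This is a sketch under stated assumptions: the keyword mapping in `REQUIRED_SECTIONS` and the function `missing_sections` are hypothetical, and nothing here describes how `scripts/validate_docs.py` actually works.

```python
# Hypothetical mapping: required section -> keywords that may appear in a heading.
REQUIRED_SECTIONS = {
    "metadata": ("metadata", "ownership"),
    "health checks": ("health check", "probe"),
    "diagnostics": ("diagnostic", "failure mode"),
    "rollback": ("rollback", "stabilization"),
    "disaster recovery": ("disaster recovery",),
    "scaling": ("scaling", "resource"),
    "post-incident": ("post-incident",),
}


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required sections that no Markdown heading covers."""
    headings = [
        line.lstrip("#").strip().lower()
        for line in markdown_text.splitlines()
        if line.startswith("#")
    ]
    missing = []
    for name, keywords in REQUIRED_SECTIONS.items():
        covered = any(keyword in heading for heading in headings for keyword in keywords)
        if not covered:
            missing.append(name)
    return missing


sample = """# gitea runbook
## Metadata and ownership
## Health checks
## Common failure-mode diagnostics
## Rollback
## Disaster recovery
"""
print(missing_sections(sample))  # ['scaling', 'post-incident']
```

Matching on keywords rather than exact titles keeps the check tolerant of small wording differences between runbooks, at the cost of occasional false positives.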
Current coverage snapshot
Status as of 2026-07-02:
- Current runbooks: 23.
- Existing runbooks are enforced by `scripts/validate_docs.py`.
- Current service pages without runbooks: `actualbudget`, `mealie`, and `ollama`.
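A list of service pages without runbooks can be derived by comparing file names. The sketch below assumes one Markdown file per service and one per runbook, sharing the same stem; the directory layout and the function `pages_without_runbooks` are assumptions for illustration, not the repository's confirmed structure.

```python
import tempfile
from pathlib import Path


def pages_without_runbooks(services_dir: Path, runbooks_dir: Path) -> list[str]:
    """Return service page names that have no runbook with the same file stem."""
    services = {path.stem for path in services_dir.glob("*.md")}
    runbooks = {path.stem for path in runbooks_dir.glob("*.md")}
    return sorted(services - runbooks)


# Demonstration against a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "services").mkdir()
    (root / "runbooks").mkdir()
    for name in ("gitea", "mealie", "ollama"):
        (root / "services" / f"{name}.md").touch()
    (root / "runbooks" / "gitea.md").touch()
    result = pages_without_runbooks(root / "services", root / "runbooks")

print(result)  # ['mealie', 'ollama']
```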
Current runbook gaps
Create these runbooks before expanding to lower-priority utilities:
- csi-driver-nfs
- longhorn
- velero
- calico
- cloudflared
- crowdsec-lapi
- minio
- mimir
- grafana-agent
- kube-prometheus-stack
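Part of this gap list follows directly from the class table: any mandatory target without a runbook is a gap. The sketch below hard-codes the named targets from the table and the current runbook set from this page; `mandatory_gaps` is a hypothetical helper, and the generic "databases" entry is left out because it does not name a specific service.

```python
# Named mandatory targets from the class table ("databases" omitted as generic).
MANDATORY_TARGETS = {
    "platform": {"fleet", "argocd", "traefik", "calico", "metallb", "longhorn", "velero"},
    "stateful": {"minio", "nextcloud", "gitea", "vaultwarden", "gitlab"},
    "security": {"authelia", "crowdsec-lapi", "cloudflared"},
}

# Current runbook set as listed on this page.
CURRENT_RUNBOOKS = {
    "argocd", "authelia", "defectdojo", "Fleet", "forgejo", "gitea", "gitlab",
    "grafana-dashboard", "infisical", "k8s-monitoring", "lgtm-distributed",
    "loki", "Traefik", "MetalLB", "nextcloud", "openvas", "prometheus", "pulp3",
    "rancher", "renovate", "semaphore", "tailscale-operator", "vaultwarden",
}


def mandatory_gaps(targets: dict[str, set[str]], existing: set[str]) -> dict[str, list[str]]:
    """Return, per class, the mandatory targets that have no runbook yet."""
    existing_lower = {name.lower() for name in existing}  # names differ in case
    gaps = {}
    for runbook_class, names in targets.items():
        missing = sorted(names - existing_lower)
        if missing:
            gaps[runbook_class] = missing
    return gaps


gaps = mandatory_gaps(MANDATORY_TARGETS, CURRENT_RUNBOOKS)
print(gaps)
# {'platform': ['calico', 'longhorn', 'velero'], 'stateful': ['minio'],
#  'security': ['cloudflared', 'crowdsec-lapi']}
```

The six services this produces all appear in the gap list; the remaining four (csi-driver-nfs, mimir, grafana-agent, kube-prometheus-stack) are not named in the class table, so they would have to be tracked separately.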
Current runbook set
- argocd
- authelia
- defectdojo
- Fleet
- forgejo
- gitea
- gitlab
- grafana-dashboard
- infisical
- k8s-monitoring
- lgtm-distributed
- loki
- Traefik
- MetalLB
- nextcloud
- openvas
- prometheus
- pulp3
- rancher
- renovate
- semaphore
- tailscale-operator
- vaultwarden
Operating rule
If the unavailability of a service would cause significant user, data, or platform impact, that service should not be considered operationally complete until it has a validated runbook.