Runbook Strategy

Purpose

Runbooks turn static architecture knowledge into repeatable operator action. They should optimize for fast triage and safe recovery, not for exhaustive prose.

Standard response model

Every runbook should be organized around the same five operational stages:

1. Detect the problem from alerts, symptoms, or user reports.
2. Triage scope and blast radius with a short set of standard commands.
3. Stabilize the service or reduce impact.
4. Recover safely, including data restore if required.
5. Review and reconcile, including changelog and documentation updates.
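
The five stages can be modeled as an ordered enumeration so that templates or tooling always walk them in the same sequence. The following is a minimal sketch only; the `Stage` enum and the `next_stage` and `stage_order` helpers are hypothetical names chosen for illustration, not part of any existing tooling.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """The five operational stages, in the order a runbook presents them."""
    DETECT = 1
    TRIAGE = 2
    STABILIZE = 3
    RECOVER = 4
    REVIEW = 5


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the final stage."""
    if current is Stage.REVIEW:
        return None
    return Stage(current + 1)


def stage_order() -> list[str]:
    """Return stage names in execution order, e.g. for use as template headings."""
    return [stage.name.title() for stage in Stage]
```

An integer-backed enum keeps the ordering explicit, so a template generator could emit one heading per stage without hard-coding the sequence in several places.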

Runbook classes

Class	Typical targets	Mandatory?
Platform runbooks	fleet, argocd, traefik, calico, metallb, longhorn, velero	Yes
Stateful service runbooks	minio, nextcloud, gitea, vaultwarden, gitlab, databases	Yes
Security and access runbooks	authelia, crowdsec-lapi, cloudflared	Yes
Utility service runbooks	Tier 2 or Tier 3 utilities	Recommended when shared or externally exposed
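
The table above can be expressed as a small lookup that answers whether a given service needs a runbook. This is a sketch under stated assumptions: the function names are hypothetical, any service not listed in a mandatory class is treated as a utility, and the "optional" result for an unshared, unexposed utility is an assumption, since the table only says when a utility runbook is recommended. The generic "databases" target is omitted because it does not name a specific service.

```python
from typing import Optional

# Targets copied from the table above. "databases" is left out because it is
# a category rather than a concrete service name.
MANDATORY_CLASSES: dict[str, frozenset[str]] = {
    "platform": frozenset(
        {"fleet", "argocd", "traefik", "calico", "metallb", "longhorn", "velero"}
    ),
    "stateful": frozenset({"minio", "nextcloud", "gitea", "vaultwarden", "gitlab"}),
    "security": frozenset({"authelia", "crowdsec-lapi", "cloudflared"}),
}


def runbook_class(service: str) -> Optional[str]:
    """Return the mandatory class a service belongs to, or None if unlisted."""
    for name, targets in MANDATORY_CLASSES.items():
        if service in targets:
            return name
    return None


def runbook_requirement(service: str, shared_or_exposed: bool = False) -> str:
    """Classify the runbook requirement for a service.

    Assumption: unlisted services are utilities, and a utility that is neither
    shared nor externally exposed is reported as "optional".
    """
    if runbook_class(service) is not None:
        return "mandatory"
    return "recommended" if shared_or_exposed else "optional"
```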

Minimum content for every runbook

Metadata and ownership
Health checks and probe verification
Common failure-mode diagnostics
Rollback or stabilization actions
Disaster recovery steps
Scaling and resource-tuning guidance
Post-incident update requirements
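
A completeness check against the list above might look like the following sketch. It assumes each required item appears in the runbook as a heading on its own line, optionally prefixed with `#` characters; that convention and the `missing_sections` helper are assumptions made for illustration. It does not reproduce the logic of scripts/validate_docs.py, whose contents are not shown here.

```python
REQUIRED_SECTIONS = (
    "Metadata and ownership",
    "Health checks and probe verification",
    "Common failure-mode diagnostics",
    "Rollback or stabilization actions",
    "Disaster recovery steps",
    "Scaling and resource-tuning guidance",
    "Post-incident update requirements",
)


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required section titles that never appear as a line of their own.

    Matching is case-insensitive and ignores leading '#' heading markers.
    """
    lines = {
        line.strip().lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
    }
    return [title for title in REQUIRED_SECTIONS if title.lower() not in lines]
```

An empty result means every required section is present. The check says nothing about whether the content under each heading is adequate.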

Current coverage snapshot

Status as of 2026-07-02:

Current runbooks: 23.
Existing runbooks are enforced by scripts/validate_docs.py.
Current service pages without runbooks: actualbudget, mealie, and ollama.
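
The list of service pages without runbooks is a set difference, which could be recomputed rather than maintained by hand. The sketch below assumes service pages and runbooks are Markdown files that share a file stem; the directory paths in the usage note and both helper names are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def page_names(directory: Path) -> set[str]:
    """Collect page names (file stems) from the Markdown files in a directory."""
    return {path.stem for path in directory.glob("*.md")}


def services_without_runbooks(
    services: Iterable[str], runbooks: Iterable[str]
) -> list[str]:
    """Return the service names that have no matching runbook, sorted by name."""
    return sorted(set(services) - set(runbooks))
```

With a hypothetical layout, a call such as `services_without_runbooks(page_names(Path("docs/services")), page_names(Path("docs/runbooks")))` would produce the list to publish in the snapshot.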

Current runbook gaps

Create these runbooks before expanding to lower-priority utilities:

csi-driver-nfs
longhorn
velero
calico
cloudflared
crowdsec-lapi
minio
mimir
grafana-agent
kube-prometheus-stack

Current runbook set

Operating rule

A service whose outage would cause significant user, data, or platform impact is not operationally complete until it has a validated runbook.