Runbook Strategy

Purpose

Runbooks turn static architecture knowledge into repeatable operator action. They should optimize for fast triage and safe recovery, not for exhaustive prose.

Standard response model

Every runbook should be organized around the same five operational stages:

  1. Detect the problem from alerts, symptoms, or user reports.
  2. Triage scope and blast radius with a short set of standard commands.
  3. Stabilize the service or reduce impact.
  4. Recover safely, including data restore if required.
  5. Review and reconcile, including changelog and documentation updates.
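Because every runbook shares the same five stages, their presence and order can be checked mechanically. The following is a minimal sketch, not part of the existing tooling: it assumes a runbook is plain text in which each stage starts with a heading line whose first word is the stage name. The `Stage` enum and the `stages_in_order` helper are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five operational stages, in the order a runbook should present them."""
    DETECT = 1
    TRIAGE = 2
    STABILIZE = 3
    RECOVER = 4
    REVIEW = 5


def stages_in_order(lines):
    """Return True if each stage appears exactly once as a heading, in order.

    Assumption: a stage heading is any line whose first word, ignoring case
    and a trailing colon, is a stage name (for example "Detect" or "Triage:").
    Body text that happens to start with a stage name would also match, so a
    real check would need a stricter heading convention.
    """
    found = []
    for line in lines:
        words = line.strip().split()
        if not words:
            continue  # skip blank lines
        first = words[0].upper().rstrip(":")
        if first in Stage.__members__:
            found.append(Stage[first])
    # list(Stage) yields the members in definition order.
    return found == list(Stage)
```

A runbook that skips a stage, repeats one, or lists them out of order fails the check, which keeps operators from having to relearn the layout during an incident.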

Runbook classes

Class                          Typical targets                                              Mandatory?
Platform runbooks              fleet, argocd, traefik, calico, metallb, longhorn, velero    Yes
Stateful service runbooks      minio, nextcloud, gitea, vaultwarden, gitlab, databases      Yes
Security and access runbooks   authelia, crowdsec-lapi, cloudflared                         Yes
Utility service runbooks       Tier 2 or Tier 3 utilities                                   Recommended when shared or externally exposed

Minimum content for every runbook

  • Metadata and ownership
  • Health checks and probe verification
  • Common failure-mode diagnostics
  • Rollback or stabilization actions
  • Disaster recovery steps
  • Scaling and resource-tuning guidance
  • Post-incident update requirements
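The minimum-content list above can be expressed as a simple completeness check. This is a sketch under stated assumptions, not the logic of scripts/validate_docs.py, whose implementation is not shown here: it assumes each required section title appears on its own line in the runbook, optionally prefixed with `#` heading markers. The names `REQUIRED_SECTIONS` and `missing_sections` are hypothetical.

```python
# Section titles taken from the minimum-content list; the exact heading
# wording used in real runbooks is an assumption.
REQUIRED_SECTIONS = [
    "Metadata and ownership",
    "Health checks and probe verification",
    "Common failure-mode diagnostics",
    "Rollback or stabilization actions",
    "Disaster recovery steps",
    "Scaling and resource-tuning guidance",
    "Post-incident update requirements",
]


def missing_sections(runbook_text):
    """Return the required section titles that have no matching line.

    Matching ignores case, surrounding whitespace, and leading '#' markers.
    """
    present = {
        line.strip().lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
    }
    return [title for title in REQUIRED_SECTIONS if title.lower() not in present]
```

An empty result means the runbook carries every required section; anything else is the list of sections still to be written, in the order given above.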

Current coverage snapshot

Status as of 2026-07-02:

  • Current runbooks: 23.
  • Existing runbooks are validated by scripts/validate_docs.py.
  • Current service pages without runbooks: actualbudget, mealie, and ollama.
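A snapshot like this one can be regenerated rather than maintained by hand. The sketch below assumes service pages and runbooks live in two separate directories and share a file stem per service (for example `mealie.md` in both). The directory layout, the `.md` suffix, and the `services_without_runbooks` helper are all hypothetical and not taken from the repository.

```python
from pathlib import Path


def services_without_runbooks(service_dir, runbook_dir, suffix=".md"):
    """Return sorted service names that have a service page but no runbook.

    Assumption: a service page and its runbook use the same file stem.
    """
    services = {p.stem for p in Path(service_dir).glob(f"*{suffix}")}
    runbooks = {p.stem for p in Path(runbook_dir).glob(f"*{suffix}")}
    # Set difference: pages that exist without a matching runbook file.
    return sorted(services - runbooks)
```

Running a check like this in the docs pipeline would keep the list of uncovered services from drifting out of date between reviews.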

Current runbook gaps

Create these runbooks before expanding to lower-priority utilities:

  • csi-driver-nfs
  • longhorn
  • velero
  • calico
  • cloudflared
  • crowdsec-lapi
  • minio
  • mimir
  • grafana-agent
  • kube-prometheus-stack

Current runbook set

Operating rule

A service whose outage would cause significant user, data, or platform impact is not operationally complete until it has a validated runbook.