Homelab Documentation Blueprint

Objective

The homelab already uses GitOps conventions for manifests and cluster promotion. The missing piece is a professional knowledge base that answers four operational questions quickly:

  1. What is deployed, where, and why?
  2. How is each workload configured, exposed, and recovered?
  3. What changed, when, and with what operational risk?
  4. Which document should an operator open first during an incident?

This blueprint defines a documentation system that is versioned, reviewable, and aligned with the existing branch and validation workflow.

Authoritative tooling choice

| Tooling option | Role in the target model | Decision |
| --- | --- | --- |
| MkDocs in the repository | Authoritative, reviewable, Git-native knowledge base | Recommended source of truth |
| Wiki | Consumer-friendly published mirror if stakeholders need a browser-first view | Optional secondary publication target |
| Obsidian | Personal note taking, drafts, architecture ideation | Optional authoring aid, never the source of truth |

The recommended operating model is therefore:

  • MkDocs in this repository is the canonical documentation system.
  • Git remains the system of record for manifests, docs, and release notes.
  • A wiki can be generated or mirrored later if read-only access is needed for non-contributors.
  • Personal notes may exist in Obsidian, but any durable operational knowledge must be promoted back into docs/.

Documentation hierarchy

The knowledge base should be organized in layers so that operators can move from global architecture to a single deployment without searching across unrelated notes.

Repository root
|-- WORKFLOW.md                   # GitOps and branch strategy
|-- CHANGELOG.md                  # Platform-level change history
|-- mkdocs.yml                    # Documentation navigation
`-- docs/
    |-- index.md                  # Entry point and operating rules
    |-- strategy/                 # Global blueprint and governance references
    |-- governance/               # Change management and documentation policy
    |-- layers/                   # Infrastructure, cluster, network, storage, apps
    |-- services/                 # One page per top-level deployment directory
    |-- runbooks/                 # One runbook per critical service or platform area
    `-- templates/                # Mandatory templates for new pages
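
The layout above can be checked mechanically. The following is a minimal sketch, assuming the check runs from a repository checkout; the helper name `missing_layout_entries` is hypothetical, and only the paths come from the tree above.

```python
from pathlib import Path

# Paths copied from the hierarchy above; the helper itself is a hypothetical sketch.
REQUIRED_FILES = ["WORKFLOW.md", "CHANGELOG.md", "mkdocs.yml", "docs/index.md"]
REQUIRED_DIRS = [
    "docs/strategy",
    "docs/governance",
    "docs/layers",
    "docs/services",
    "docs/runbooks",
    "docs/templates",
]


def missing_layout_entries(repo_root: Path) -> list[str]:
    """Return the required files and directories that are absent under repo_root."""
    missing = [f for f in REQUIRED_FILES if not (repo_root / f).is_file()]
    # Directories are reported with a trailing slash to distinguish them from files.
    missing += [d + "/" for d in REQUIRED_DIRS if not (repo_root / d).is_dir()]
    return missing
```

Calling `missing_layout_entries(Path("."))` from the repository root returns an empty list once the skeleton is in place, which makes it a plausible candidate for a validation step.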

Layer responsibilities

| Layer | What belongs there | Examples in this repository |
| --- | --- | --- |
| Infrastructure | Node, host, hypervisor, bootstrap, external dependencies | rancher, node-feature-discovery, kube-metrics-server |
| Cluster management | GitOps control plane, namespace model, RBAC, policy, promotion | fleet, argocd, devtron |
| Networking | CNI, ingress, load-balancing, DNS and edge access | calico, metallb, traefik, cloudflared |
| Storage | CSI, PVC classes, backup, snapshots, object storage, restore | longhorn, csi-driver-nfs, democratic-csi, minio, velero |
| Application layer | User-facing and platform services deployed on the clusters | authelia, gitea, nextcloud, loki, vaultwarden |

Required documentation objects

1. Layer pages

Each layer page must answer:

  • What components exist in that layer?
  • What are the hard dependencies and failure domains?
  • Which clusters consume the layer?
  • What are the operational risks and recovery expectations?
  • Where are the service-specific documents and runbooks?

2. Service pages

Each top-level deployment directory should eventually map to:

  • docs/services/SERVICE_NAME.md for the descriptive and architectural reference
  • docs/runbooks/SERVICE_NAME.md for operational procedures when the service is critical (Tier 0 or Tier 1), stateful, or externally exposed

This mapping keeps the documentation aligned with the repository layout and removes ambiguity when searching for a workload.
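
Because the mapping is one page per top-level directory, coverage gaps can be listed with a short script. This is a sketch under two assumptions not stated in the blueprint: every non-hidden top-level directory other than docs/ is a deployment, and service pages use the directory name verbatim. The function name and the `ignore` parameter are hypothetical.

```python
from pathlib import Path


def undocumented_services(
    repo_root: Path, ignore: frozenset[str] = frozenset({"docs"})
) -> list[str]:
    """List top-level directories that lack docs/services/<name>.md."""
    services_dir = repo_root / "docs" / "services"
    missing = []
    for entry in sorted(repo_root.iterdir()):
        # Skip plain files, hidden directories (.git, .github), and ignored names.
        if not entry.is_dir() or entry.name.startswith(".") or entry.name in ignore:
            continue
        if not (services_dir / f"{entry.name}.md").is_file():
            missing.append(entry.name)
    return missing
```

Directories that are not deployments can be excluded by passing a larger `ignore` set.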

3. Platform change history

The root CHANGELOG.md records platform-wide changes and links the operational narrative to GitOps commits and releases.

4. Incident and recovery procedures

Runbooks are not optional add-ons. They are the execution layer for the knowledge base and should exist before a service is considered production-grade in the homelab.

Standard metadata for every service

Every service page should start with the same metadata block so that the service catalogue becomes queryable and consistent.

| Field | Description |
| --- | --- |
| Service name | Directory and deployment name |
| Business or operational purpose | Why the workload exists |
| Criticality | Tier 0, Tier 1, Tier 2, or Tier 3 |
| Owner | Person or team accountable for changes |
| Clusters | homelab, local, jls |
| Namespace | Runtime namespace |
| Exposure | Internal only, VPN, LAN, internet |
| Stateful | Yes or no |
| Backup class | Snapshot, Velero, app-native, none |
| RPO and RTO | Recovery expectations |
| Dependencies | Database, storage class, ingress, identity, DNS |
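
One way to make the metadata block queryable is to model it as a typed record with a validation method. The sketch below is illustrative only: the class name, the field names, and the lowercase spellings of the exposure and backup values are assumptions, not a format defined by this blueprint.

```python
from dataclasses import dataclass, field

# Allowed values mirror the table above; lowercase spellings are an assumption.
ALLOWED_TIERS = {0, 1, 2, 3}
ALLOWED_EXPOSURE = {"internal", "vpn", "lan", "internet"}
ALLOWED_BACKUP = {"snapshot", "velero", "app-native", "none"}


@dataclass
class ServiceMetadata:
    name: str
    purpose: str
    tier: int
    owner: str
    clusters: list[str]
    namespace: str
    exposure: str
    stateful: bool
    backup_class: str
    rpo: str
    rto: str
    dependencies: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable validation errors; an empty list means valid."""
        issues = []
        if self.tier not in ALLOWED_TIERS:
            issues.append(f"tier must be 0-3, got {self.tier}")
        if self.exposure not in ALLOWED_EXPOSURE:
            issues.append(f"unknown exposure: {self.exposure!r}")
        if self.backup_class not in ALLOWED_BACKUP:
            issues.append(f"unknown backup class: {self.backup_class!r}")
        if not self.clusters:
            issues.append("at least one cluster is required")
        for attr in ("name", "purpose", "owner", "namespace", "rpo", "rto"):
            if not getattr(self, attr).strip():
                issues.append(f"{attr} must not be empty")
        return issues
```

A catalogue build could parse each page's metadata into this record and reject pages whose `problems()` list is non-empty.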

Criticality model

| Tier | Meaning | Typical examples |
| --- | --- | --- |
| Tier 0 | Platform control plane or recovery-critical service; outage blocks broad platform operations | fleet, argocd, traefik, longhorn, metallb, velero |
| Tier 1 | Security, identity, observability, shared stateful services, or high-value user-facing services | authelia, crowdsec-lapi, cloudflared, minio, gitea, nextcloud, kube-prometheus-stack |
| Tier 2 | Important but isolated services with moderate blast radius | seafile, mealie, photoprism, gotify |
| Tier 3 | Low-risk utilities, experiments, demos, or ephemeral tooling | whoami, dashy, test-only services |

The criticality tier determines required documentation depth, recovery expectations, and whether a dedicated runbook is mandatory.

Documentation lifecycle

Documentation should follow the same operational path as the manifests:

Design change
  -> feature branch
  -> manifests updated
  -> service page and runbook updated
  -> CHANGELOG entry added under Unreleased
  -> PR review on dev
  -> validation
  -> merge to main
  -> release tag and operational promotion

Key rule: a deployment change is incomplete until the operator can explain it from the docs without reading the diff.
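
The "CHANGELOG entry added under Unreleased" step lends itself to an automated check during PR validation. The sketch below assumes the changelog uses second-level Markdown headings such as `## [Unreleased]` and bullet entries, in the style of Keep a Changelog; this blueprint does not prescribe that format, so treat the parsing rules and the function name as assumptions.

```python
import re


def unreleased_entries(changelog: str) -> list[str]:
    """Return the bullet lines found under the 'Unreleased' heading."""
    entries: list[str] = []
    in_unreleased = False
    for line in changelog.splitlines():
        # Only second-level headings open or close a release section;
        # deeper headings such as '### Changed' stay inside the current one.
        heading = re.match(r"^##\s+\[?([^\]]+)\]?", line)
        if heading:
            in_unreleased = heading.group(1).strip().lower() == "unreleased"
            continue
        if in_unreleased and line.lstrip().startswith(("-", "*")):
            entries.append(line.strip())
    return entries
```

A validation job could fail a pull request that touches manifests while `unreleased_entries` returns an empty list.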

Service documentation standard

Every service page must include at least the following sections:

  1. Service overview: purpose, dependencies, owner, clusters, and criticality.
  2. Architecture diagram: traffic path, storage dependencies, identity path, and adjacent systems.
  3. Deployment specifications: source manifests, Helm chart or Kustomize references, namespaces, overlays, and workload kinds.
  4. Configuration guide: environment variables, ConfigMaps, Secrets sources, rotation rules, and configuration drift notes.
  5. Access protocols: internal and external URLs, ports, authentication method, TLS termination point, and network restrictions.

Recommended extensions for Tier 0 and Tier 1 services:

  • Observability and alert ownership
  • Backup and restore notes
  • Capacity expectations
  • Known failure modes
  • Rollback notes
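
The section requirements above can be turned into a simple page check. This sketch assumes each section appears as a Markdown heading whose text matches the section name exactly (case-insensitive); the function names are hypothetical. Mandatory gaps and recommended Tier 0 and Tier 1 gaps are reported separately, since only the former are required.

```python
MANDATORY_SECTIONS = [
    "Service overview",
    "Architecture diagram",
    "Deployment specifications",
    "Configuration guide",
    "Access protocols",
]
RECOMMENDED_TIER_0_1 = [
    "Observability and alert ownership",
    "Backup and restore notes",
    "Capacity expectations",
    "Known failure modes",
    "Rollback notes",
]


def headings(page: str) -> set[str]:
    """Collect lowercase heading titles from Markdown lines starting with '#'."""
    return {
        line.lstrip("#").strip().lower()
        for line in page.splitlines()
        if line.startswith("#")
    }


def section_gaps(page: str, tier: int) -> tuple[list[str], list[str]]:
    """Return (missing mandatory sections, missing recommended sections)."""
    found = headings(page)
    missing = [s for s in MANDATORY_SECTIONS if s.lower() not in found]
    # Recommended extensions only apply to Tier 0 and Tier 1 services.
    advised = (
        [s for s in RECOMMENDED_TIER_0_1 if s.lower() not in found]
        if tier <= 1
        else []
    )
    return missing, advised
```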

Runbook strategy

Runbooks should be structured to reduce mean time to recovery rather than to explain theory.

Every runbook should be organized around five operator stages:

  1. Detect: what alert, symptom, or dashboard indicates a problem.
  2. Triage: what first commands confirm scope and blast radius.
  3. Stabilize: what immediate actions reduce user impact.
  4. Recover: how to restore service safely, including data recovery if needed.
  5. Review: what evidence, changelog entries, or postmortem updates must be recorded.
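
The five stages map naturally onto a runbook template. As a sketch, the generator below emits a Markdown skeleton with one heading and one guiding question per stage; the function name and the heading format are assumptions, and the actual template in docs/templates/ may differ.

```python
# Stage names and prompts paraphrase the five operator stages listed above.
STAGES = [
    ("Detect", "What alert, symptom, or dashboard indicates a problem?"),
    ("Triage", "What first commands confirm scope and blast radius?"),
    ("Stabilize", "What immediate actions reduce user impact?"),
    ("Recover", "How is service restored safely, including data recovery if needed?"),
    ("Review", "What evidence, changelog entries, or postmortem updates must be recorded?"),
]


def runbook_skeleton(service: str) -> str:
    """Build a Markdown runbook outline with one section per operator stage."""
    lines = [f"# {service} runbook", ""]
    for number, (stage, prompt) in enumerate(STAGES, start=1):
        lines += [f"## {number}. {stage}", "", prompt, ""]
    return "\n".join(lines)
```

Writing the result of `runbook_skeleton("longhorn")` to docs/runbooks/longhorn.md would give an operator a consistent starting point to fill in.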

Separate runbooks should exist for:

  • Platform-wide disaster recovery
  • Shared storage recovery
  • Ingress or identity outages
  • Each stateful or internet-facing service

Change management principles

The homelab should adopt Semantic Versioning (SemVer) at the platform level:

  • Major version: breaking architecture changes, destructive migrations, controller swaps, cluster topology changes, default storage-class or ingress behavior changes.
  • Minor version: additive capabilities, new services, non-breaking chart upgrades, new overlays, new backups or observability coverage.
  • Patch version: bug fixes, hotfixes, probe tuning, security remediations, small resource changes, or low-risk configuration corrections.

This does not replace application versions. Service pages should continue to track upstream chart versions, image tags, and app versions separately.
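
The bump rules can be expressed as a small function. This is a minimal sketch that handles plain `MAJOR.MINOR.PATCH` strings only (no pre-release or build metadata); the function name and the `change` labels are hypothetical.

```python
def bump(version: str, change: str) -> str:
    """Return the next platform version for a 'major', 'minor', or 'patch' change."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, patch = map(int, parts)
    if change == "major":
        # Breaking changes reset the lower components.
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change class: {change!r}")
```

For example, a controller swap on platform version 1.4.2 would yield `bump("1.4.2", "major")`, which is 2.0.0, while probe tuning would yield 1.4.3.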

Definition of done for documentation

A service is considered documented only when:

  1. The service page exists and all mandatory fields are filled.
  2. Dependencies, URLs, namespaces, and manifest paths are current.
  3. A runbook exists if the service is Tier 0 or Tier 1, stateful, or internet-facing.
  4. The root CHANGELOG.md reflects recent operational changes affecting the service.
  5. Another operator can perform first-line diagnostics from the documentation alone.
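
Criterion 3 is a plain boolean rule and can be encoded directly. The sketch below assumes exposure is recorded as one of the strings from the metadata table, with "internet" marking internet-facing services; the function name is hypothetical.

```python
def runbook_required(tier: int, stateful: bool, exposure: str) -> bool:
    """A runbook is mandatory for Tier 0/1, stateful, or internet-facing services."""
    return tier <= 1 or stateful or exposure.strip().lower() == "internet"
```

Under this rule a stateless, internal-only Tier 3 utility needs no runbook, while any Tier 0 service needs one regardless of state or exposure.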

Adoption guidance

The fastest path to value is not full coverage on day one. Start with control-plane, edge, storage, identity, and backup services; then move to stateful application workloads; then backfill lower-risk utilities.

The implementation roadmap provides the sequencing and acceptance criteria for that rollout.