Homelab Documentation Blueprint

Objective

The homelab already uses GitOps conventions for manifests and cluster promotion. The missing piece is a knowledge base that lets an operator answer four operational questions quickly:

What is deployed, where, and why?
How is each workload configured, exposed, and recovered?
What changed, when, and with what operational risk?
Which document should an operator open first during an incident?

This blueprint defines a documentation system that is versioned, reviewable, and aligned with the existing branch and validation workflow.

Authoritative tooling choice

Tooling option	Role in the target model	Decision
MkDocs in the repository	Authoritative, reviewable, Git-native knowledge base	Recommended source of truth
Wiki	Consumer-friendly published mirror if stakeholders need a browser-first view	Optional secondary publication target
Obsidian	Personal note taking, drafts, architecture ideation	Optional authoring aid, never the source of truth

The recommended operating model is therefore:

MkDocs in this repository is the canonical documentation system.
Git remains the system of record for manifests, docs, and release notes.
A wiki can be generated or mirrored later if read-only access is needed for non-contributors.
Personal notes may exist in Obsidian, but any durable operational knowledge must be promoted back into docs/.

Documentation hierarchy

The knowledge base should be organized in layers so that operators can move from global architecture to a single deployment without searching across unrelated notes.

Repository root
|-- WORKFLOW.md                   # GitOps and branch strategy
|-- CHANGELOG.md                  # Platform-level change history
|-- mkdocs.yml                    # Documentation navigation
`-- docs/
    |-- index.md                  # Entry point and operating rules
    |-- strategy/                 # Global blueprint and governance references
    |-- governance/               # Change management and documentation policy
    |-- layers/                   # Infrastructure, cluster, network, storage, apps
    |-- services/                 # One page per top-level deployment directory
    |-- runbooks/                 # One runbook per critical service or platform area
    `-- templates/                # Mandatory templates for new pages

Layer responsibilities

Layer	What belongs there	Examples in this repository
Infrastructure	Node, host, hypervisor, bootstrap, external dependencies	rancher, node-feature-discovery, kube-metrics-server
Cluster management	GitOps control plane, namespace model, RBAC, policy, promotion	fleet, argocd, devtron
Networking	CNI, ingress, load-balancing, DNS and edge access	calico, metallb, traefik, cloudflared
Storage	CSI, PVC classes, backup, snapshots, object storage, restore	longhorn, csi-driver-nfs, democratic-csi, minio, velero
Application layer	User-facing and platform services deployed on the clusters	authelia, gitea, nextcloud, loki, vaultwarden

Required documentation objects

1. Layer pages

Each layer page must answer:

What components exist in that layer?
What are the hard dependencies and failure domains?
Which clusters consume the layer?
What are the operational risks and recovery expectations?
Where are the service-specific documents and runbooks?

2. Service pages

Each top-level deployment directory should eventually map to:

docs/services/SERVICE_NAME.md for the descriptive and architectural reference
docs/runbooks/SERVICE_NAME.md for operational procedures when the service is critical, stateful, or externally exposed

This mapping keeps the documentation aligned with the repository layout and removes ambiguity when searching for a workload.
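The directory-to-page mapping lends itself to an automated coverage check. The following is a minimal sketch, assuming that every non-hidden directory at the repository root other than docs/ is a deployment directory; the function name and the exclusion set are hypothetical and would need adjusting to the real layout.

```python
from pathlib import Path

# Root-level directories that are not deployments (an assumed list).
NON_SERVICE_DIRS = {"docs"}


def missing_service_pages(repo_root: Path) -> list[str]:
    """Return top-level deployment directories without a docs/services/<name>.md page."""
    services_dir = repo_root / "docs" / "services"
    missing = []
    for entry in sorted(repo_root.iterdir()):
        # Skip plain files, hidden directories such as .git, and known non-services.
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        if entry.name in NON_SERVICE_DIRS:
            continue
        if not (services_dir / f"{entry.name}.md").is_file():
            missing.append(entry.name)
    return missing
```

Calling `missing_service_pages(Path("."))` from the repository root would list the workloads that still need a service page, which could serve as a simple backlog for the rollout.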

3. Platform change history

The root CHANGELOG.md records platform-wide changes and links the operational narrative to GitOps commits and releases.

4. Incident and recovery procedures

Runbooks are not optional add-ons. They are the execution layer for the knowledge base and should exist before a service is considered production-grade in the homelab.

Standard metadata for every service

Every service page should start with the same metadata block so that the service catalogue becomes queryable and consistent.

Field	Description
Service name	Directory and deployment name
Business or operational purpose	Why the workload exists
Criticality	Tier 0, Tier 1, Tier 2, or Tier 3
Owner	Person or team accountable for changes
Clusters	Clusters that run the service: homelab, local, jls
Namespace	Runtime namespace
Exposure	Internal only, VPN, LAN, internet
Stateful	Yes or no
Backup class	Snapshot, Velero, app-native, none
RPO and RTO	Recovery expectations
Dependencies	Database, storage class, ingress, identity, DNS
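One way to keep the metadata block consistent is to validate it mechanically. The sketch below assumes the block has already been parsed into a dictionary; the snake_case key names are hypothetical, and the allowed values are copied from the table above. The owner and namespace values in the example are placeholders, not real assignments.

```python
# Key names are an assumed encoding of the fields in the metadata table.
REQUIRED_FIELDS = (
    "service_name", "purpose", "criticality", "owner", "clusters", "namespace",
    "exposure", "stateful", "backup_class", "rpo_rto", "dependencies",
)

# Enumerated values taken from the metadata table.
ALLOWED_VALUES = {
    "criticality": {"Tier 0", "Tier 1", "Tier 2", "Tier 3"},
    "exposure": {"Internal only", "VPN", "LAN", "internet"},
    "backup_class": {"Snapshot", "Velero", "app-native", "none"},
}


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the block is well formed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in meta]
    for name, allowed in ALLOWED_VALUES.items():
        if name in meta and meta[name] not in allowed:
            problems.append(f"invalid {name}: {meta[name]!r}")
    if "stateful" in meta and not isinstance(meta["stateful"], bool):
        problems.append("stateful must be a boolean")
    return problems


# Illustrative block for gitea; owner, namespace, and recovery targets are placeholders.
EXAMPLE = {
    "service_name": "gitea",
    "purpose": "Self-hosted Git service",
    "criticality": "Tier 1",
    "owner": "platform-owner",
    "clusters": ["homelab"],
    "namespace": "gitea",
    "exposure": "internet",
    "stateful": True,
    "backup_class": "Velero",
    "rpo_rto": "placeholder",
    "dependencies": ["database", "storage class", "ingress"],
}
```

`validate_metadata(EXAMPLE)` returns an empty list; removing a key or using a value outside the table produces one message per problem.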

Criticality model

Tier	Meaning	Typical examples
Tier 0	Platform control plane or recovery-critical service; outage blocks broad platform operations	fleet, argocd, traefik, longhorn, metallb, velero
Tier 1	Security, identity, observability, shared stateful services, or high-value user-facing services	authelia, crowdsec-lapi, cloudflared, minio, gitea, nextcloud, kube-prometheus-stack
Tier 2	Important but isolated services with moderate blast radius	seafile, mealie, photoprism, gotify
Tier 3	Low-risk utilities, experiments, demos, or ephemeral tooling	whoami, dashy, test-only services

The criticality tier determines required documentation depth, recovery expectations, and whether a dedicated runbook is mandatory.

Documentation lifecycle

Documentation should follow the same operational path as the manifests:

Design change
  -> feature branch
  -> manifests updated
  -> service page and runbook updated
  -> CHANGELOG entry added under Unreleased
  -> PR review on dev
  -> validation
  -> merge to main
  -> release tag and operational promotion
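Part of this path can be enforced during PR review. The sketch below takes the list of paths changed in a pull request and reports lifecycle steps that appear to have been skipped. It assumes that any path outside docs/ and the root documentation files is a manifest; the function name and the sample path in the usage note are hypothetical.

```python
# Root-level files that are documentation rather than manifests (from the
# repository tree; treating everything else as a manifest is an assumption).
DOC_ROOT_FILES = {"CHANGELOG.md", "WORKFLOW.md", "mkdocs.yml"}


def lifecycle_gaps(changed_paths: list[str]) -> list[str]:
    """Report documentation steps that a manifest change appears to have skipped."""
    touches_manifests = any(
        not path.startswith("docs/") and path not in DOC_ROOT_FILES
        for path in changed_paths
    )
    if not touches_manifests:
        return []  # Documentation-only changes have nothing to catch up on.
    gaps = []
    doc_prefixes = ("docs/services/", "docs/runbooks/")
    if not any(path.startswith(doc_prefixes) for path in changed_paths):
        gaps.append("service page or runbook not updated")
    if "CHANGELOG.md" not in changed_paths:
        gaps.append("CHANGELOG entry missing")
    return gaps
```

For example, `lifecycle_gaps(["traefik/values.yaml"])` reports both gaps. The check only looks at paths, so it cannot confirm that the CHANGELOG entry sits under Unreleased or that the page text is accurate; those remain review responsibilities.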

Key rule: a deployment change is incomplete until the operator can explain it from the docs without reading the diff.

Service documentation standard

Every service page must include at least the following sections:

Service overview: purpose, dependencies, owner, clusters, and criticality.
Architecture diagram: traffic path, storage dependencies, identity path, and adjacent systems.
Deployment specifications: source manifests, Helm chart or Kustomize references, namespaces, overlays, and workload kinds.
Configuration guide: environment variables, ConfigMaps, Secrets sources, rotation rules, and configuration drift notes.
Access protocols: internal and external URLs, ports, authentication method, TLS termination point, and network restrictions.

Recommended extensions for Tier 0 and Tier 1 services:

Observability and alert ownership
Backup and restore notes
Capacity expectations
Known failure modes
Rollback notes

Runbook strategy

Runbooks should be structured to reduce mean time to recovery rather than to explain theory.

Every runbook should be organized around five operator stages:

Detect: which alert, symptom, or dashboard indicates a problem.
Triage: which initial commands confirm scope and blast radius.
Stabilize: which immediate actions reduce user impact.
Recover: how to restore service safely, including data recovery if needed.
Review: which evidence, changelog entries, or postmortem updates must be recorded.
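The five stages can be checked structurally. The sketch below assumes each stage appears in the runbook as a Markdown heading whose text is exactly the stage name; that heading convention and the function name are assumptions, not something the templates are known to mandate.

```python
import re

# Stage names in the order an operator is expected to follow them.
STAGES = ("Detect", "Triage", "Stabilize", "Recover", "Review")

# Matches ATX headings such as "## Detect" and captures the heading text.
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def stage_problems(runbook_markdown: str) -> list[str]:
    """Return missing stages and an ordering problem, if any."""
    headings = [match.group(1) for match in HEADING.finditer(runbook_markdown)]
    found = [heading for heading in headings if heading in STAGES]
    problems = [f"missing stage: {stage}" for stage in STAGES if stage not in found]
    # Compare the order of first appearance with the canonical order.
    first_seen = list(dict.fromkeys(found))
    expected = [stage for stage in STAGES if stage in found]
    if first_seen != expected:
        problems.append("stages out of order")
    return problems
```

A runbook with all five headings in sequence yields an empty list, so the function can gate new runbooks without judging their content.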

Separate runbooks should exist for:

Platform-wide disaster recovery
Shared storage recovery
Ingress or identity outages
Each stateful or internet-facing service

Change management principles

The homelab should adopt Semantic Versioning (SemVer) at the platform level:

Major version: breaking architecture changes, destructive migrations, controller swaps, cluster topology changes, default storage-class or ingress behavior changes.
Minor version: additive capabilities, new services, non-breaking chart upgrades, new overlays, new backups or observability coverage.
Patch version: bug fixes, hotfixes, probe tuning, security remediations, small resource changes, or low-risk configuration corrections.

This does not replace application versions. Service pages should continue to track upstream chart versions, image tags, and app versions separately.
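The platform-level bump rules can be illustrated with a small helper. This is a sketch only: the change-kind labels are examples drawn from the categories above, and the function and mapping names are hypothetical.

```python
# Example change kinds mapped to the bump level described above.
CHANGE_LEVEL = {
    "controller swap": "major",
    "destructive migration": "major",
    "new service": "minor",
    "non-breaking chart upgrade": "minor",
    "probe tuning": "patch",
    "security remediation": "patch",
}


def bump(version: str, level: str) -> str:
    """Return the next MAJOR.MINOR.PATCH platform version for a bump level."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"  # Lower components reset on a major bump.
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level}")
```

For instance, `bump("1.4.2", CHANGE_LEVEL["new service"])` gives `"1.5.0"`, while a controller swap from the same starting point gives `"2.0.0"`.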

Definition of done for documentation

A service is considered documented only when:

The service page exists and all mandatory fields are filled.
Dependencies, URLs, namespaces, and manifest paths are current.
A runbook exists if the service is Tier 0 or Tier 1, stateful, or internet-facing.
The root CHANGELOG.md reflects recent operational changes affecting the service.
Another operator can perform first-line diagnostics from the documentation alone.
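The checklist above can be expressed as a single predicate. The sketch below assumes each criterion has already been assessed, by a reviewer or by tooling, and recorded as a boolean; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceDocStatus:
    tier: int                 # Criticality tier, 0 to 3.
    stateful: bool
    internet_facing: bool
    page_complete: bool       # Service page exists with all mandatory fields.
    references_current: bool  # Dependencies, URLs, namespaces, manifest paths.
    runbook_exists: bool
    changelog_current: bool   # Root CHANGELOG.md reflects recent changes.
    peer_diagnosable: bool    # Another operator can do first-line diagnostics.


def runbook_required(status: ServiceDocStatus) -> bool:
    """A runbook is required for Tier 0 or 1, stateful, or internet-facing services."""
    return status.tier in (0, 1) or status.stateful or status.internet_facing


def is_documented(status: ServiceDocStatus) -> bool:
    """Apply every definition-of-done criterion to one service."""
    return (
        status.page_complete
        and status.references_current
        and (status.runbook_exists or not runbook_required(status))
        and status.changelog_current
        and status.peer_diagnosable
    )
```

A stateless, internal Tier 3 utility can pass without a runbook, whereas a Tier 1 service with every other criterion met still fails until its runbook exists.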

Adoption guidance

The fastest path to value is not full coverage on day one. Start with control-plane, edge, storage, identity, and backup services; then move to stateful application workloads; then backfill lower-risk utilities.

The implementation roadmap provides the sequencing and acceptance criteria for that rollout.