Homelab Documentation Blueprint
Objective
The homelab already uses GitOps conventions for manifests and cluster promotion. The missing piece is a professional knowledge base that answers four operational questions quickly:
- What is deployed, where, and why?
- How is each workload configured, exposed, and recovered?
- What changed, when, and with what operational risk?
- Which document should an operator open first during an incident?
This blueprint defines a documentation system that is versioned, reviewable, and aligned with the existing branch and validation workflow.
Authoritative tooling choice
| Tooling option | Role in the target model | Decision |
|---|---|---|
| MkDocs in the repository | Authoritative, reviewable, Git-native knowledge base | Recommended source of truth |
| Wiki | Consumer-friendly published mirror if stakeholders need a browser-first view | Optional secondary publication target |
| Obsidian | Personal note-taking, drafts, architecture ideation | Optional authoring aid, never the source of truth |
The recommended operating model is therefore:
- MkDocs in this repository is the canonical documentation system.
- Git remains the system of record for manifests, docs, and release notes.
- A wiki can be generated or mirrored later if read-only access is needed for non-contributors.
- Personal notes may exist in Obsidian, but any durable operational knowledge must be promoted back into docs/.
Documentation hierarchy
The knowledge base should be organized in layers so that operators can move from global architecture to a single deployment without searching across unrelated notes.
Repository root
|-- WORKFLOW.md          # GitOps and branch strategy
|-- CHANGELOG.md         # Platform-level change history
|-- mkdocs.yml           # Documentation navigation
`-- docs/
    |-- index.md         # Entry point and operating rules
    |-- strategy/        # Global blueprint and governance references
    |-- governance/      # Change management and documentation policy
    |-- layers/          # Infrastructure, cluster, network, storage, apps
    |-- services/        # One page per top-level deployment directory
    |-- runbooks/        # One runbook per critical service or platform area
    `-- templates/       # Mandatory templates for new pages
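The layout above lends itself to a simple automated check. The following is a minimal sketch, not part of the existing validation workflow: `EXPECTED_LAYOUT` and `missing_layout_entries` are hypothetical names, and the paths are copied from the tree.

```python
from pathlib import Path

# Paths taken from the tree above; True marks a directory, False a file.
EXPECTED_LAYOUT = {
    "WORKFLOW.md": False,
    "CHANGELOG.md": False,
    "mkdocs.yml": False,
    "docs/index.md": False,
    "docs/strategy": True,
    "docs/governance": True,
    "docs/layers": True,
    "docs/services": True,
    "docs/runbooks": True,
    "docs/templates": True,
}


def missing_layout_entries(repo_root: Path) -> list[str]:
    """Return expected paths that are absent or of the wrong kind."""
    missing = []
    for relative, is_dir in EXPECTED_LAYOUT.items():
        path = repo_root / relative
        present = path.is_dir() if is_dir else path.is_file()
        if not present:
            missing.append(relative)
    return missing
```

Calling `missing_layout_entries(Path("."))` from the repository root would list whatever part of the hierarchy has not been created yet.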
Layer responsibilities
| Layer | What belongs there | Examples in this repository |
|---|---|---|
| Infrastructure | Node, host, hypervisor, bootstrap, external dependencies | rancher, node-feature-discovery, kube-metrics-server |
| Cluster management | GitOps control plane, namespace model, RBAC, policy, promotion | fleet, argocd, devtron |
| Networking | CNI, ingress, load balancing, DNS, and edge access | calico, metallb, traefik, cloudflared |
| Storage | CSI, PVC classes, backup, snapshots, object storage, restore | longhorn, csi-driver-nfs, democratic-csi, minio, velero |
| Application layer | User-facing and platform services deployed on the clusters | authelia, gitea, nextcloud, loki, vaultwarden |
Required documentation objects
1. Layer pages
Each layer page must answer:
- What components exist in that layer?
- What are the hard dependencies and failure domains?
- Which clusters consume the layer?
- What are the operational risks and recovery expectations?
- Where are the service-specific documents and runbooks?
2. Service pages
Each top-level deployment directory should eventually map to:
- docs/services/SERVICE_NAME.md for the descriptive and architectural reference
- docs/runbooks/SERVICE_NAME.md for operational procedures when the service is critical, stateful, or externally exposed
This mapping keeps the documentation aligned with the repository layout and removes ambiguity when searching for a workload.
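Because the mapping is purely name-based, coverage gaps can be listed mechanically. The sketch below assumes the caller already knows which services need a runbook (passed in as a boolean per service); `expected_docs` and `undocumented` are hypothetical helper names.

```python
from pathlib import Path


def expected_docs(service_name: str, needs_runbook: bool) -> list[Path]:
    """Return the documentation paths a deployment directory maps to."""
    pages = [Path("docs/services") / f"{service_name}.md"]
    if needs_runbook:
        pages.append(Path("docs/runbooks") / f"{service_name}.md")
    return pages


def undocumented(repo_root: Path, services: dict[str, bool]) -> list[str]:
    """List expected pages that do not exist yet.

    `services` maps a deployment directory name to whether it needs a runbook.
    """
    missing = []
    for name, needs_runbook in sorted(services.items()):
        for page in expected_docs(name, needs_runbook):
            if not (repo_root / page).is_file():
                missing.append(page.as_posix())
    return missing
```

For example, `undocumented(Path("."), {"gitea": True, "whoami": False})` would report any of `docs/services/gitea.md`, `docs/runbooks/gitea.md`, and `docs/services/whoami.md` that are still missing.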
3. Platform change history
The root CHANGELOG.md records platform-wide changes and links the operational narrative to GitOps commits and releases.
4. Incident and recovery procedures
Runbooks are not optional add-ons. They are the execution layer for the knowledge base and should exist before a service is considered production-grade in the homelab.
Standard metadata for every service
Every service page should start with the same metadata block so that the service catalogue becomes queryable and consistent.
| Field | Description |
|---|---|
| Service name | Directory and deployment name |
| Business or operational purpose | Why the workload exists |
| Criticality | Tier 0, Tier 1, Tier 2, or Tier 3 |
| Owner | Person or team accountable for changes |
| Clusters | One or more of homelab, local, jls |
| Namespace | Runtime namespace |
| Exposure | Internal only, VPN, LAN, or internet |
| Stateful | Yes or no |
| Backup class | Snapshot, Velero, app-native, or none |
| RPO and RTO | Recovery expectations |
| Dependencies | Database, storage class, ingress, identity, DNS |
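One way to make the metadata block queryable is to model it as a typed record and validate it. The sketch below is illustrative only: the field names, the `ServiceMetadata` class, and the lower-cased value sets are assumptions derived from the table, not an existing schema in the repository.

```python
from dataclasses import dataclass, field

# Allowed values copied from the metadata table; lower-casing them is an
# assumption made here so that comparisons are case-insensitive.
CLUSTERS = {"homelab", "local", "jls"}
EXPOSURES = {"internal only", "vpn", "lan", "internet"}
BACKUP_CLASSES = {"snapshot", "velero", "app-native", "none"}


@dataclass
class ServiceMetadata:
    name: str
    purpose: str
    tier: int  # 0 to 3, per the criticality model
    owner: str
    clusters: list[str]
    namespace: str
    exposure: str
    stateful: bool
    backup_class: str
    rpo: str
    rto: str
    dependencies: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable validation problems (empty when valid)."""
        issues = []
        for label in ("name", "purpose", "owner", "namespace", "rpo", "rto"):
            if not getattr(self, label).strip():
                issues.append(f"{label} is empty")
        if self.tier not in (0, 1, 2, 3):
            issues.append(f"tier must be 0-3, got {self.tier}")
        unknown = sorted({c.lower() for c in self.clusters} - CLUSTERS)
        if not self.clusters:
            issues.append("no cluster listed")
        elif unknown:
            issues.append(f"unknown clusters: {', '.join(unknown)}")
        if self.exposure.lower() not in EXPOSURES:
            issues.append(f"unknown exposure: {self.exposure}")
        if self.backup_class.lower() not in BACKUP_CLASSES:
            issues.append(f"unknown backup class: {self.backup_class}")
        return issues
```

A record with an empty `problems()` list has every mandatory field filled with a value the table allows; whether those values are accurate still needs human review.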
Criticality model
| Tier | Meaning | Typical examples |
|---|---|---|
| Tier 0 | Platform control plane or recovery-critical service; outage blocks broad platform operations | fleet, argocd, traefik, longhorn, metallb, velero |
| Tier 1 | Security, identity, observability, shared stateful services, or high-value user-facing services | authelia, crowdsec-lapi, cloudflared, minio, gitea, nextcloud, kube-prometheus-stack |
| Tier 2 | Important but isolated services with moderate blast radius | seafile, mealie, photoprism, gotify |
| Tier 3 | Low-risk utilities, experiments, demos, or ephemeral tooling | whoami, dashy, test-only services |
The criticality tier determines required documentation depth, recovery expectations, and whether a dedicated runbook is mandatory.
Documentation lifecycle
Documentation should follow the same operational path as the manifests:
Design change
-> feature branch
-> manifests updated
-> service page and runbook updated
-> CHANGELOG entry added under Unreleased
-> PR review on dev
-> validation
-> merge to main
-> release tag and operational promotion
Key rule: a deployment change is incomplete until the operator can explain it from the docs without reading the diff.
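A validation step could approximate this rule by inspecting the list of files changed in a pull request. The sketch below is a heuristic under stated assumptions: any changed path outside `docs/` that is not one of the root files named in this blueprint is treated as a manifest change, and `lifecycle_gaps` and `NON_MANIFEST_FILES` are hypothetical names.

```python
# Root files named in this blueprint that are not deployment manifests.
NON_MANIFEST_FILES = {"CHANGELOG.md", "WORKFLOW.md", "mkdocs.yml"}


def lifecycle_gaps(changed_files: list[str]) -> list[str]:
    """Flag manifest changes that arrive without docs or a changelog entry."""
    manifests_touched = any(
        not path.startswith("docs/") and path not in NON_MANIFEST_FILES
        for path in changed_files
    )
    if not manifests_touched:
        return []  # Documentation-only changes need no companion updates.

    gaps = []
    docs_touched = any(
        path.startswith(("docs/services/", "docs/runbooks/"))
        for path in changed_files
    )
    if not docs_touched:
        gaps.append("no service page or runbook was updated")
    if "CHANGELOG.md" not in changed_files:
        gaps.append("CHANGELOG.md was not updated")
    return gaps
```

A check like this only proves that the files were touched, not that the operator can explain the change from them, so it complements review rather than replacing it.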
Service documentation standard
Every service page must include at least the following sections:
- Service overview: purpose, dependencies, owner, clusters, and criticality.
- Architecture diagram: traffic path, storage dependencies, identity path, and adjacent systems.
- Deployment specifications: source manifests, Helm chart or Kustomize references, namespaces, overlays, and workload kinds.
- Configuration guide: environment variables, ConfigMaps, Secrets sources, rotation rules, and configuration drift notes.
- Access protocols: internal and external URLs, ports, authentication method, TLS termination point, and network restrictions.
Recommended extensions for Tier 0 and Tier 1 services:
- Observability and alert ownership
- Backup and restore notes
- Capacity expectations
- Known failure modes
- Rollback notes
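The mandatory sections can be checked automatically if each one appears as a heading in the service page. The sketch below assumes Markdown ATX headings (`#` to `######`) whose text matches the section names listed above, ignoring case; `missing_sections` is a hypothetical helper.

```python
import re

# Section names copied from the mandatory list above.
REQUIRED_SECTIONS = [
    "Service overview",
    "Architecture diagram",
    "Deployment specifications",
    "Configuration guide",
    "Access protocols",
]

# Matches a Markdown ATX heading and captures its text.
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required sections that have no matching heading."""
    headings = {m.group(1).lower() for m in HEADING.finditer(markdown_text)}
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]
```

The same function could be extended with the recommended Tier 0 and Tier 1 sections once the templates fix their exact heading text.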
Runbook strategy
Runbooks should be structured to reduce mean time to recovery rather than to explain theory.
Every runbook should be organized around five operator stages:
- Detect: which alert, symptom, or dashboard indicates a problem.
- Triage: which initial commands confirm scope and blast radius.
- Stabilize: which immediate actions reduce user impact.
- Recover: how to restore service safely, including data recovery if needed.
- Review: which evidence, changelog entries, or postmortem updates must be recorded.
Separate runbooks should exist for:
- Platform-wide disaster recovery
- Shared storage recovery
- Ingress or identity outages
- Each stateful or internet-facing service
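A small generator can keep new runbooks consistent with the five stages. This is a sketch only: the heading levels, the prompt wording, and the `runbook_skeleton` name are assumptions, and the real template would live under `docs/templates/`.

```python
# Stage names and prompts follow the five operator stages described above.
RUNBOOK_STAGES = {
    "Detect": "Which alert, symptom, or dashboard indicates a problem?",
    "Triage": "Which initial commands confirm scope and blast radius?",
    "Stabilize": "Which immediate actions reduce user impact?",
    "Recover": "How is service restored safely, including data recovery?",
    "Review": "Which evidence, changelog entries, or postmortem updates are recorded?",
}


def runbook_skeleton(service_name: str) -> str:
    """Return a Markdown skeleton with one section per operator stage."""
    lines = [f"# {service_name} runbook", ""]
    for stage, prompt in RUNBOOK_STAGES.items():
        lines += [f"## {stage}", "", f"_{prompt}_", ""]
    return "\n".join(lines)
```

Writing the result of `runbook_skeleton("longhorn")` to `docs/runbooks/longhorn.md` would give the author an empty runbook with the stages in the intended order.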
Change management principles
The homelab should adopt Semantic Versioning (SemVer) at the platform level:
- Major version: breaking architecture changes, destructive migrations, controller swaps, cluster topology changes, default storage-class or ingress behavior changes.
- Minor version: additive capabilities, new services, non-breaking chart upgrades, new overlays, new backup or observability coverage.
- Patch version: bug fixes, hotfixes, probe tuning, security remediations, small resource changes, or low-risk configuration corrections.
This does not replace application versions. Service pages should continue to track upstream chart versions, image tags, and app versions separately.
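The platform version bump follows the usual SemVer arithmetic: a major bump resets minor and patch, and a minor bump resets patch. A minimal sketch, assuming plain `MAJOR.MINOR.PATCH` strings without pre-release or build suffixes (`bump` is a hypothetical helper):

```python
def bump(version: str, change: str) -> str:
    """Return the next platform version for a 'major', 'minor', or 'patch' change."""
    # Raises ValueError if the string is not three dot-separated integers.
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```

Under this scheme, swapping the ingress controller on platform version `1.4.2` would produce `2.0.0`, while adding a new service would produce `1.5.0`.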
Definition of done for documentation
A service is considered documented only when:
- The service page exists and all mandatory fields are filled.
- Dependencies, URLs, namespaces, and manifest paths are current.
- A runbook exists if the service is Tier 0 or Tier 1, stateful, or internet-facing.
- The root CHANGELOG.md reflects recent operational changes affecting the service.
- Another operator can perform first-line diagnostics from the documentation alone.
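The mechanical parts of this checklist can be expressed in code; currency of the content and first-line diagnosability still require human review. The sketch below uses hypothetical names (`runbook_required`, `unmet_criteria`) and assumes the exposure value uses the vocabulary from the metadata table.

```python
def runbook_required(tier: int, stateful: bool, exposure: str) -> bool:
    """A runbook is mandatory for Tier 0/1, stateful, or internet-facing services."""
    return tier in (0, 1) or stateful or exposure.strip().lower() == "internet"


def unmet_criteria(
    *,
    service_page_exists: bool,
    mandatory_fields_filled: bool,
    runbook_exists: bool,
    tier: int,
    stateful: bool,
    exposure: str,
) -> list[str]:
    """Return the mechanically checkable criteria that are not yet satisfied."""
    gaps = []
    if not service_page_exists:
        gaps.append("service page is missing")
    elif not mandatory_fields_filled:
        gaps.append("mandatory fields are incomplete")
    if runbook_required(tier, stateful, exposure) and not runbook_exists:
        gaps.append("runbook is required but missing")
    return gaps
```

An empty result means the service passes the automated portion of the definition of done, not that it is fully documented.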
Adoption guidance
The fastest path to value is not full coverage on day one. Start with control-plane, edge, storage, identity, and backup services; then move to stateful application workloads; then backfill lower-risk utilities.
The implementation roadmap provides the sequencing and acceptance criteria for that rollout.