Implementation Roadmap

Goal

Keep the GitOps knowledge base current with the actual repository state, while continuing to backfill missing workload documentation in safe, reviewable units.

The repository is no longer at the initial framework stage. The documentation framework, service catalogue, runbook catalogue, changelog fragments, MkDocs navigation, and validation script are all active. The remaining work is inventory-driven backfill and quality-gate expansion.

Current Status Snapshot

Status as of 2026-07-02:

  • MkDocs source lives under docs/; generated site/ output is not edited directly.
  • The catalogue currently contains 26 service pages under docs/services/.
  • The runbook set currently contains 23 runbooks under docs/runbooks/.
  • scripts/validate_docs.py checks every existing service page and every existing runbook for its required sections.
  • fleet is documented as a bootstrap and GitOps support service even though it is not a normal application deployment directory.
  • The current path inventory detects 58 deployment-like top-level directories using fleet.yaml, base/, overlays/, or root kustomization markers.
  • 25 of the 58 detected deployment directories currently have service pages, leaving 33 without one.
  • The documented services without runbooks are actualbudget, mealie, and ollama.
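
The path-inventory heuristic described above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual inventory code: the function names are hypothetical, and the exact kustomization filenames checked at a directory root are an assumption, since the roadmap only says "root kustomization markers".

```python
from pathlib import Path

# Assumed marker names. The roadmap names fleet.yaml, base/, and overlays/;
# the kustomization filenames are a guess at "root kustomization markers".
FILE_MARKERS = ("fleet.yaml", "kustomization.yaml", "kustomization.yml")
DIR_MARKERS = ("base", "overlays")


def is_deployment_like(directory: Path) -> bool:
    """Return True if the directory carries at least one deployment marker."""
    if any((directory / name).is_file() for name in FILE_MARKERS):
        return True
    return any((directory / name).is_dir() for name in DIR_MARKERS)


def detect_deployment_dirs(repo_root: Path) -> list[str]:
    """List top-level, non-hidden directories that look like deployments."""
    return sorted(
        entry.name
        for entry in repo_root.iterdir()
        if entry.is_dir()
        and not entry.name.startswith(".")
        and is_deployment_like(entry)
    )
```

Because the check only looks at markers, directories that hold plain manifests with none of these markers are not detected, which is consistent with the separate list of legacy or manual YAML areas.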

Current service pages:

actualbudget
argocd
authelia
defectdojo
fleet
forgejo
gitea
gitlab
grafana-dashboard
infisical
k8s-monitoring
lgtm-distributed
loki
mealie
metallb
nextcloud
ollama
openvas
prometheus
pulp3
rancher
renovate
semaphore
tailscale-operator
traefik
vaultwarden

Current runbooks:

argocd
authelia
defectdojo
fleet
forgejo
gitea
gitlab
grafana-dashboard
infisical
k8s-monitoring
lgtm-distributed
loki
metallb
nextcloud
openvas
prometheus
pulp3
rancher
renovate
semaphore
tailscale-operator
traefik
vaultwarden
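
The gap between the two lists above (service pages without a runbook) can be derived mechanically rather than maintained by hand. The sketch below assumes one Markdown file per service named after the service, for example docs/services/mealie.md and docs/runbooks/mealie.md, and a catalogue index called index.md in each directory; both conventions and the function names are assumptions for illustration.

```python
from pathlib import Path


def page_names(directory: Path) -> set[str]:
    """Collect page names (file stems) in a docs directory, skipping the index."""
    return {path.stem for path in directory.glob("*.md") if path.stem != "index"}


def services_without_runbooks(docs_root: Path) -> list[str]:
    """Return service pages that have no runbook of the same name."""
    services = page_names(docs_root / "services")
    runbooks = page_names(docs_root / "runbooks")
    return sorted(services - runbooks)
```

Applied to the lists in this snapshot, the set difference is actualbudget, mealie, and ollama.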

Detected deployment directories still missing service pages:

arr-stack
awx-operator
babybuddy
bichon
cloudflared
csi-driver-nfs
democratic-csi
filestash
firefly3
gotify
grocy
headscale
homepage
keel
kitchenowl
longhorn
mimir
minio
monitoring
neko
netbox
nfs-subdir-external-provisioner
node-feature-discovery
open-webui
openclaw
photoprism
phpmyadmin
portainer
rdtclient
repomanager
seafile
uptime-kuma
wazuh

Legacy, support, or manual YAML areas that are not fully covered by the deployment-directory heuristic and still need a documentation decision:

Vui
backuppc
calico
crowdsec-lapi
dashy
grafana-agent
home-assistant
kube-metrics-server
kube-prometheus-stack
oci-watchdog
robusta
velero
whoami

Phase 0: Framework Baseline

Status: complete.

Delivered:

  • MkDocs navigation and docs/ directory structure.
  • Platform blueprint and change-management policy.
  • Application and runbook templates.
  • Service and runbook catalogue indexes.
  • Root changelog fragment workflow under changelogs/.
  • Documentation validation through scripts/validate_docs.py.

Acceptance criteria status:

  • Operators have a standard location for service pages, runbooks, templates, and roadmap material.
  • Pull requests can reference a consistent documentation model.

Phase 1: Tier 0 Platform Services

Objective: cover the services whose failure blocks platform operations or recovery.

Current coverage:

  • Documented with service page and runbook: fleet, argocd, traefik, metallb.
  • Still missing first-class docs from the original Tier 0 list: calico, longhorn, csi-driver-nfs, velero.
  • Related storage and recovery backlog: democratic-csi, nfs-subdir-external-provisioner.

Next safe units:

  1. Create service pages and runbooks for csi-driver-nfs, longhorn, and velero.
  2. Document calico as a legacy/manual network-layer component unless it is intentionally retired.
  3. Decide whether democratic-csi and nfs-subdir-external-provisioner need Tier 0 runbooks or lower-tier service pages.

Acceptance criteria:

  • Another operator can identify how GitOps, ingress, networking, storage, and backup recovery work from repository documentation alone.
  • All active Tier 0 service pages and runbooks are listed in MkDocs navigation and enforced by scripts/validate_docs.py.

Phase 2: Security, Identity, and Edge Controls

Objective: reduce operational risk for internet-facing and access-control components.

Current coverage:

  • Documented with service page and runbook: authelia, rancher, defectdojo, openvas, tailscale-operator, infisical, renovate.
  • Still missing service pages from the original edge-control backlog: cloudflared, crowdsec-lapi.
  • Additional security or edge backlog from the current tree: headscale, wazuh, and whoami (if it is still used for ingress smoke tests).

Acceptance criteria:

  • Authentication flow, ingress exposure, TLS boundaries, and emergency access steps are documented.
  • Manual recovery from edge or auth outages has a tested runbook.

Phase 3: Observability and Shared Stateful Services

Objective: document the systems needed to observe incidents and preserve data.

Current coverage:

  • Documented with service page and runbook: grafana-dashboard, k8s-monitoring, lgtm-distributed, loki, prometheus, forgejo, gitea, gitlab, nextcloud, vaultwarden, pulp3.
  • Documented with service page only: actualbudget, mealie, ollama.
  • Still missing from the original observability and shared-state backlog: kube-prometheus-stack, mimir, grafana-agent, minio.
  • Additional stateful app backlog from the current tree: arr-stack, babybuddy, firefly3, grocy, kitchenowl, netbox, photoprism, portainer, rdtclient, repomanager, seafile, uptime-kuma.

Acceptance criteria:

  • Backup and restore paths are documented.
  • Capacity, retention, and storage dependencies are explicit.
  • Common failure modes have diagnostic commands and rollback notes.

Phase 4: Remaining Application Layer

Objective: bring the rest of the service estate to a consistent baseline.

Current coverage:

  • Documented user-facing or utility applications include actualbudget, mealie, ollama, semaphore, and several source-control and observability services listed above.
  • The current missing service-page backlog is the detected deployment-directory list in the status snapshot.

Approach from this point:

  • Document Tier 2 services in descending order of user impact and data importance.
  • Document Tier 3 utilities opportunistically or when they become shared dependencies.
  • When a service page is added, add a runbook at the same time if the service is stateful, externally exposed, security-sensitive, or recovery-sensitive.
  • Add the service to scripts/validate_docs.py only after its README, service page, and required runbook exist.
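
The runbook rule and the validation gate in the approach above can be expressed as two small predicates. This is a sketch only: ServiceTraits, runbook_required, and ready_for_validation are hypothetical names, and how a service's traits are recorded in the repository is not specified by this roadmap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTraits:
    """Hypothetical flags describing why a service might need a runbook."""
    stateful: bool = False
    externally_exposed: bool = False
    security_sensitive: bool = False
    recovery_sensitive: bool = False


def runbook_required(traits: ServiceTraits) -> bool:
    """A runbook is required if any one of the four traits applies."""
    return any(
        (
            traits.stateful,
            traits.externally_exposed,
            traits.security_sensitive,
            traits.recovery_sensitive,
        )
    )


def ready_for_validation(
    has_readme: bool,
    has_service_page: bool,
    has_runbook: bool,
    traits: ServiceTraits,
) -> bool:
    """A service joins the validator only once all required docs exist."""
    runbook_ok = has_runbook or not runbook_required(traits)
    return has_readme and has_service_page and runbook_ok
```

For example, a stateless internal utility with a README and service page would pass the gate without a runbook, while an externally exposed service would not.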

Acceptance criteria:

  • Every top-level deployment directory has a matching service page.
  • All stateful and externally exposed services have runbooks.

Phase 5: Quality Gates

Objective: make the documentation practice durable.

Current controls:

  • scripts/validate_docs.py checks required service-page and runbook sections.
  • mkdocs.yml navigation includes the service and runbook catalogue.
  • The active Forgejo workflow runs repository validation through Makefile targets.
  • The pull request template references README, service docs, runbooks, changelog fragments, and docs validation.
  • Changelog fragments are stored under changelogs/fragments/.
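
A required-section check of the kind scripts/validate_docs.py performs might look like the following. This is not the script's actual implementation: the heading names in REQUIRED_SECTIONS are placeholders, and the real list of required sections lives in the repository.

```python
import re

# Placeholder headings for illustration; the real required sections are
# defined in scripts/validate_docs.py and may differ.
REQUIRED_SECTIONS = ("Overview", "Dependencies", "Rollback")

# Matches ATX-style Markdown headings such as "## Overview".
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required headings that do not appear in the page."""
    found = {match.group(1).lower() for match in HEADING_RE.finditer(markdown_text)}
    return [name for name in required if name.lower() not in found]
```

A validator built this way can fail the build when any page returns a non-empty list. Note that this simple pattern also matches lines starting with "#" inside fenced code blocks, so a stricter check would need to skip fences.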

Remaining controls to add once broader baseline coverage exists:

  • Pull-request checklist requiring README, service-doc, and changelog-fragment review when manifests change.
  • CI checks for documentation inventory completeness across more deployment directories.
  • Scheduled quarterly documentation review for stale URLs, owners, and recovery steps.
  • Release tagging from main after changelog review.

Acceptance criteria:

  • Documentation freshness is measured and reviewed, not assumed.
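
One way to measure freshness is to record a last-reviewed date per page and flag anything older than the review interval. The sketch below assumes such dates are available as a simple mapping; where they would actually be stored (front matter, a tracking file, or commit history) is not defined by this roadmap, and the 92-day interval is an approximation of "quarterly".

```python
from datetime import date, timedelta

# Assumption: roughly one quarter between reviews.
REVIEW_INTERVAL = timedelta(days=92)


def stale_pages(
    last_reviewed: dict[str, date],
    today: date,
    max_age: timedelta = REVIEW_INTERVAL,
) -> list[str]:
    """Return page names whose last review is older than max_age."""
    return sorted(
        name for name, reviewed in last_reviewed.items() if today - reviewed > max_age
    )
```

Passing today explicitly keeps the function deterministic, which makes it easy to run from a scheduled job and to test.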

Sequencing Principle

Prioritize services in this order whenever time is limited:

  1. Platform recovery blockers
  2. Internet-facing and identity services
  3. Shared stateful workloads
  4. Observability stack
  5. Remaining applications and utilities
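
The ordering above can be applied to a backlog with a simple rank lookup. In this sketch the category labels and the assignment of each service to a category are illustrative assumptions; the roadmap does not define a machine-readable tier field.

```python
# Category labels are hypothetical; their order mirrors the list above.
PRIORITY = (
    "platform-recovery",
    "edge-identity",
    "shared-stateful",
    "observability",
    "application",
)
RANK = {category: index for index, category in enumerate(PRIORITY)}


def order_backlog(backlog: dict[str, str]) -> list[str]:
    """Sort service names by category priority, then alphabetically."""
    return sorted(backlog, key=lambda name: (RANK[backlog[name]], name))


backlog = {
    "mimir": "observability",
    "velero": "platform-recovery",
    "minio": "shared-stateful",
    "cloudflared": "edge-identity",
    "grocy": "application",
    "longhorn": "platform-recovery",
}
print(order_backlog(backlog))
# ['longhorn', 'velero', 'cloudflared', 'minio', 'mimir', 'grocy']
```

An unknown category raises a KeyError here, which is deliberate: a service with no assigned category should be triaged before it is scheduled.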

Next Implementation Wave

Recommended next units:

  1. Storage and recovery: csi-driver-nfs, longhorn, velero, then democratic-csi.
  2. Edge and security gaps: cloudflared, crowdsec-lapi, headscale, wazuh.
  3. Observability gaps: mimir, grafana-agent, kube-prometheus-stack.
  4. Shared stateful services: minio, netbox, seafile, portainer, arr-stack.
  5. Lower-tier utilities and single-purpose apps from the remaining backlog.

Keep the cadence change-scoped: document a workload when it is touched, and prioritize a standalone documentation backfill only when the workload is operationally critical or blocks recovery.