K8s Homelab Knowledge Base

lgtm-distributed Runbook

Metadata

Field	Value
Service	lgtm-distributed
Criticality	Tier 1
Owner	Platform / Observability owner
Namespace	monitoring
Clusters	homelab
Last validated	2026-05-20
Related service page	../services/lgtm-distributed.md

Trigger Conditions

Queries against the shared observability backends fail or time out.
Loki, Tempo, Mimir, or bundled Grafana backends are degraded.
Object storage or PVC-backed components show errors.

1. Health Checks

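# List pods, services, and PVCs in the namespace
kubectl -n monitoring get pods,svc,pvc
# Tail recent logs for one component (use statefulset/<component> for StatefulSet-backed components)
kubectl -n monitoring logs deploy/<component> --tail=100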
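
When the namespace holds many pods, it may help to narrow the listing to pods that are not fully ready. The following is a minimal sketch rather than part of the validated procedure: the function name not_ready_pods is hypothetical, and it assumes the default column order printed by kubectl get pods --no-headers.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints NAME, READY, STATUS for pods that are not fully ready.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk '{
    split($2, ready, "/")              # READY looks like "1/2"
    if ($3 == "Completed") next        # finished Job pods are not a fault
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Example invocation against the cluster (not executed here):
#   kubectl -n monitoring get pods --no-headers | not_ready_pods
```

An empty result suggests every pod is Running with all containers ready; any line printed is a candidate for the troubleshooting steps that follow.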

2. Troubleshooting Workflows

Inspect each failing backend component in turn, then confirm that its storage credentials exist and that its PVCs are bound.

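# Show events, container states, and volume mount errors for the failing pod
kubectl -n monitoring describe pod <failing-pod>
# Confirm that the expected storage and backend secrets are present
kubectl -n monitoring get secret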
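
To confirm PVC state quickly, the claim listing can be filtered down to anything that is not Bound. This is a sketch under assumptions: unbound_pvcs is a hypothetical helper name, and it relies only on NAME and STATUS being the first two columns of kubectl get pvc --no-headers.

```shell
# Hypothetical helper: reads `kubectl get pvc --no-headers` output on stdin
# and prints NAME and STATUS for claims that are not Bound.
# NAME and STATUS stay in columns 1 and 2 even when VOLUME is empty.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1, $2 }'
}

# Example invocation (not executed here):
#   kubectl -n monitoring get pvc --no-headers | unbound_pvcs
```

A claim reported as Pending or Lost is worth following up with kubectl describe pvc to read its events.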

3. Disaster Recovery

Restore storage credentials and backend secrets.
Restore object-store or PVC-backed data where needed.
Reconcile lgtm-distributed/prod through Fleet.
Validate Grafana queries against each backend.
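
Before running queries in Grafana, a first-pass readiness probe of each backend can rule out components that have not come back at all. The sketch below is illustrative only: check_backend and FETCH are hypothetical names, the service name and ports are placeholders that must be taken from the deployed chart, and it relies on the /ready HTTP endpoint that Loki, Tempo, and Mimir each expose.

```shell
# FETCH is the command used to request a URL. It defaults to curl and can be
# overridden, for example to stub out network access.
FETCH="${FETCH:-curl -fsS --max-time 5}"

# Hypothetical helper: report whether a readiness URL answers successfully.
# $1 = label for the backend, $2 = URL to request.
check_backend() {
  if $FETCH "$2" >/dev/null 2>&1; then
    echo "$1 ok"
  else
    echo "$1 FAILED"
    return 1
  fi
}

# Example (not executed here), with a port-forward to a backend service
# already running in another terminal; service name and ports are placeholders:
#   kubectl -n monitoring port-forward svc/<backend-service> <local-port>:<service-port>
#   check_backend loki "http://localhost:<local-port>/ready"
```

A passing readiness probe only shows that the process is up; the Grafana query validation in the last step is still needed to confirm that restored data is actually readable.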

4. Scaling and Resource Management

kubectl -n monitoring top pod

Adjust chart values in Git when a backend component saturates CPU, memory, or disk.
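
To spot memory saturation without reading the whole table, the kubectl top output can be filtered against a threshold. This is a rough sketch: pods_over_memory is a hypothetical helper, the 1024 MiB figure in the example is an arbitrary illustration rather than a recommended limit, and the parsing assumes memory is reported with an Mi suffix.

```shell
# Hypothetical helper: reads `kubectl top pod --no-headers` output on stdin
# (columns: NAME CPU MEMORY) and prints pods whose memory exceeds $1 MiB.
# Assumes memory values carry an "Mi" suffix.
pods_over_memory() {
  awk -v limit="$1" '{
    mem = $3
    sub(/Mi$/, "", mem)                # strip the unit, leaving a number
    if (mem + 0 > limit + 0) print $1, $3
  }'
}

# Example invocation (not executed here; kubectl top needs the Metrics API):
#   kubectl -n monitoring top pod --no-headers | pods_over_memory 1024
```

Disk pressure does not appear in kubectl top, so PVC usage still has to be checked separately.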

5. Maintenance Procedures

Review storage and retention settings before upgrades.
Rotate backend secrets and object-store credentials.
Validate bundled component versions together.
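
One way to review the bundled component versions side by side is to list the distinct container images running in the namespace. The following is a small sketch, with distinct_images as a hypothetical helper name; it only summarizes what is deployed and does not check the versions for compatibility.

```shell
# Hypothetical helper: reads whitespace-separated image references on stdin
# and prints each distinct image once, sorted.
distinct_images() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u
}

# Example invocation (not executed here):
#   kubectl -n monitoring get pods \
#     -o jsonpath='{.items[*].spec.containers[*].image}' | distinct_images
```

Comparing this list before and after an upgrade makes it easier to see whether every component moved to the intended version.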

6. Rollback Strategy

Revert the Helm values and Fleet bundle to the last known-good revision.
Restore backend state if a chart upgrade damaged persisted data.

7. Post-Incident Actions

Add a changelog fragment for recovery actions.
Update the service page if backend topology changed.
Extend this runbook with the failing component and fix path.