lgtm-distributed Runbook
Metadata
| Field | Value |
|---|---|
| Service | lgtm-distributed |
| Criticality | Tier 1 |
| Owner | Platform / Observability owner |
| Namespace | monitoring |
| Clusters | homelab |
| Last validated | 2026-05-20 |
| Related service page | ../services/lgtm-distributed.md |
Trigger Conditions
- Shared observability queries fail.
- Loki, Tempo, Mimir, or bundled Grafana backends are degraded.
- Object storage or PVC-backed components show errors.
1. Health Checks
2. Troubleshooting Workflows
Check the failing backend components one at a time. For each one, confirm that its storage credentials are valid and that its PVCs are bound.
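The component-by-component check can be sketched as a small triage helper. This is a minimal sketch, not a validated procedure: it assumes you have saved the output of `kubectl get pods -n monitoring -o json` and `kubectl get pvc -n monitoring -o json`, and that the chart sets the standard `app.kubernetes.io/name` and `app.kubernetes.io/component` labels on its pods. The function names are hypothetical.

```python
from collections import defaultdict

# Assumption: the chart applies the Kubernetes recommended labels to its pods.
NAME_LABEL = "app.kubernetes.io/name"
COMPONENT_LABEL = "app.kubernetes.io/component"


def unhealthy_pods_by_component(pod_list: dict) -> dict:
    """Group pods that are not fully ready under a 'name/component' key."""
    failing = defaultdict(list)
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed job pods are not failures
        containers = status.get("containerStatuses", [])
        ready = (
            phase == "Running"
            and bool(containers)
            and all(c.get("ready") for c in containers)
        )
        if not ready:
            labels = meta.get("labels", {})
            key = "{}/{}".format(
                labels.get(NAME_LABEL, "unknown"),
                labels.get(COMPONENT_LABEL, "unknown"),
            )
            failing[key].append(meta.get("name", "<unnamed>"))
    return dict(failing)


def unbound_pvcs(pvc_list: dict) -> list:
    """Return the names of PVCs whose phase is anything other than Bound."""
    return [
        pvc["metadata"]["name"]
        for pvc in pvc_list.get("items", [])
        if pvc.get("status", {}).get("phase") != "Bound"
    ]
```

Load each saved file with `json.load` and pass the resulting dict to the matching function. An empty result from both means the problem is more likely in credentials or the object store than in pod or volume state.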
3. Disaster Recovery
- Restore storage credentials and backend secrets.
- Restore object-store or PVC-backed data where needed.
- Reconcile `lgtm-distributed/prod` through Fleet.
- Validate Grafana queries against each backend.
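Before running real queries in Grafana after a restore, it can help to confirm that each backend reports ready. The sketch below probes the `/ready` endpoints that Loki, Tempo, and Mimir expose, plus Grafana's `/api/health`. The service host names and ports are hypothetical placeholders, not values taken from this deployment; substitute the real in-cluster addresses. A passing probe is only a first gate and does not replace the query validation step.

```python
from typing import Callable, Dict
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Hypothetical service addresses: replace with the real names and ports.
ENDPOINTS: Dict[str, str] = {
    "loki": "http://loki.monitoring.svc:3100/ready",
    "tempo": "http://tempo.monitoring.svc:3200/ready",
    "mimir": "http://mimir.monitoring.svc:8080/ready",
    "grafana": "http://grafana.monitoring.svc:3000/api/health",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 if the endpoint is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (URLError, OSError):
        return 0  # DNS failure, refused connection, or timeout


def check_backends(
    endpoints: Dict[str, str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each backend name to True when its probe returns HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}


def not_ready(results: Dict[str, bool]) -> list:
    """List the backends that failed the probe, in sorted order."""
    return sorted(name for name, ok in results.items() if not ok)
```

The `fetch` parameter is injectable so the logic can be exercised without cluster access. Running `not_ready(check_backends(ENDPOINTS))` from a pod inside the cluster would list the backends to investigate before moving on to Grafana queries.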
4. Scaling and Resource Management
Adjust chart values in Git when a backend component saturates CPU, memory, or disk.
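One way to decide whether a component is saturated is to compare its observed usage against its configured limit. The sketch below parses Kubernetes quantity strings as they appear in `kubectl top pods` output and in chart values (for example `250m` or `512Mi`) and flags a ratio above a threshold. The 0.8 threshold and the function names are illustrative assumptions, not values from this service; only the common suffixes are handled.

```python
# Multipliers for the memory suffixes Kubernetes accepts (binary and decimal).
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> float:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    # Try two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_MEMORY_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * _MEMORY_UNITS[suffix]
    return float(quantity)  # a bare number is already in bytes


def is_saturated(usage: str, limit: str, parse, threshold: float = 0.8) -> bool:
    """Return True when usage / limit meets or exceeds the threshold.

    The default threshold is a hypothetical starting point.
    """
    return parse(usage) / parse(limit) >= threshold
```

For example, `is_saturated("900m", "1", parse_cpu)` compares 0.9 cores of usage against a one-core limit. A component that stays above the threshold is a candidate for a higher request, limit, or replica count in the chart values in Git.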
5. Maintenance Procedures
- Review storage and retention settings before upgrades.
- Rotate backend secrets and object-store credentials.
- Validate bundled component versions together.
6. Rollback Strategy
- Revert the Helm values and Fleet bundle to the last known-good revision.
- Restore backend state if a chart upgrade damaged persisted data.
7. Post-Incident Actions
- Add a changelog fragment for recovery actions.
- Update the service page if backend topology changed.
- Extend this runbook with the failing component and fix path.