lgtm-distributed Runbook

Metadata

Field                 Value
Service               lgtm-distributed
Criticality           Tier 1
Owner                 Platform / Observability owner
Namespace             monitoring
Clusters              homelab
Last validated        2026-05-20
Related service page  ../services/lgtm-distributed.md

Trigger Conditions

  • Shared observability queries fail.
  • Loki, Tempo, Mimir, or bundled Grafana backends are degraded.
  • Object storage or PVC-backed components show errors.

1. Health Checks

kubectl -n monitoring get pods,svc,pvc
kubectl -n monitoring logs deploy/<component> --tail=100
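
To narrow the pod listing above to the pods that need attention, a small filter along these lines may help. This is a sketch rather than part of the validated procedure: the function name `unhealthy_pods` is hypothetical, and it assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: print pods in the monitoring namespace that are
# not fully ready. Assumes the default column layout of `kubectl get pods`.
unhealthy_pods() {
  kubectl -n monitoring get pods --no-headers |
    awk '{
      split($2, ready, "/")            # READY column, e.g. "1/2"
      if ($3 == "Completed") next      # finished Job pods are expected
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
    }'
}
```

Calling `unhealthy_pods` with no arguments prints one line per affected pod (name, ready count, status). Empty output means every pod is Running with all containers ready.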

2. Troubleshooting Workflows

Inspect each failing backend component in turn, then confirm that its storage credentials exist and that its PVCs are Bound.

kubectl -n monitoring describe pod <failing-pod>
kubectl -n monitoring get secret
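
The PVC check described above could be scripted roughly as follows, together with a look at recent namespace events. Both function names (`unbound_pvcs`, `recent_events`) are hypothetical, and the filter assumes the default `kubectl get pvc` column layout, where STATUS is the second column.

```shell
# Hypothetical helper: list PVCs in the monitoring namespace whose STATUS
# column is anything other than Bound (for example Pending or Lost).
unbound_pvcs() {
  kubectl -n monitoring get pvc --no-headers |
    awk '$2 != "Bound" { print $1, $2 }'
}

# Hypothetical helper: show the most recent namespace events (default 20),
# sorted oldest to newest, which tends to surface volume mount failures
# and scheduling or image pull errors.
recent_events() {
  kubectl -n monitoring get events --sort-by=.lastTimestamp |
    tail -n "${1:-20}"
}
```

For example, `unbound_pvcs` should print nothing on a healthy namespace, and `recent_events 50` widens the event window when the default is too short.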

3. Disaster Recovery

  1. Restore storage credentials and backend secrets.
  2. Restore object-store or PVC-backed data where needed.
  3. Reconcile lgtm-distributed/prod through Fleet.
  4. Validate Grafana queries against each backend.
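
Before running Grafana queries in step 4, a quick readiness probe of each backend can rule out components that have not come back yet. The sketch below is an assumption-laden illustration: `backend_ready` is a hypothetical helper, and the URLs passed to it depend on how the services are exposed in this cluster, for instance through a local `kubectl port-forward`. Loki, Tempo, and Mimir each serve a `/ready` endpoint, and Grafana serves `/api/health`.

```shell
# Hypothetical helper: report whether a backend answers its readiness URL.
# curl -f makes HTTP error responses (4xx/5xx) count as failures.
backend_ready() {
  name="$1"
  url="$2"
  if curl -fsS -o /dev/null --max-time 5 "$url"; then
    echo "$name: ready"
  else
    echo "$name: NOT ready"
    return 1
  fi
}
```

A possible invocation, with a placeholder host and port, is `backend_ready loki http://localhost:<port>/ready`. A passing probe only shows that the process is up; the Grafana query check in step 4 is still what confirms that restored data is readable.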

4. Scaling and Resource Management

kubectl -n monitoring top pod

Adjust chart values in Git when a backend component saturates CPU, memory, or disk. kubectl top reports CPU and memory only, so disk saturation must be checked separately.
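
To spot memory saturation quickly, the `kubectl top pod` output can be filtered against a threshold. This is a minimal sketch: `pods_over_memory` is a hypothetical helper, the threshold is whatever the operator chooses, and it assumes memory is reported in Mi in the third column.

```shell
# Hypothetical helper: print pods whose memory usage, as reported by
# `kubectl top pod`, is at or above a threshold given in Mi.
pods_over_memory() {
  limit_mi="$1"
  kubectl -n monitoring top pod --no-headers |
    awk -v limit="$limit_mi" '{
      mem = $3
      sub(/Mi$/, "", mem)              # MEMORY column, e.g. "512Mi"
      if (mem + 0 >= limit + 0) print $1, $3
    }'
}
```

For example, `pods_over_memory 1024` lists pods using 1 GiB or more. Comparing the result with the limits set in the chart values shows which component needs a larger allocation.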

5. Maintenance Procedures

  • Review storage and retention settings before upgrades.
  • Rotate backend secrets and object-store credentials.
  • Validate bundled component versions together.
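
One way to support the version check in the last item is to list the container images actually running and compare them with the versions pinned in the chart. The sketch below uses the standard jsonpath output of `kubectl get pods`; the function name `running_images` is hypothetical.

```shell
# Hypothetical helper: list the distinct container images running in the
# monitoring namespace, one per line.
running_images() {
  kubectl -n monitoring get pods \
    -o jsonpath='{.items[*].spec.containers[*].image}' |
    tr -s '[:space:]' '\n' | sort -u
}
```

Running `running_images` before and after an upgrade gives two lists that can be compared to confirm that every bundled component moved to its intended version. Init containers are not covered by this path expression.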

6. Rollback Strategy

  • Revert the Helm values and Fleet bundle to the last known-good revision.
  • Restore backend state if a chart upgrade damaged persisted data.
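
If the Helm values and Fleet bundle live in the same Git repository, the revert in the first item might look like the sketch below. It assumes a linear history with no merge commits after the known-good revision, and that Fleet picks up the change as an ordinary new commit. `revert_since` is a hypothetical helper, and the revision passed to it must come from the repository's own history.

```shell
# Hypothetical helper: create revert commits for every commit after a
# known-good revision, newest first, without rewriting history.
revert_since() {
  good_rev="$1"
  git revert --no-edit "${good_rev}..HEAD"
}
```

After `revert_since <known-good-revision>`, review the result with `git log` and `git diff <known-good-revision>` (which should be empty) before pushing. Reverting rather than resetting keeps the failed change visible in history for the post-incident review.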

7. Post-Incident Actions

  1. Add a changelog fragment for recovery actions.
  2. Update the service page if backend topology changed.
  3. Extend this runbook with the failing component and fix path.