loki Runbook¶
Metadata¶
| Field | Value |
|---|---|
| Service | loki |
| Criticality | Tier 1 |
| Owner | Platform / Observability owner |
| Namespace | loki |
| Clusters | prod, jls, layer7, oci, oci-free |
| Last validated | 2026-05-20 |
| Related service page | ../services/loki.md |
Trigger Conditions¶
- Log ingestion stops.
- Queries fail or time out.
- Object-store auth errors appear.
- Gateway or ingress routes return errors.
1. Health Checks¶
2. Troubleshooting Workflows¶
Check the failing path first: gateway, ingestion, storage, or query.
kubectl -n loki describe pod <failing-pod>
kubectl -n loki get secret
kubectl -n loki logs <failing-pod> --tail=200
Inspect object-store credentials, compactor state, and ingress health.
3. Disaster Recovery¶
- Restore storage credentials and object-store access.
- Restore PVC-backed state if the deployment mode uses local storage.
- Reconcile the environment-specific Loki path.
- Validate ingestion and query APIs.
4. Scaling and Resource Management¶
Tune resources or replica counts in Git for the saturated component only.
5. Maintenance Procedures¶
- Review retention and compaction settings.
- Rotate storage credentials.
- Validate the active Loki mode before applying chart upgrades.
6. Rollback Strategy¶
- Revert the environment-specific Loki revision.
- Restore storage state if an upgrade or mode change corrupted ingestion.
7. Post-Incident Actions¶
- Capture manual recovery in a changelog fragment.
- Update the service page if deployment modes or clusters changed.
- Add the exact failure signature to this runbook.