loki Runbook

Metadata

Field                 Value
Service               loki
Criticality           Tier 1
Owner                 Platform / Observability owner
Namespace             loki
Clusters              prod, jls, layer7, oci, oci-free
Last validated        2026-05-20
Related service page  ../services/loki.md

Trigger Conditions

  • Log ingestion stops.
  • Queries fail or time out.
  • Object-store auth errors appear.
  • Gateway or ingress routes return errors.

1. Health Checks

kubectl -n loki get pods,svc,pvc,ingressroute
kubectl -n loki logs deploy/loki-gateway --tail=200
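
When the pod list is long, a small filter can surface only the pods that are not fully ready. The sketch below is not part of the original runbook; the helper name `unready_pods` is hypothetical, and it assumes the default column layout of `kubectl get pods --no-headers` (name in column 1, READY as `x/y` in column 2). Pods from completed jobs also show as `0/1` and would be listed.

```shell
# Print the names of pods whose READY column (for example 1/2) shows fewer
# ready containers than expected.
# Reads `kubectl get pods --no-headers` output on stdin.
unready_pods() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage against the cluster:
#   kubectl -n loki get pods --no-headers | unready_pods
```

An empty result suggests the problem is not at the pod-readiness level, so continue with the gateway logs.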

2. Troubleshooting Workflows

Identify the failing path first (gateway, ingestion, storage, or query), then inspect the pods on that path.

kubectl -n loki describe pod <failing-pod>
kubectl -n loki get secret
kubectl -n loki logs <failing-pod> --tail=200

Inspect object-store credentials, compactor state, and ingress health.
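
To confirm whether a failure is an object-store auth problem, the pod logs can be filtered for typical error codes. This is a rough sketch that assumes an S3-compatible backend, which the runbook does not state; the helper name `object_store_auth_errors` is hypothetical, and the pattern list covers standard S3 error codes only, so other backends will need different signatures.

```shell
# Keep only log lines that contain common S3-style authentication or
# authorization error codes. Reads log lines on stdin.
# Assumption: the object store reports S3-compatible error codes.
object_store_auth_errors() {
  grep -E -i 'AccessDenied|InvalidAccessKeyId|SignatureDoesNotMatch|ExpiredToken'
}

# Usage against the cluster:
#   kubectl -n loki logs <failing-pod> --tail=200 | object_store_auth_errors
```

Any match points toward the storage credentials rather than the compactor or ingress.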

3. Disaster Recovery

  1. Restore storage credentials and object-store access.
  2. Restore PVC-backed state if the deployment mode uses local storage.
  3. Reconcile the environment-specific Loki path so the cluster matches the state declared in Git.
  4. Validate ingestion and query APIs.
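
The final validation step can be sketched as a push followed by a query against Loki's HTTP API (`/loki/api/v1/push` and `/loki/api/v1/query_range`). The helper name `loki_push_payload`, the `runbook-smoke-test` label, and the `LOKI_URL` variable are assumptions for illustration; multi-tenant setups may also need an `X-Scope-OrgID` header, which is not shown.

```shell
# Build a minimal JSON payload for POST /loki/api/v1/push.
# $1 = timestamp in nanoseconds since the Unix epoch
# $2 = log line (must not contain double quotes or backslashes)
loki_push_payload() {
  printf '{"streams":[{"stream":{"job":"runbook-smoke-test"},"values":[["%s","%s"]]}]}' "$1" "$2"
}

# Usage (LOKI_URL is a placeholder, for example a port-forward to the gateway):
#   ts="$(date +%s)000000000"
#   curl -sS -H 'Content-Type: application/json' \
#     -X POST "$LOKI_URL/loki/api/v1/push" \
#     --data "$(loki_push_payload "$ts" "smoke test")"
#   curl -sS -G "$LOKI_URL/loki/api/v1/query_range" \
#     --data-urlencode 'query={job="runbook-smoke-test"}'
```

If the query returns the pushed line, both the ingestion and query paths are working end to end.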

4. Scaling and Resource Management

kubectl -n loki top pod

Tune resources or replica counts in Git for the saturated component only.
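
To pick out the saturated component, the `top` output can be filtered by a memory threshold. This is a sketch only: the helper name `pods_over_memory` is hypothetical, and it assumes `kubectl top pod --no-headers` reports memory in `Mi` in the third column; lines in other units are skipped.

```shell
# Print pods whose memory usage exceeds a threshold.
# Reads `kubectl top pod --no-headers` output on stdin.
# $1 = threshold in Mi
pods_over_memory() {
  awk -v limit="$1" '
    $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)            # strip the unit suffix
      if (mem + 0 > limit + 0) print $1, $3
    }'
}

# Usage against the cluster:
#   kubectl -n loki top pod --no-headers | pods_over_memory 1024
```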

5. Maintenance Procedures

  • Review retention and compaction settings.
  • Rotate storage credentials.
  • Validate the active Loki mode before applying chart upgrades.

6. Rollback Strategy

  • Revert the environment-specific Loki revision.
  • Restore storage state if an upgrade or mode change corrupted ingestion.

7. Post-Incident Actions

  1. Capture manual recovery in a changelog fragment.
  2. Update the service page if deployment modes or clusters changed.
  3. Add the exact failure signature to this runbook.