K8s Homelab Knowledge Base

loki Runbook

Metadata

Field	Value
Service	loki
Criticality	Tier 1
Owner	Platform / Observability owner
Namespace	loki
Clusters	prod, jls, layer7, oci, oci-free
Last validated	2026-05-20
Related service page	../services/loki.md

Trigger Conditions

Log ingestion stops.
Queries fail or time out.
Object-store auth errors appear.
Gateway or ingress routes return errors.

1. Health Checks

kubectl -n loki get pods,svc,pvc,ingressroute
kubectl -n loki logs deploy/loki-gateway --tail=200
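
If the pods look healthy but clients still see errors, probing the read path through the gateway can narrow things down. The following is a minimal sketch, not a validated procedure: it assumes a Service named `loki-gateway` listening on port 80 (inferred from the deployment name above, not confirmed by this runbook), a free local port 3100, and no authentication at the gateway. On multi-tenant setups the request may also need an `X-Scope-OrgID` header.

```shell
# Probe Loki through the gateway using a temporary port-forward.
# Assumptions (not confirmed by this runbook): Service "loki-gateway",
# service port 80, local port 3100 free, no gateway authentication.
loki_gateway_probe() {
  svc="${1:-loki-gateway}"
  kubectl -n loki port-forward "svc/${svc}" 3100:80 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # give the port-forward time to bind
  # /loki/api/v1/labels is a read-path endpoint of the Loki HTTP API.
  # Print only the HTTP status code of the response.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    http://127.0.0.1:3100/loki/api/v1/labels)"
  kill "$pf_pid" 2>/dev/null || true
  echo "$code"
}
```

Calling `loki_gateway_probe` prints the status code. A 200 suggests the gateway and query path respond; a 5xx points at the gateway or a backend; a 401 usually means a tenant or auth header is required.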

2. Troubleshooting Workflows

Check the failing path first: gateway, ingestion, storage, or query.

kubectl -n loki describe pod <failing-pod>
kubectl -n loki get secret
kubectl -n loki logs <failing-pod> --tail=200

Inspect object-store credentials, compactor state, and ingress health.
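
To spot object-store auth failures quickly, pod logs can be filtered for common error codes. This is a sketch that assumes an S3-compatible object store; the pattern list covers standard S3 error codes and is not exhaustive, and the helper name `filter_storage_auth_errors` is made up for illustration.

```shell
# Filter log text on stdin for common S3-style authentication failures.
# Assumption: the object store is S3-compatible; adjust the pattern otherwise.
filter_storage_auth_errors() {
  grep -E -i 'AccessDenied|InvalidAccessKeyId|SignatureDoesNotMatch|ExpiredToken|status code: 403'
}

# Usage (not run here):
#   kubectl -n loki logs <failing-pod> --tail=1000 | filter_storage_auth_errors
#   kubectl -n loki logs <failing-pod> --previous --tail=200   # logs from before the last restart
#   kubectl -n loki get events --sort-by=.lastTimestamp        # recent scheduling or mount problems
```

An empty result does not rule out storage problems; it only means none of the listed signatures appeared in the sampled lines.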

3. Disaster Recovery

Restore storage credentials and object-store access.
Restore PVC-backed state if the deployment mode uses local storage.
Reconcile the environment-specific Loki path.
Validate ingestion and query APIs.
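
The final validation step can be sketched as a push followed by a query against the Loki HTTP API. The function names, the `runbook-smoke` job label, and the base URL are hypothetical; the sketch also assumes a single-tenant setup (multi-tenant Loki needs an `X-Scope-OrgID` header) and a message without double quotes or backslashes.

```shell
# Build a Loki push payload with one log line.
# The timestamp is nanoseconds since the epoch, sent as a string.
# Limitation: the message is not JSON-escaped, so avoid quotes and backslashes.
loki_push_payload() {
  ts_ns="$(date +%s)000000000"
  printf '{"streams":[{"stream":{"job":"%s"},"values":[["%s","%s"]]}]}\n' "$1" "$ts_ns" "$2"
}

# Push one line, then read it back. $1 is the Loki base URL,
# for example http://127.0.0.1:3100 behind a port-forward (hypothetical).
loki_smoke_test() {
  base_url="$1"
  curl -s -o /dev/null -w 'push: %{http_code}\n' -X POST \
    -H 'Content-Type: application/json' \
    --data "$(loki_push_payload runbook-smoke 'dr validation')" \
    "${base_url}/loki/api/v1/push"
  sleep 2   # allow the line to become queryable
  curl -s -G "${base_url}/loki/api/v1/query_range" \
    --data-urlencode 'query={job="runbook-smoke"}' \
    --data-urlencode 'limit=5'
}
```

A successful push normally returns HTTP 204, and the query response should contain the pushed line. If the push succeeds but the query returns nothing, the read path or storage is the more likely suspect.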

4. Scaling and Resource Management

kubectl -n loki top pod

Tune resources or replica counts in Git for the saturated component only.
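
To identify which component is saturated, live usage can be compared with the configured requests and limits. A minimal sketch, assuming metrics-server is installed (already required by `kubectl top`); the function name `loki_resource_view` is illustrative only.

```shell
# Show live usage next to configured requests and limits for the loki namespace.
loki_resource_view() {
  # Live usage per container, highest memory first.
  kubectl -n loki top pod --containers --sort-by=memory
  # Configured requests and limits, for comparison against the usage above.
  kubectl -n loki get pods -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
}
```

A container whose memory usage sits close to `MEM_LIM`, or that shows `OOMKilled` in `kubectl describe pod`, is a candidate for tuning; the others can stay as they are.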

5. Maintenance Procedures

Review retention and compaction settings.
Rotate storage credentials.
Validate the active Loki mode before applying chart upgrades.
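
One rough way to check the active mode before an upgrade is to look at which workloads exist. The heuristic below is only a sketch: it assumes the upstream Grafana Loki Helm chart with default component names (`-write`/`-read` for simple scalable, `ingester`/`distributor` for distributed), which this runbook does not confirm. The function name `guess_loki_mode` is hypothetical.

```shell
# Guess the Loki deployment mode from workload names read on stdin.
# Assumption: default component naming from the upstream Helm chart.
guess_loki_mode() {
  names="$(cat)"
  case "$names" in
    *ingester*|*distributor*) echo "distributed" ;;
    *-write*|*-read*)         echo "simple-scalable" ;;
    *)                        echo "single-binary-or-unknown" ;;
  esac
}

# Usage (not run here):
#   kubectl -n loki get deploy,statefulset -o name | guess_loki_mode
```

Treat the result as a hint and confirm it against the values committed in Git for the target cluster, since renamed components would defeat the name matching.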

6. Rollback Strategy

Revert the environment-specific Loki revision.
Restore storage state if an upgrade or mode change corrupted ingestion.

7. Post-Incident Actions

Capture manual recovery in a changelog fragment.
Update the service page if deployment modes or clusters changed.
Add the exact failure signature to this runbook.