prometheus Runbook
Metadata
| Field | Value |
|---|---|
| Service | prometheus |
| Criticality | Tier 1 |
| Owner | Platform / Observability owner |
| Namespace | prometheus |
| Clusters | jls |
| Last validated | 2026-05-20 |
| Related service page | ../services/prometheus.md |
Trigger Conditions
- Alerting stops.
- Scrape targets are missing.
- TSDB storage is full or unavailable.
- External Prometheus or Alertmanager endpoints fail.
1. Health Checks
kubectl -n prometheus get pods,svc,pvc,ingressroute
kubectl -n prometheus logs statefulset/prometheus-server --tail=200
kubectl -n prometheus logs deploy/alertmanager --tail=100
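Beyond pod status and logs, Prometheus serves `/-/healthy` and `/-/ready` management endpoints. The following is a minimal sketch that probes them through a temporary port-forward. The Service name `prometheus-server` and port 9090 are assumptions; confirm them against the `get svc` output before use.

```shell
#!/usr/bin/env bash
# Sketch: probe Prometheus health endpoints through a temporary port-forward.
# Assumed (not confirmed by this runbook): Service "prometheus-server", port 9090.

NAMESPACE="${NAMESPACE:-prometheus}"
PROM_SVC="${PROM_SVC:-prometheus-server}"   # assumed Service name
LOCAL_PORT="${LOCAL_PORT:-9090}"            # local end of the port-forward

# Print the local URL for a given Prometheus path.
prom_url() {
  printf 'http://127.0.0.1:%s%s\n' "$LOCAL_PORT" "$1"
}

# Forward the port, probe /-/healthy and /-/ready, then clean up.
# Returns non-zero if either probe fails.
check_prometheus_health() {
  local pf_pid rc=0 endpoint
  kubectl -n "$NAMESPACE" port-forward "svc/$PROM_SVC" "$LOCAL_PORT:9090" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # give the port-forward a moment to establish
  for endpoint in /-/healthy /-/ready; do
    if curl -fsS --max-time 5 "$(prom_url "$endpoint")" >/dev/null; then
      echo "OK   $endpoint"
    else
      echo "FAIL $endpoint"
      rc=1
    fi
  done
  kill "$pf_pid" 2>/dev/null
  return "$rc"
}
```

Run `check_prometheus_health` from a shell with cluster access. A `FAIL` on `/-/ready` while `/-/healthy` passes usually means the server is up but still starting, for example during WAL replay.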
2. Troubleshooting Workflows
Check scraping, alerting, and storage:
kubectl -n prometheus describe statefulset prometheus-server
kubectl -n prometheus get configmap
kubectl -n prometheus get secret
In the StatefulSet events and the server logs, look for rule evaluation errors, remote-write failures, and disk pressure on the TSDB volume.
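One way to narrow the search is to filter the server logs down to warning and error lines and to check how full the TSDB volume is. This sketch assumes logfmt-style log output (`level=...`), a data mount at `/data`, and a `df` binary in the container image. None of these are confirmed by this runbook, so verify them against the StatefulSet spec.

```shell
#!/usr/bin/env bash
# Sketch: surface warning/error log lines and check TSDB disk usage.

NAMESPACE="${NAMESPACE:-prometheus}"
TSDB_PATH="${TSDB_PATH:-/data}"   # assumed mount path; check volumeMounts

# Keep only warning and error lines from log text on stdin.
# Case-insensitive, since level casing may differ between versions.
filter_log_problems() {
  grep -Ei 'level=(warn|error)' || true   # no match is not a failure
}

# Scan recent server logs for problems.
scan_prometheus_logs() {
  kubectl -n "$NAMESPACE" logs statefulset/prometheus-server --tail=1000 \
    | filter_log_problems
}

# Show free space on the TSDB volume.
check_tsdb_disk() {
  kubectl -n "$NAMESPACE" exec statefulset/prometheus-server -- df -h "$TSDB_PATH"
}
```

Pipe `scan_prometheus_logs` into a further `grep -i` for `rule`, `remote`, or `no space` to focus on one failure class at a time.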
3. Disaster Recovery
- Restore TSDB storage and alerting secrets.
- Reconcile the rendered overlay.
- Confirm scrape targets recover.
- Trigger and validate a test alert path.
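The last step above can be exercised by posting a synthetic alert straight to Alertmanager's v2 API (`POST /api/v2/alerts`). This is a sketch under stated assumptions: Alertmanager is reachable on local port 9093, for example via `kubectl -n prometheus port-forward deploy/alertmanager 9093:9093`, and the alert name and labels are hypothetical placeholders that should match a route you actually want to test.

```shell
#!/usr/bin/env bash
# Sketch: send a synthetic alert to Alertmanager to validate the alert path.
# The alert name, labels, and local URL are illustrative assumptions.

# Print a JSON payload containing one test alert.
test_alert_payload() {
  local name="${1:-RunbookTestAlert}"   # hypothetical alert name
  printf '[{"labels":{"alertname":"%s","severity":"info","source":"runbook-dr-test"},"annotations":{"summary":"Manual test alert after prometheus recovery"}}]\n' "$name"
}

# POST the payload to Alertmanager's v2 alerts endpoint.
send_test_alert() {
  local am_url="${1:-http://127.0.0.1:9093}"   # assumed port-forwarded URL
  test_alert_payload "${2:-RunbookTestAlert}" \
    | curl -fsS -X POST -H 'Content-Type: application/json' \
        --data-binary @- "$am_url/api/v2/alerts"
}
```

After `send_test_alert`, confirm the notification arrives at the expected receiver. Because the payload sets no end time, Alertmanager treats the alert as resolved once its configured resolve timeout passes without a refresh.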
4. Scaling and Resource Management
When scrape volume outgrows the current profile, adjust storage size, memory limits, or retention settings in Git.
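To size a storage change, the Prometheus storage documentation gives a rough formula: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula and also extracts the current retention flags from a `/api/v1/status/flags` response. The function names are illustrative, and the result is an estimate, not a guarantee.

```shell
#!/usr/bin/env bash
# Sketch: rough TSDB capacity estimate and current retention flags.

# Estimate bytes needed: days * 86400 * samples/sec * bytes/sample.
# bytes_per_sample defaults to 2 (upper end of the documented 1-2 range).
estimate_tsdb_bytes() {
  local retention_days="$1" samples_per_sec="$2" bytes_per_sample="${3:-2}"
  echo $(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
}

# Pull retention flags out of a /api/v1/status/flags JSON body on stdin.
extract_retention_flags() {
  grep -Eo '"storage\.tsdb\.retention\.[a-z]+":"[^"]*"' || true
}

# Example (assumes Prometheus is port-forwarded to localhost:9090):
#   curl -fsS http://127.0.0.1:9090/api/v1/status/flags | extract_retention_flags
#   estimate_tsdb_bytes 15 100000 2   # prints 259200000000 (about 259 GB)
```

The ingestion rate can be read from Prometheus itself with `rate(prometheus_tsdb_head_samples_appended_total[5m])`. Leave headroom above the estimate for the WAL and for compaction.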
5. Maintenance Procedures
- Rotate alert receiver and remote-write secrets.
- Review rule changes before merges.
- Plan retention changes carefully because TSDB rewrites can be expensive.
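Rule review before a merge can be backed by a syntax check with `promtool check rules`. This sketch assumes rule files are kept as plain Prometheus rule YAML under a directory in the Git checkout. The `rules` default path is hypothetical, and rules embedded in ConfigMaps or custom resources would need to be extracted first.

```shell
#!/usr/bin/env bash
# Sketch: validate rule files with promtool before merging.
# RULES_DIR is a hypothetical path; point it at the real rules directory.

RULES_DIR="${RULES_DIR:-rules}"

# List candidate rule files (*.yml and *.yaml) under a directory.
list_rule_files() {
  find "${1:-$RULES_DIR}" -type f \( -name '*.yml' -o -name '*.yaml' \) | sort
}

# Run promtool over every rule file; non-zero exit if any check fails.
check_rule_files() {
  local dir="${1:-$RULES_DIR}"
  find "$dir" -type f \( -name '*.yml' -o -name '*.yaml' \) \
    -exec promtool check rules {} +
}
```

Running `check_rule_files` in CI or as a pre-merge step catches malformed expressions and invalid rule groups before they reach the cluster.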
6. Rollback Strategy
- Revert the overlay to the previous working revision.
- Restore the prior TSDB snapshot if a config change or upgrade damaged the data path.
7. Post-Incident Actions
- Add a changelog fragment covering manual recovery.
- Update the service page if endpoints or integrations changed.
- Add the observed failure mode to this runbook.