prometheus Runbook

Metadata

Field	Value
Service	prometheus
Criticality	Tier 1
Owner	Platform / Observability owner
Namespace	prometheus
Clusters	jls
Last validated	2026-05-20
Related service page	../services/prometheus.md

Trigger Conditions

Alerting stops.
Scrape targets are missing.
TSDB storage is full or unavailable.
External Prometheus or Alertmanager endpoints fail.

1. Health Checks

kubectl -n prometheus get pods,svc,pvc,ingressroute
kubectl -n prometheus logs statefulset/prometheus-server --tail=200
kubectl -n prometheus logs deploy/alertmanager --tail=100
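
Pod status and logs do not always show whether the servers are actually answering requests. As a minimal sketch, the helpers below probe the `/-/healthy` and `/-/ready` management endpoints that both Prometheus and Alertmanager expose. The function names, the Service names, and the local ports in the usage comments are assumptions, not values taken from this cluster.

```shell
# Probe one management endpoint and print a one-line status.
# Usage: probe_endpoint <label> <url>
probe_endpoint() {
  if curl -fsS --max-time 5 -o /dev/null "$2"; then
    echo "OK   $1 $2"
  else
    echo "FAIL $1 $2"
    return 1
  fi
}

# Check liveness and readiness for a Prometheus or Alertmanager base URL.
# Usage: check_health <label> <base_url>
check_health() {
  rc=0
  probe_endpoint "$1 healthy" "$2/-/healthy" || rc=1
  probe_endpoint "$1 ready" "$2/-/ready" || rc=1
  return "$rc"
}

# Example (Service names and ports are assumptions; confirm them with
# `kubectl -n prometheus get svc` first):
#   kubectl -n prometheus port-forward svc/prometheus-server 9090:9090 &
#   check_health prometheus http://localhost:9090
#   kubectl -n prometheus port-forward svc/alertmanager 9093:9093 &
#   check_health alertmanager http://localhost:9093
```

A `FAIL` on `ready` with an `OK` on `healthy` usually means the process is up but still starting, for example while replaying the write-ahead log.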

2. Troubleshooting Workflows

Check scraping, alerting, and storage:

kubectl -n prometheus describe statefulset prometheus-server
kubectl -n prometheus get configmap
kubectl -n prometheus get secret

In the StatefulSet events and the pod logs, look for rule evaluation errors, remote-write failures, and disk pressure.
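
To narrow a long log down to those failure classes, a small filter can help. The sketch below keeps only warning- and error-level lines and counts lines that match a pattern. The function names and the example patterns are assumptions; adjust the patterns to the messages you actually see.

```shell
# Keep only warning- and error-level lines from logs read on stdin.
# Case-insensitive, so it matches both level=error and level=ERROR.
filter_log_problems() {
  grep -Ei 'level=(warn|error)' || true
}

# Count lines on stdin that match an extended regex (case-insensitive).
# grep -c prints 0 and exits 1 when nothing matches; keep the 0.
# Usage: count_matches <pattern>
count_matches() {
  grep -Eic "$1" || true
}

# Example (patterns are illustrative guesses, not an exhaustive list):
#   kubectl -n prometheus logs statefulset/prometheus-server --tail=1000 > /tmp/prom.log
#   filter_log_problems < /tmp/prom.log
#   count_matches 'no space left on device' < /tmp/prom.log
#   count_matches 'remote' < /tmp/prom.log
```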

3. Disaster Recovery

Restore TSDB storage and alerting secrets.
Reconcile the rendered overlay.
Confirm scrape targets recover.
Trigger and validate a test alert path.
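
The last two steps above can be sketched with the Prometheus and Alertmanager HTTP APIs. The helpers below run an instant query (for example `up == 0` to list targets that are still down) and post a synthetic alert to Alertmanager's `/api/v2/alerts` endpoint. The function names, the alert name `RunbookTestAlert`, its labels, and the local URLs are hypothetical; choose labels that match the route you want to exercise.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# Usage: prom_query <prometheus_base_url> <promql>
prom_query() {
  curl -fsS --get --data-urlencode "query=$2" "$1/api/v1/query"
}

# Post a synthetic alert to the Alertmanager v2 API. No endsAt is set,
# so the alert resolves after Alertmanager's resolve_timeout unless it
# is sent again. Alert name and labels are hypothetical.
# Usage: send_test_alert <alertmanager_base_url>
send_test_alert() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d '[{"labels":{"alertname":"RunbookTestAlert","severity":"info"},"annotations":{"summary":"Runbook test alert, safe to ignore"}}]' \
    "$1/api/v2/alerts"
}

# Example (assumes port-forwards to the default ports 9090 and 9093):
#   prom_query http://localhost:9090 'up == 0'
#   send_test_alert http://localhost:9093
```

An empty `result` array from the `up == 0` query means every scraped target currently reports as up. After sending the test alert, confirm it arrives at the intended receiver rather than only in the Alertmanager UI.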

4. Scaling and Resource Management

kubectl -n prometheus top pod

When scrape volume outgrows the current profile, adjust storage size, memory limits, or retention settings in Git.
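
`kubectl top` covers CPU and memory but not disk. As a rough sketch, the helper below reads `df -P` output and flags filesystems at or above a usage threshold. The function name, the pod name `prometheus-server-0`, and the `/data` mount path in the usage comment are assumptions; check the StatefulSet spec for the real volume mount.

```shell
# Read `df -P` output on stdin and print filesystems at or above a usage
# threshold (percent). Returns 1 if any filesystem is over the threshold.
# Usage: df -P <path> | check_disk_usage <percent>
check_disk_usage() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) {
        printf "%s is %s%% full (mounted on %s)\n", $1, use, $6
        over = 1
      }
    }
    END { exit (over ? 1 : 0) }
  '
}

# Example (pod name and mount path are assumptions):
#   kubectl -n prometheus exec prometheus-server-0 -- df -P /data | check_disk_usage 80
```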

5. Maintenance Procedures

Rotate alert receiver and remote-write secrets.
Review rule changes before merges.
Plan retention changes carefully because TSDB rewrites can be expensive.
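
Rule review can be backed by a syntax check before merge. The sketch below runs `promtool check rules` on each file and prints a pass or fail line per file. The function name and the example path are assumptions; it also assumes `promtool` is installed where the check runs.

```shell
# Validate rule files with promtool, one file at a time, so each file
# gets its own PASS or FAIL line. Returns 1 if any file fails.
# Usage: check_rule_files <file>...
check_rule_files() {
  rc=0
  for f in "$@"; do
    if promtool check rules "$f"; then
      echo "PASS $f"
    else
      echo "FAIL $f"
      rc=1
    fi
  done
  return "$rc"
}

# Example (the rules/ path is hypothetical):
#   check_rule_files rules/*.yml
```

The non-zero return code makes the function usable as a pre-merge gate in CI or a local hook.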

6. Rollback Strategy

Revert the overlay to the previous working revision.
Restore the prior TSDB snapshot if a config or upgrade damaged the data path.
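
Restoring a snapshot presupposes that one was taken before the risky change. One possible way to do that, sketched below, is the Prometheus TSDB admin API, which only responds when the server runs with `--web.enable-admin-api`. Whether that flag is enabled here is an assumption, as are the function name and the local URL.

```shell
# Ask Prometheus to create a TSDB snapshot and print the snapshot name.
# Prometheus writes the snapshot under <data-dir>/snapshots/<name>.
# Returns 1 if no name comes back (request failed or admin API disabled).
# Usage: take_tsdb_snapshot <prometheus_base_url>
take_tsdb_snapshot() {
  snapshot_name=$(curl -fsS -X POST "$1/api/v1/admin/tsdb/snapshot" |
    sed -n 's/.*"name":"\([^"]*\)".*/\1/p')
  if [ -n "$snapshot_name" ]; then
    echo "$snapshot_name"
  else
    echo "snapshot failed or admin API disabled" >&2
    return 1
  fi
}

# Example (assumes a port-forward to the default port 9090):
#   take_tsdb_snapshot http://localhost:9090
```

Record the printed name alongside the overlay revision so both can be rolled back together.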

7. Post-Incident Actions

Add a changelog fragment covering manual recovery.
Update the service page if endpoints or integrations changed.
Add the observed failure mode to this runbook.