prometheus Runbook

Metadata

Field                 Value
Service               prometheus
Criticality           Tier 1
Owner                 Platform / Observability owner
Namespace             prometheus
Clusters              jls
Last validated        2026-05-20
Related service page  ../services/prometheus.md

Trigger Conditions

  • Alerting stops.
  • Scrape targets are missing.
  • TSDB storage is full or unavailable.
  • External Prometheus or Alertmanager endpoints fail.

1. Health Checks

kubectl -n prometheus get pods,svc,pvc,ingressroute
kubectl -n prometheus logs statefulset/prometheus-server --tail=200
kubectl -n prometheus logs deploy/alertmanager --tail=100
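
Pod status and logs do not always show whether the server is actually answering requests. As a minimal sketch, the helper below probes Prometheus's built-in /-/healthy and /-/ready endpoints. The function name check_endpoint, the local URL, and port 9090 (the Prometheus default) are assumptions, not values taken from this deployment.

```shell
# Sketch only. Assumes a port-forward is running in another terminal, e.g.:
#   kubectl -n prometheus port-forward statefulset/prometheus-server 9090:9090
# Port 9090 is the Prometheus default and may differ in this deployment.

# check_endpoint URL: prints "ok URL" if the request succeeds,
# "FAIL URL" (and returns non-zero) otherwise.
check_endpoint() {
  if curl -fsS --max-time 5 -o /dev/null "$1"; then
    echo "ok $1"
  else
    echo "FAIL $1"
    return 1
  fi
}

# Usage (hypothetical local URL):
#   check_endpoint http://localhost:9090/-/healthy
#   check_endpoint http://localhost:9090/-/ready
```

Alertmanager exposes endpoints with the same paths, so the same helper should work against a port-forward to deploy/alertmanager.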

2. Troubleshooting Workflows

Check scraping, alerting, and storage:

kubectl -n prometheus describe statefulset prometheus-server
kubectl -n prometheus get configmap
kubectl -n prometheus get secret

In the pod logs and the describe output, look for rule errors, remote-write failures, and signs of disk pressure such as failed writes or evicted pods.
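
To narrow a long log down to the lines worth reading, a small filter can help. The sketch below keeps only warn and error lines and optionally narrows them by a keyword. The name filter_problem_logs and the keyword examples are hypothetical, and the level= field assumes the server logs in its usual key=value format.

```shell
# filter_problem_logs [PATTERN]: reads log lines on stdin, keeps lines whose
# level is warn or error, then optionally keeps only those matching PATTERN.
# Matching is case-insensitive because level casing may differ by version.
filter_problem_logs() {
  grep -Ei 'level=(warn|error)' | grep -Ei "${1:-.}" || true
}

# Usage:
#   kubectl -n prometheus logs statefulset/prometheus-server --tail=200 \
#     | filter_problem_logs
#   kubectl -n prometheus logs statefulset/prometheus-server --tail=200 \
#     | filter_problem_logs 'rule|remote|no space left'
```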

3. Disaster Recovery

  1. Restore TSDB storage and alerting secrets.
  2. Reconcile the rendered overlay.
  3. Confirm scrape targets recover.
  4. Trigger a test alert and validate that it reaches the receiver.
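
One way to exercise the alert path in the last step is to post a synthetic alert directly to Alertmanager's v2 API. This is a sketch under assumptions: the function names, the alert name RunbookTestAlert, the severity label, and port 9093 (the Alertmanager default) are illustrative, and the alert only reaches a receiver if the routing configuration matches these labels.

```shell
# test_alert_payload NAME: prints a minimal Alertmanager v2 alert list as JSON.
# Label and annotation values are placeholders for illustration.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Runbook test alert, safe to ignore"}}]\n' "$1"
}

# send_test_alert BASE_URL [NAME]: posts the payload to BASE_URL/api/v2/alerts.
send_test_alert() {
  test_alert_payload "${2:-RunbookTestAlert}" |
    curl -fsS --max-time 10 -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$1/api/v2/alerts"
}

# Usage, assuming a port-forward to Alertmanager in another terminal:
#   kubectl -n prometheus port-forward deploy/alertmanager 9093:9093
#   send_test_alert http://localhost:9093
```

Because no end time is set, Alertmanager should resolve the alert on its own once its resolve timeout passes.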

4. Scaling and Resource Management

kubectl -n prometheus top pod

When scrape volume outgrows the current profile, raise the storage or memory allocation, or tighten the retention settings, in Git rather than editing live resources.
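
kubectl top reports CPU and memory but not how full the TSDB volume is. A rough sketch for reading volume usage from inside the pod follows. The pod name prometheus-server-0 and the mount path /data are assumptions, so check the describe output for the real values. The helper name usage_percent is hypothetical.

```shell
# usage_percent: reads `df -P` output on stdin and prints the Capacity column
# of the first data row, without the percent sign.
usage_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage (pod name and mount path are assumptions; add -c CONTAINER if the
# pod runs more than one container):
#   kubectl -n prometheus exec prometheus-server-0 -- df -P /data | usage_percent
```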

5. Maintenance Procedures

  • Rotate alert receiver and remote-write secrets.
  • Review rule changes before merges.
  • Plan retention changes carefully because TSDB rewrites can be expensive.
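
Rule review can be partly automated with promtool, which ships with Prometheus. The sketch below runs promtool check rules over a set of files and reports which ones fail. The function name check_rule_files and the example paths are hypothetical, and it assumes promtool is on the PATH of whoever runs the review.

```shell
# check_rule_files FILE...: runs `promtool check rules` on each file,
# prints one status line per file, and returns non-zero if any file failed.
check_rule_files() {
  failed=0
  for f in "$@"; do
    if promtool check rules "$f"; then
      echo "ok   $f"
    else
      echo "FAIL $f"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Usage (hypothetical paths):
#   check_rule_files rules/*.yml
```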

6. Rollback Strategy

  • Revert the overlay to the previous working revision.
  • Restore the prior TSDB snapshot if a config or upgrade damaged the data path.

7. Post-Incident Actions

  1. Add a changelog fragment covering manual recovery.
  2. Update the service page if endpoints or integrations changed.
  3. Add the observed failure mode to this runbook.