prometheus
| Field | Value |
| --- | --- |
| Service | prometheus |
| Purpose | Metrics collection, alerting, and time-series storage |
| Criticality | Tier 1 |
| Owner | Platform / Observability owner |
| Clusters | jls |
| Namespace | prometheus |
| Exposure | internet |
| Stateful | yes |
| Backup class | snapshot |
| RPO / RTO | Daily backup target, 2 to 6 hours to restore |
| Last reviewed | 2026-05-20 |
1. Service Overview
Prometheus provides metrics collection, retention, and alert delivery for the JLS cluster deployment represented in this repository.
Summary
If Prometheus fails, metrics queries and alerting stop for the affected environment.
Dependencies
| Dependency | Type | Why it matters |
| --- | --- | --- |
| Alertmanager | alerting | Receives and routes alerts |
| Traefik | ingress | Exposes Prometheus and Alertmanager where configured |
| Persistent storage | storage | Stores TSDB data and alerting state |
2. Architecture Diagram
[Targets / exporters]
-> [Prometheus]
-> [Alertmanager]
-> [Operators]
3. Deployment Specifications
| Item | Value |
| --- | --- |
| Source path | prometheus/overlays/jls |
| Deployment model | Helm chart rendered into a Fleet-managed overlay |
| Namespace | prometheus |
| Workload kind | StatefulSets and Deployments |
| Chart or image version | See the rendered chart version under overlays/jls |
| Config files | overlays/jls plus root fleet.yaml |
Cluster mapping
| Cluster | Overlay path | Notes |
| --- | --- | --- |
| jls | prometheus/overlays/jls | Current JLS deployment |
4. Configuration Guide
Environment variables
| Variable | Source | Purpose | Secret? |
| --- | --- | --- | --- |
| Helm values-driven settings | rendered chart values and secrets | Configure scraping, alerting, and external integrations | mixed |
ConfigMaps
| Resource | Path | Purpose |
| --- | --- | --- |
| Helm-generated ConfigMaps | prometheus/overlays/jls | Rule files, scrape config, and chart runtime config |
Secrets management
- Secret names: alerting, remote-write, and ingress-related secrets in the prometheus namespace
- Source of truth: chart values and runtime secret material
- Rotation trigger: remote-write or alert receiver changes
- Recovery note: restore alerting and remote-write secrets before restarting pods
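
The recovery note above implies a pre-restart check that the expected secrets exist in the prometheus namespace. A minimal sketch follows. The secret names in `REQUIRED_SECRETS` are hypothetical placeholders, since this document names only the secret categories, and the sketch assumes `kubectl` is installed and pointed at the JLS cluster.

```python
import json
import subprocess

# Hypothetical secret names: this document lists only the categories
# (alerting, remote-write, ingress), not the actual Secret objects.
REQUIRED_SECRETS = {
    "alertmanager-receivers",
    "prometheus-remote-write",
    "prometheus-ingress-auth",
}


def missing_secrets(secret_list: dict, required: set) -> list:
    """Return required secret names absent from a `kubectl get secrets -o json` result."""
    present = {item["metadata"]["name"] for item in secret_list.get("items", [])}
    return sorted(required - present)


def list_secrets(namespace: str = "prometheus") -> dict:
    """Read all Secrets in the namespace via kubectl (assumes a valid kube context)."""
    result = subprocess.run(
        ["kubectl", "get", "secrets", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `missing_secrets(list_secrets(), REQUIRED_SECRETS)` returns an empty list when every expected secret is present. Any name it returns should be restored before the pods are restarted.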
5. Access Protocols
| Path | URL or endpoint | Audience | Auth | TLS terminates at |
| --- | --- | --- | --- | --- |
| Internal | Prometheus and Alertmanager services in the namespace | Platform workloads | cluster RBAC | Service / ingress |
| External | https://prometheus.mutana.site and related ingress endpoints | Operators | ingress auth policy | Traefik |
6. Operations and Observability
- Primary health indicators: scrape targets healthy, TSDB writable, and alerts delivered.
- Dashboards or alerts: shared Grafana plus Prometheus self-monitoring.
- Log locations: Prometheus and Alertmanager pod logs.
- Known failure modes: disk pressure, scrape misconfiguration, alert receiver errors, or ingress failures.
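
The "scrape targets healthy" indicator can be checked against the Prometheus HTTP API, which reports active targets and their health at `/api/v1/targets`. The sketch below summarizes that response. The base URL passed to `fetch_targets` is an assumption, and the unauthenticated request would need credentials added if it goes through the ingress auth policy.

```python
import json
import urllib.request
from collections import Counter


def summarize_targets(payload: dict) -> dict:
    """Count active scrape targets by health state ("up", "down", "unknown")."""
    active = payload.get("data", {}).get("activeTargets", [])
    counts = Counter(t.get("health", "unknown") for t in active)
    # List the scrape URLs of every target that is not reporting "up".
    unhealthy = sorted(
        t.get("scrapeUrl", "<no scrapeUrl>") for t in active if t.get("health") != "up"
    )
    return {"total": len(active), "by_health": dict(counts), "unhealthy": unhealthy}


def fetch_targets(base_url: str, timeout: float = 10.0) -> dict:
    """GET /api/v1/targets from a Prometheus server and decode the JSON body."""
    url = base_url.rstrip("/") + "/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `summarize_targets(fetch_targets(base_url))` returns the total target count, a per-state breakdown, and the scrape URLs that are not `up`. A non-empty `unhealthy` list points toward the scrape misconfiguration failure mode listed above.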
7. Backup and Recovery Notes
- Backup method: TSDB PVC snapshot and alerting config backup.
- Restore prerequisites: restored persistent storage and runtime secrets.
- Related runbook: ../runbooks/prometheus.md
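
One way to take the TSDB PVC snapshot is a Kubernetes `VolumeSnapshot` object. The sketch below only builds such a manifest. It assumes the cluster has the CSI snapshot CRDs (`snapshot.storage.k8s.io/v1`) installed, and the PVC and snapshot class names are hypothetical because this document does not state them. The related runbook remains the authoritative procedure.

```python
import json


def volume_snapshot_manifest(name, pvc_name, namespace="prometheus", snapshot_class=None):
    """Build a VolumeSnapshot manifest for the PVC that holds the TSDB data."""
    spec = {"source": {"persistentVolumeClaimName": pvc_name}}
    if snapshot_class:
        # Omitting the class lets the cluster's default VolumeSnapshotClass apply.
        spec["volumeSnapshotClassName"] = snapshot_class
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


def render(manifest: dict) -> str:
    """Serialize the manifest as JSON, which kubectl accepts alongside YAML."""
    return json.dumps(manifest, indent=2)
```

The output of `render(volume_snapshot_manifest("prometheus-tsdb-daily", "prometheus-db"))` could be piped to `kubectl apply -f -`. Both names in that call are placeholders to replace with the real snapshot and PVC names.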
8. Release and Change Notes
- Current deployed app version: see the rendered chart under overlays/jls.
- Current chart version: see the chart version embedded under overlays/jls.
- Last significant change: the repository now documents the Grafana Cloud integration and the rendered overlay path.
- Rollback reference: previous overlay revision in Git.