prometheus

Metadata

Field | Value
Service | prometheus
Purpose | Metrics collection, alerting, and time-series storage
Criticality | Tier 1
Owner | Platform / Observability owner
Clusters | jls
Namespace | prometheus
Exposure | internet
Stateful | yes
Backup class | snapshot
RPO / RTO | Daily backup target / 2 to 6 hours to restore
Last reviewed | 2026-05-20

1. Service Overview

Prometheus collects and retains metrics and evaluates alert rules for the JLS cluster deployment in this repository. Alertmanager receives the resulting alerts and routes them to operators.

Summary

If Prometheus fails, metrics queries and alerting stop for the affected environment.

Dependencies

Dependency | Type | Why it matters
Alertmanager | alerting | Receives and routes alerts
Traefik | ingress | Exposes Prometheus and Alertmanager where configured
Persistent storage | storage | Stores TSDB data and alerting state

2. Architecture Diagram

[Targets / exporters]
  -> [Prometheus]
  -> [Alertmanager]
  -> [Operators]

3. Deployment Specifications

Item | Value
Source path | prometheus/overlays/jls
Deployment model | Helm chart rendered into a Fleet-managed overlay
Namespace | prometheus
Workload kind | StatefulSets and Deployments
Chart or image version | See the rendered chart version under overlays/jls
Config files | overlays/jls plus root fleet.yaml

Cluster mapping

Cluster | Overlay path | Notes
jls | prometheus/overlays/jls | Current JLS deployment

4. Configuration Guide

Environment variables

Variable | Source | Purpose | Secret?
Helm values-driven settings | rendered chart values and secrets | Configure scraping, alerting, and external integrations | mixed

ConfigMaps

Resource | Path | Purpose
Helm-generated ConfigMaps | prometheus/overlays/jls | Rule files, scrape config, and chart runtime config

Secrets management

  • Secret names: alerting, remote-write, and ingress-related secrets in the prometheus namespace
  • Source of truth: chart values and runtime secret material
  • Rotation trigger: remote-write or alert receiver changes
  • Recovery note: restore alerting and remote-write secrets before restarting pods
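The recovery note above implies a preflight step: confirm the expected secrets exist in the namespace before restarting pods. The following is a minimal sketch, not the repository's tooling. It assumes the input is the decoded JSON from `kubectl get secrets -n prometheus -o json`, and the secret names in the usage comment are hypothetical placeholders, since this page does not list the real ones.

```python
from typing import Any, Iterable


def missing_secrets(secret_list: dict[str, Any], required: Iterable[str]) -> list[str]:
    """Return the required secret names that are absent from a Secret list.

    `secret_list` is assumed to be the decoded output of
    `kubectl get secrets -n prometheus -o json` (a List with an "items" array).
    """
    present = {item["metadata"]["name"] for item in secret_list.get("items", [])}
    return sorted(name for name in required if name not in present)


# Hypothetical usage; replace the names with the real secrets for this overlay:
#   import json, sys
#   gaps = missing_secrets(json.load(sys.stdin),
#                          ["alertmanager-receivers", "remote-write-credentials"])
#   if gaps:
#       raise SystemExit(f"restore these secrets before restarting pods: {gaps}")
```

An empty result means every listed secret is present. It does not validate the secret contents.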

5. Access Protocols

Path | URL or endpoint | Audience | Auth | TLS terminates at
Internal | Prometheus and Alertmanager services in the namespace | Platform workloads | cluster RBAC | Service / ingress
External | https://prometheus.mutana.site and related ingress endpoints | Operators | ingress auth policy | Traefik

6. Operations and Observability

  • Primary health indicators: scrape targets healthy, TSDB writable, and alerts delivered.
  • Dashboards or alerts: shared Grafana plus Prometheus self-monitoring.
  • Log locations: Prometheus and Alertmanager pod logs.
  • Known failure modes: disk pressure, scrape misconfiguration, alert receiver errors, or ingress failures.
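The "scrape targets healthy" indicator can be checked against Prometheus's HTTP API, which reports each active target's health as `up`, `down`, or `unknown` at `/api/v1/targets`. The sketch below is illustrative only. The base URL is an assumption (for example a local port-forward to the Prometheus service on port 9090), and it does not handle the ingress auth policy that protects the external endpoint.

```python
import json
import urllib.request
from typing import Any


def fetch_targets(base_url: str, timeout: float = 5.0) -> dict[str, Any]:
    """GET <base_url>/api/v1/targets and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def summarize_targets(payload: dict[str, Any]) -> dict[str, list[str]]:
    """Group active scrape targets by their reported health."""
    summary: dict[str, list[str]] = {"up": [], "down": [], "unknown": []}
    for target in payload.get("data", {}).get("activeTargets", []):
        health = target.get("health", "unknown")
        # Anything other than up/down is treated as unknown.
        bucket = health if health in summary else "unknown"
        summary[bucket].append(target.get("scrapeUrl", "<no scrapeUrl>"))
    return summary


# Assumed usage against a port-forwarded instance:
#   summary = summarize_targets(fetch_targets("http://localhost:9090"))
#   print(len(summary["down"]), "targets down:", summary["down"])
```

A non-empty `down` list usually points at the scrape misconfiguration or target outage cases listed under known failure modes. It says nothing about TSDB writability or alert delivery, which need separate checks.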

7. Backup and Recovery Notes

  • Backup method: TSDB PVC snapshot and alerting config backup.
  • Restore prerequisites: restored persistent storage and runtime secrets.
  • Related runbook: ../runbooks/prometheus.md
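One way to keep the daily backup target honest is to compare the age of the newest snapshot against the RPO. The sketch below assumes "daily" means a 24-hour window and that snapshot creation times are available as timezone-aware datetimes; how those timestamps are collected (storage provider API, VolumeSnapshot objects, or otherwise) is left open because this page does not specify it.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

# Assumption: the "daily backup target" is read as a 24-hour RPO.
DAILY_RPO = timedelta(hours=24)


def rpo_met(snapshot_times: Iterable[datetime], now: datetime,
            rpo: timedelta = DAILY_RPO) -> bool:
    """Return True if the newest snapshot is no older than the RPO window."""
    times = list(snapshot_times)
    if not times:
        # No snapshots at all can never satisfy the RPO.
        return False
    return now - max(times) <= rpo


# Assumed usage:
#   ok = rpo_met(snapshot_creation_times, datetime.now(timezone.utc))
```

A `False` result means a restore right now would lose more than a day of TSDB data, which is worth alerting on before a restore is ever needed.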

8. Release and Change Notes

  • Current deployed app version: see the rendered chart under overlays/jls.
  • Current chart version: see the chart version embedded under overlays/jls.
  • Last significant change: the repository now documents the Grafana Cloud integration and the rendered overlay path.
  • Rollback reference: previous overlay revision in Git.