k8s-monitoring
| Field | Value |
| --- | --- |
| Service | k8s-monitoring |
| Purpose | Collect, transform, and forward metrics, logs, events, and traces from cluster workloads to the shared observability backends |
| Criticality | Tier 1 |
| Owner | Platform / Observability owner |
| Clusters | homelab, local, jls |
| Namespace | k8s-monitoring |
| Exposure | Internal by default, with external OTLP and tracing receiver endpoints on selected overlays |
| Stateful | yes |
| Backup class | Best-effort PVC and hostPath state, Git-backed configuration |
| RPO / RTO | Telemetry loss is acceptable during rebuild, 30 to 60 minutes to restore collection paths |
| Last reviewed | 2026-04-17 |
1. Service Overview
k8s-monitoring is the shared observability pipeline built from the Grafana k8s-monitoring Helm chart plus the Alloy operator CRD. It collects cluster metrics, pod logs, events, and optional receiver traffic, then forwards telemetry to shared Prometheus, Loki, and Tempo destinations.
Summary
The service is not part of workload delivery, but it is essential for diagnosing incidents and maintaining platform visibility. The repository carries several overlay versions in parallel, and cluster-specific values determine whether Alloy receivers are enabled and externally exposed.
Dependencies
| Dependency | Type | Why it matters |
| --- | --- | --- |
| Grafana k8s-monitoring Helm chart | packaging | The workload is rendered by Helm through Kustomize |
| Alloy operator CRD | operator dependency | Collector resources depend on the Alloy custom resources being registered |
| Prometheus and Loki endpoints on mutana.site | telemetry backend | Remote write and log push paths terminate there |
| Grafana Cloud Tempo | trace backend | OTLP traces are forwarded to the hosted Tempo endpoint |
| Traefik and Authelia | ingress | Receiver-enabled overlays expose OTLP, Jaeger, and Zipkin endpoints through Traefik |
| Storage classes and hostPath mounts | state | Alloy WAL and log positions depend on PVCs and hostPath mounts |
2. Architecture Diagram
[Cluster metrics, logs, events]
-> [Alloy metrics / logs / singleton collectors]
-> [Prometheus remote write and Loki push on mutana.site]
[External OTLP and tracing clients]
-> [Traefik IngressRoute]
-> [Alloy receiver service]
-> [Grafana Cloud Tempo]
3. Deployment Specifications
| Item | Value |
| --- | --- |
| Source path | k8s-monitoring/overlays/* |
| Deployment model | Helm-via-Kustomize plus Alloy operator CRD resources |
| Namespace | k8s-monitoring |
| Workload kind | Helm-managed Deployments, DaemonSets, StatefulSets, Services, and Alloy custom resources |
| Chart or image version | homelab: chart 3.0.2; local: Grafana k8s-monitoring chart with Alloy operator CRD; jls: chart 3.7.2 |
| Config files | overlays//kustomization.yaml, overlays//values.yaml, optional alloy-receivers-ingressroute.yaml |
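As an illustration of the Helm-via-Kustomize model, an overlay kustomization.yaml might look roughly like the sketch below. The repository URL is the public Grafana Helm repository and the chart version is the homelab one listed above. The releaseName, the commented resources entry, and the overall layout are assumptions and are not copied from the repository.

```yaml
# Hypothetical k8s-monitoring/overlays/homelab/kustomization.yaml (sketch only)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: k8s-monitoring

helmCharts:
  - name: k8s-monitoring
    repo: https://grafana.github.io/helm-charts
    version: 3.0.2                 # homelab chart version stated in this document
    releaseName: k8s-monitoring    # assumed release name
    namespace: k8s-monitoring
    valuesFile: values.yaml        # per-overlay values file

# Receiver-enabled overlays would presumably also list the ingress manifest:
# resources:
#   - alloy-receivers-ingressroute.yaml
```

Rendering an overlay of this shape requires Helm support to be switched on in Kustomize, for example `kustomize build --enable-helm k8s-monitoring/overlays/homelab`.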
Cluster mapping
| Cluster | Overlay path | Notes |
| --- | --- | --- |
| homelab | k8s-monitoring/overlays/homelab | Chart 3.0.2, receiver disabled |
| local | k8s-monitoring/overlays/local | Mixed arm64 and amd64 nodes; alloy-metrics StatefulSet uses local-path WAL; alloy-logs and alloy-singleton run as DaemonSets on all nodes; alloy-receiver enabled and exposed via Traefik on oci-arm |
| jls | k8s-monitoring/overlays/jls | Chart 3.7.2, receiver enabled and exposed, jelastic-dynamic-volume for Alloy WAL PVCs |
4. Configuration Guide
The values files define telemetry destinations, collector presets, storage, and optional receiver exposure. Configuration differs materially between overlays.
Environment variables
| Variable | Source | Purpose | Secret? |
| --- | --- | --- | --- |
| Generated by the Helm chart | chart values and templates | Collector runtime settings are rendered from values.yaml rather than declared manually in the repository | mixed |
ConfigMaps
| Resource | Path | Purpose |
| --- | --- | --- |
| Chart-generated Alloy configuration | k8s-monitoring/overlays/*/values.yaml | Collector pipelines, custom scrape jobs, and destination wiring are rendered from Helm values |
Secrets management
- Secret names: authsecret for receiver basic auth on overlays that include alloy-receivers-ingressroute.yaml
- Source of truth: the receiver auth secret is committed as Kubernetes Secret data inside alloy-receivers-ingressroute.yaml, while destination authentication is currently embedded inline in several values.yaml files
- Rotation trigger: credential rotation for Prometheus, Loki, Grafana Cloud Tempo, or receiver basic auth
- Recovery note: treat the inline credential pattern as technical debt and migrate destination authentication to Secret references before broader rollout
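The migration suggested in the recovery note could look roughly like the values.yaml fragment below, where a destination reads its credentials from a pre-created Secret instead of inline values. This is a sketch under assumptions: the field names follow the k8s-monitoring chart's destinations schema as documented for recent chart versions and should be verified against the chart version each overlay uses. The URL, the Secret name, and the key names are hypothetical.

```yaml
# Sketch only: destination auth via Secret reference rather than inline credentials.
destinations:
  - name: prometheus
    type: prometheus
    url: https://prometheus.example.invalid/api/v1/write   # placeholder, not the real endpoint
    auth:
      type: basic
      usernameKey: username        # key inside the Secret (assumed)
      passwordKey: password        # key inside the Secret (assumed)
    secret:
      create: false                # the chart should not render the Secret itself
      name: prometheus-remote-write    # hypothetical pre-created Secret
      namespace: k8s-monitoring
```

With this shape, rotating a credential means updating the Secret rather than editing and re-rendering values.yaml, which also keeps the credential out of Git.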
Notable configuration facts from the current overlays:
- local uses local-path for alloy-metrics WAL PVC; alloy-logs and alloy-singleton use hostPath at /var/alloy-log-storage on each node.
- The local cluster is mixed-architecture (oci-arm and oci-arm-free1 are arm64; layer7-vps1 is amd64); no architecture-specific nodeSelector is set for DaemonSets or the receiver, so they run on all nodes.
- alloy-metrics is a StatefulSet and will pin to the node where its local-path PVC was first provisioned; review the PVC's selected-node annotation if the pod cannot reschedule.
- The ingress node for the local cluster is oci-arm (node.io/ingress=true, arm64); alloy-receiver is exposed through Traefik running on that node.
- local and jls enable alloy-receiver and publish OTLP, Jaeger, and Zipkin ports.
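The node pinning of alloy-metrics noted above is visible on the PVC itself. A bound local-path PVC would carry an annotation along the lines of the excerpt below. The annotation key is the standard one Kubernetes sets for volumes that wait for first consumer; the PVC name, the node value, and the size are hypothetical examples.

```yaml
# Illustrative PVC excerpt; name, node, and size are hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alloy-wal-k8s-monitoring-alloy-metrics-0   # hypothetical PVC name
  namespace: k8s-monitoring
  annotations:
    # Node chosen when the pod was first scheduled; the PV is affined to it.
    volume.kubernetes.io/selected-node: oci-arm
spec:
  storageClassName: local-path
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi                                 # hypothetical size
```

If the annotated node no longer exists, the pod stays Pending until the PVC is recreated, at the cost of the WAL contents.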
5. Access Protocols
| Path | URL or endpoint | Audience | Auth | TLS terminates at |
| --- | --- | --- | --- | --- |
| Internal | k8s-monitoring-alloy-receiver.k8s-monitoring.svc.cluster.local on ports 4318, 4317, 14250, 6832, 6831, 14268, and 9411 when receiver is enabled | Cluster workloads and operators | Service-specific or none inside the cluster | Service or ingress path dependent |
| External | otlp-http.mutana.site, otlp-grpc.mutana.site, jaeger-grpc.mutana.site, jaeger-binary.mutana.site, jaeger-compact.mutana.site, jaeger-http.mutana.site, zipkin.mutana.site on receiver-enabled overlays | Telemetry clients and operators | Traefik middleware chain plus basic auth middleware alloy-auth | Traefik |
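To make the external path concrete, one receiver route could be expressed roughly as follows. This is a sketch, not the contents of alloy-receivers-ingressroute.yaml: the middleware name, Secret name, hostname, service name, and port come from this document, while the IngressRoute name, the entry point name, the `traefik.io/v1alpha1` API group (older Traefik releases use `traefik.containo.us/v1alpha1`), and the TLS stanza are assumptions.

```yaml
# Sketch of the basic auth middleware and a single OTLP HTTP route.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: alloy-auth
  namespace: k8s-monitoring
spec:
  basicAuth:
    secret: authsecret             # Secret holding the basic auth users
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: alloy-receiver-otlp-http   # hypothetical name
  namespace: k8s-monitoring
spec:
  entryPoints:
    - websecure                    # assumed TLS entry point name
  routes:
    - kind: Rule
      match: Host(`otlp-http.mutana.site`)
      middlewares:
        - name: alloy-auth
      services:
        - name: k8s-monitoring-alloy-receiver
          port: 4318               # OTLP over HTTP
  tls: {}                          # TLS terminates at Traefik; certificate source not shown
```

The other hostnames would follow the same pattern with their respective ports. The gRPC and Jaeger UDP receivers may need different route kinds or service schemes than the HTTP route shown here.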
6. Operations and Observability
- Primary health indicators: collector pods Ready, remote write pipelines healthy, and no sustained backlog in Alloy WAL or log position storage.
- Dashboards or alerts: Prometheus remote write status, Loki push success, and Alloy pod logs.
- Log locations: alloy-metrics, alloy-logs, alloy-singleton, and alloy-receiver pod logs in k8s-monitoring.
- Known failure modes: invalid destination credentials, receiver ingress misconfiguration, local-path PVC pending because a node has been lost or a PV is node-affined to the wrong node, and overlay drift between chart versions.
7. Backup and Recovery Notes
- Backup method: configuration is Git-backed, but Alloy WAL and log-position state are best-effort and can be recreated with temporary telemetry loss.
- Restore prerequisites: namespace k8s-monitoring, Alloy CRD registration, destination credentials, and the relevant storage class or hostPath mounts.
- Related runbook: ../runbooks/k8s-monitoring.md
8. Release and Change Notes
- Current deployed app version: homelab uses chart 3.0.2 and jls uses chart 3.7.2, as listed in the cluster mapping table; local runs the current overlay without a pinned chart version in the kustomization helmCharts block.
- Current chart version: per overlay, as listed in the cluster mapping table.
- Last significant change: consolidated former layer7, oci, and oci-free overlays into a single local overlay for the unified k3s cluster; renamed overlays/layer7 to overlays/local; removed amd64-specific nodeSelectors from collectors; set cluster.name to local.
- Rollback reference: revert the affected overlay values.yaml and kustomization.yaml revision, then rebuild the overlay with Kustomize and Helm enabled.