k8s-monitoring

Metadata

| Field | Value |
| --- | --- |
| Service | k8s-monitoring |
| Purpose | Collect, transform, and forward metrics, logs, events, and traces from cluster workloads to the shared observability backends |
| Criticality | Tier 1 |
| Owner | Platform / Observability owner |
| Clusters | homelab, local, jls |
| Namespace | k8s-monitoring |
| Exposure | Internal by default, with external OTLP and tracing receiver endpoints on selected overlays |
| Stateful | yes |
| Backup class | Best-effort PVC and hostPath state, Git-backed configuration |
| RPO / RTO | RPO: telemetry loss is acceptable during rebuild. RTO: 30 to 60 minutes to restore collection paths |
| Last reviewed | 2026-04-17 |

1. Service Overview

k8s-monitoring is the shared observability pipeline built from the Grafana k8s-monitoring Helm chart plus the Alloy operator CRD. It collects cluster metrics, pod logs, events, and optional receiver traffic, then forwards telemetry to shared Prometheus, Loki, and Tempo destinations.

Summary

The service is not part of workload delivery, but it is essential for diagnosing incidents and maintaining platform visibility. The repository carries several overlay versions in parallel, and cluster-specific values determine whether Alloy receivers are enabled and externally exposed.

Dependencies

| Dependency | Type | Why it matters |
| --- | --- | --- |
| Grafana k8s-monitoring Helm chart | packaging | The workload is rendered by Helm through Kustomize |
| Alloy operator CRD | operator dependency | Collector resources depend on the Alloy custom resources being registered |
| Prometheus and Loki endpoints on mutana.site | telemetry backend | Remote write and log push paths terminate there |
| Grafana Cloud Tempo | trace backend | OTLP traces are forwarded to the hosted Tempo endpoint |
| Traefik and Authelia | ingress | Receiver-enabled overlays expose OTLP, Jaeger, and Zipkin endpoints through Traefik |
| Storage classes and hostPath mounts | state | Alloy WAL and log positions depend on PVCs and hostPath mounts |

2. Architecture Diagram

[Cluster metrics, logs, events]
  -> [Alloy metrics / logs / singleton collectors]
  -> [Prometheus remote write and Loki push on mutana.site]

[External OTLP and tracing clients]
  -> [Traefik IngressRoute]
  -> [Alloy receiver service]
  -> [Grafana Cloud Tempo]

3. Deployment Specifications

| Item | Value |
| --- | --- |
| Source path | k8s-monitoring/overlays/* |
| Deployment model | Helm-via-Kustomize plus Alloy operator CRD resources |
| Namespace | k8s-monitoring |
| Workload kind | Helm-managed Deployments, DaemonSets, StatefulSets, Services, and Alloy custom resources |
| Chart or image version | homelab: 3.0.2; jls: 3.7.2; local: uses the Grafana k8s-monitoring chart with the Alloy operator CRD |
| Config files | overlays//kustomization.yaml, overlays//values.yaml, optional alloy-receivers-ingressroute.yaml |

Cluster mapping

| Cluster | Overlay path | Notes |
| --- | --- | --- |
| homelab | k8s-monitoring/overlays/homelab | Chart 3.0.2, receiver disabled |
| local | k8s-monitoring/overlays/local | Mixed arm64 and amd64 nodes; alloy-metrics StatefulSet uses local-path WAL; alloy-logs and alloy-singleton run as DaemonSets on all nodes; alloy-receiver enabled and exposed via Traefik on oci-arm |
| jls | k8s-monitoring/overlays/jls | Chart 3.7.2, receiver enabled and exposed, jelastic-dynamic-volume for Alloy WAL PVCs |
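
The Helm-via-Kustomize model described above can be sketched as a kustomization.yaml for the homelab overlay. This is an illustrative sketch, not the committed file: the chart repository URL and the release name are assumptions (the release name is inferred from the k8s-monitoring-alloy-receiver Service name used elsewhere in this document), and the real overlay may carry additional resources or patches.

```yaml
# k8s-monitoring/overlays/homelab/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: k8s-monitoring

helmCharts:
  - name: k8s-monitoring
    repo: https://grafana.github.io/helm-charts   # assumed chart repository
    version: 3.0.2                                # homelab chart version from the table above
    releaseName: k8s-monitoring                   # assumed release name
    namespace: k8s-monitoring
    valuesFile: values.yaml                       # overlay-specific Helm values

# Receiver-enabled overlays would additionally list the ingress manifest:
# resources:
#   - alloy-receivers-ingressroute.yaml
```

Kustomize only inflates a helmCharts block when Helm support is switched on, for example with `kustomize build --enable-helm`.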

4. Configuration Guide

The values files define telemetry destinations, collector presets, storage, and optional receiver exposure. Configuration differs materially between overlays.

Environment variables

| Variable | Source | Purpose | Secret? |
| --- | --- | --- | --- |
| Generated by the Helm chart | chart values and templates | Collector runtime settings are rendered from values.yaml rather than declared manually in the repository | mixed |

ConfigMaps

| Resource | Path | Purpose |
| --- | --- | --- |
| Chart-generated Alloy configuration | k8s-monitoring/overlays/*/values.yaml | Collector pipelines, custom scrape jobs, and destination wiring are rendered from Helm values |

Secrets management

  • Secret names: authsecret for receiver basic auth on overlays that include alloy-receivers-ingressroute.yaml
  • Source of truth: the receiver auth secret is committed as Kubernetes Secret data inside alloy-receivers-ingressroute.yaml. Destination credentials are currently embedded inline in several values.yaml files.
  • Rotation trigger: credential rotation for Prometheus, Loki, Grafana Cloud Tempo, or receiver basic auth
  • Recovery note: treat the inline credential pattern as technical debt and migrate destination authentication to Secret references before broader rollout
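
The migration suggested in the recovery note could look roughly like the values.yaml fragment below, which points a destination at an existing Secret instead of carrying inline credentials. It is a sketch under assumptions: the destination name, URL, and Secret name are hypothetical placeholders, and the `auth` and `secret` keys follow the k8s-monitoring chart's destination schema as commonly documented, so they should be checked against the values reference for the chart version each overlay uses.

```yaml
# values.yaml fragment (sketch): destination auth read from a pre-created Secret
destinations:
  - name: prometheus                                      # hypothetical destination name
    type: prometheus
    url: https://prometheus.example.invalid/api/v1/write  # placeholder, not the real endpoint
    auth:
      type: basic
      usernameKey: username        # key inside the Secret that holds the username
      passwordKey: password        # key inside the Secret that holds the password
    secret:
      create: false                # the chart should not render the Secret itself
      name: prometheus-destination-auth   # hypothetical Secret name
      namespace: k8s-monitoring
```

With this shape the credential lives only in the cluster Secret, so rotating it does not require a change to values.yaml.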

Notable configuration facts from the current overlays:

  • local uses local-path for alloy-metrics WAL PVC; alloy-logs and alloy-singleton use hostPath at /var/alloy-log-storage on each node.
  • The local cluster is mixed-architecture (oci-arm and oci-arm-free1 are arm64; layer7-vps1 is amd64); no architecture-specific nodeSelector is set for DaemonSets or the receiver, so they run on all nodes.
  • alloy-metrics is a StatefulSet and will pin to the node where its local-path PVC was first provisioned; review the PVC's selected-node annotation if the pod cannot reschedule.
  • The ingress node for the local cluster is oci-arm (node.io/ingress=true, arm64); alloy-receiver is exposed through Traefik running on that node.
  • local and jls enable alloy-receiver and publish OTLP, Jaeger, and Zipkin ports.
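
As a rough illustration of what "enable alloy-receiver and publish OTLP, Jaeger, and Zipkin ports" may look like in Helm values, the fragment below uses the port numbers listed in the Access Protocols section. It assumes the receivers are configured through the chart's applicationObservability feature and that the collector accepts Alloy-style `extraPorts`; neither is confirmed by the overlays described here, so treat the key names as assumptions to verify against the chart version in use.

```yaml
# values.yaml fragment (sketch) for a receiver-enabled overlay
applicationObservability:
  enabled: true                    # assumption: receivers are driven by this feature
  receivers:
    otlp:
      grpc:
        enabled: true
        port: 4317
      http:
        enabled: true
        port: 4318
    jaeger:
      grpc:
        enabled: true
        port: 14250
    zipkin:
      enabled: true
      port: 9411

alloy-receiver:
  enabled: true
  alloy:
    extraPorts:                    # ports published on the receiver Service
      - name: otlp-grpc
        port: 4317
        targetPort: 4317
        protocol: TCP
      - name: otlp-http
        port: 4318
        targetPort: 4318
        protocol: TCP
```

The remaining Jaeger and Zipkin ports would be published with further `extraPorts` entries of the same shape.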

5. Access Protocols

| Path | URL or endpoint | Audience | Auth | TLS terminates at |
| --- | --- | --- | --- | --- |
| Internal | k8s-monitoring-alloy-receiver.k8s-monitoring.svc.cluster.local on ports 4318, 4317, 14250, 6832, 6831, 14268, and 9411 when receiver is enabled | Cluster workloads and operators | Service-specific or none inside the cluster | Service or ingress path dependent |
| External | otlp-http.mutana.site, otlp-grpc.mutana.site, jaeger-grpc.mutana.site, jaeger-binary.mutana.site, jaeger-compact.mutana.site, jaeger-http.mutana.site, zipkin.mutana.site on receiver-enabled overlays | Telemetry clients and operators | Traefik middleware chain plus basic auth middleware alloy-auth | Traefik |
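
The external path can be pictured with a minimal Traefik sketch for the OTLP HTTP endpoint. The middleware name alloy-auth, the secret name authsecret, the host, the Service name, and the port come from this document; the IngressRoute name, the websecure entry point, the `traefik.io/v1alpha1` API group (older Traefik releases use `traefik.containo.us/v1alpha1`), and the bare `tls` block are assumptions, and the committed alloy-receivers-ingressroute.yaml may chain additional middlewares.

```yaml
# Illustrative sketch, not the committed alloy-receivers-ingressroute.yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: alloy-auth
  namespace: k8s-monitoring
spec:
  basicAuth:
    secret: authsecret               # Secret holding the basic auth users
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: alloy-receiver-otlp-http     # hypothetical resource name
  namespace: k8s-monitoring
spec:
  entryPoints:
    - websecure                      # assumed TLS entry point name
  routes:
    - match: Host(`otlp-http.mutana.site`)
      kind: Rule
      middlewares:
        - name: alloy-auth
      services:
        - name: k8s-monitoring-alloy-receiver
          port: 4318                 # OTLP HTTP port on the receiver Service
  tls: {}                            # assumed: TLS terminates at Traefik
```

The gRPC endpoints would follow the same pattern, with the backend service typically marked `scheme: h2c` so Traefik speaks cleartext HTTP/2 to the receiver.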

6. Operations and Observability

  • Primary health indicators: collector pods Ready, remote write pipelines healthy, and no sustained backlog in Alloy WAL or log position storage.
  • Dashboards or alerts: Prometheus remote write status, Loki push success, and Alloy pod logs.
  • Log locations: alloy-metrics, alloy-logs, alloy-singleton, and alloy-receiver pod logs in k8s-monitoring.
  • Known failure modes: invalid destination credentials, receiver ingress misconfiguration, local-path PVC pending because a node has been lost or a PV is node-affined to the wrong node, and overlay drift between chart versions.

7. Backup and Recovery Notes

  • Backup method: configuration is Git-backed, but Alloy WAL and log-position state are best-effort and can be recreated with temporary telemetry loss.
  • Restore prerequisites: namespace k8s-monitoring, Alloy CRD registration, destination credentials, and the relevant storage class or hostPath mounts.
  • Related runbook: ../runbooks/k8s-monitoring.md

8. Release and Change Notes

  • Current deployed app version: homelab uses chart 3.0.2 and jls uses chart 3.7.2, matching the cluster mapping table; local runs the current overlay without a pinned chart version in the kustomization helmCharts block.
  • Current chart version: per overlay, as listed in the cluster mapping table.
  • Last significant change: consolidated former layer7, oci, and oci-free overlays into a single local overlay for the unified k3s cluster; renamed overlays/layer7 to overlays/local; removed amd64-specific nodeSelectors from collectors; set cluster.name to local.
  • Rollback reference: revert the affected overlay's values.yaml and kustomization.yaml to the previous revision, then rebuild the overlay with Kustomize with Helm chart inflation enabled (kustomize build --enable-helm).