k8s-monitoring Runbook

Metadata

Field	Value
Service	k8s-monitoring
Criticality	Tier 1
Owner	Platform / Observability owner
Namespace	k8s-monitoring
Clusters	homelab, local, jls
Last validated	2026-04-17
Related service page	../services/k8s-monitoring.md

Trigger Conditions

Metrics, logs, or traces disappear from the shared observability backends.
Alloy collectors restart repeatedly or accumulate remote write errors.
OTLP, Jaeger, or Zipkin receiver endpoints stop accepting traffic.
Helm or overlay upgrades change chart behavior between clusters.

1. Health Checks

Use these commands first to establish scope.

kubectl -n k8s-monitoring get deploy,statefulset,daemonset,pod,svc,pvc
kubectl -n k8s-monitoring get pods -o wide
kubectl -n k8s-monitoring get events --sort-by='.lastTimestamp' | tail -n 20
kubectl -n k8s-monitoring logs pod/<pod-name> --all-containers --tail=200
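When the namespace holds many pods, it can help to narrow the listing to the ones that are not fully ready. The sketch below filters `kubectl get pods --no-headers` output read from stdin. The function name `not_ready_pods` is illustrative, and it assumes the default column order (NAME, READY, STATUS).

```shell
# Print pods that are not fully ready, given `kubectl get pods --no-headers` on stdin.
# Assumes the default table columns: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk '{
    if ($3 == "Completed") next          # finished Job pods are not a problem
    split($2, ready, "/")                # READY column, for example "1/2"
    if (ready[1] != ready[2] || $3 != "Running") print $1, $2, $3
  }'
}

# Usage (requires cluster access):
#   kubectl -n k8s-monitoring get pods --no-headers | not_ready_pods
```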

Probe verification

Collector workloads are rendered by the Helm chart, so verify readiness on the specific Alloy pod that is showing trouble rather than on a manifest in the repository.

kubectl -n k8s-monitoring describe pod <alloy-pod-name>
kubectl -n k8s-monitoring get svc k8s-monitoring-alloy-receiver

Record:

whether the failing component is alloy-metrics, alloy-logs, alloy-singleton, or alloy-receiver
whether the failure is storage, auth, destination reachability, or receiver ingress related
whether only one cluster overlay is affected because of version skew
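To record the failing component consistently, a small helper can map a pod name to one of the four collector names listed above. This is a sketch that assumes pod names contain the component name (as the `k8s-monitoring-alloy-receiver` service name suggests); `alloy_component` is a hypothetical helper, not part of the chart.

```shell
# Map a pod name to the Alloy collector it belongs to.
# Assumption: the pod name contains the component name as a substring.
alloy_component() {
  case "$1" in
    *alloy-metrics*)   echo alloy-metrics ;;
    *alloy-logs*)      echo alloy-logs ;;
    *alloy-singleton*) echo alloy-singleton ;;
    *alloy-receiver*)  echo alloy-receiver ;;
    *)                 echo unknown ;;
  esac
}

# Usage:
#   alloy_component "<alloy-pod-name>"
```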

2. Troubleshooting Workflows

Remote write or push failures to Prometheus or Loki

kubectl -n k8s-monitoring logs pod/<alloy-pod-name> --all-containers --tail=400 | grep -Ei '401|403|429|5[0-9]{2}|remote_write|loki'
kubectl -n k8s-monitoring get configmap -o name | grep alloy

Check:

destination URL is still reachable
authentication values are valid for Prometheus, Loki, and Tempo
errors are backpressure related (HTTP 429 or 5xx) rather than auth related (HTTP 401 or 403)
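Separating auth failures from backpressure is easier with a count per status class. The sketch below tallies HTTP status codes found in log lines on stdin. It assumes the code appears as `status=NNN` or `status NNN`, which may not match every Alloy log format; `classify_push_errors` is an illustrative name.

```shell
# Count push errors by class from log lines on stdin.
# Assumption: status codes appear as "status=429" or "HTTP status 429".
classify_push_errors() {
  grep -Eo 'status[= ]"?[0-9]{3}' |
    grep -Eo '[0-9]{3}$' |
    awk '
      $1 == 401 || $1 == 403 { auth++ }          # credentials rejected
      $1 == 429              { backpressure++ }  # destination is rate limiting
      $1 >= 500 && $1 < 600  { server++ }        # destination-side failure
      END { printf "auth=%d backpressure=%d server=%d\n", auth, backpressure, server }
    '
}

# Usage (requires cluster access):
#   kubectl -n k8s-monitoring logs pod/<alloy-pod-name> --all-containers --tail=400 | classify_push_errors
```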

WAL or log-position storage failures

kubectl -n k8s-monitoring get pvc
kubectl -n k8s-monitoring describe pvc <pvc-name>
kubectl -n k8s-monitoring describe pod <alloy-pod-name>

Check:

storageClass is present on the cluster
hostPath /var/alloy-log-storage exists and is writable on the nodes running alloy-logs or alloy-singleton DaemonSet pods
WAL volume is not full or stuck Pending
for the local cluster, confirm that the alloy-metrics StatefulSet PVC is pinned to the correct node, either through the volume.kubernetes.io/selected-node annotation or through local-path-provisioner's automatic node affinity
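The PVC checks above can be scripted roughly as follows. The first helper filters `kubectl get pvc --no-headers` output for claims that are not Bound; the second reads the selected-node annotation from a claim. Both function names are illustrative, and the namespace is taken from this runbook's metadata.

```shell
# List PVCs that are not Bound, given `kubectl get pvc --no-headers` on stdin.
# Assumes the default columns, where NAME and STATUS come first.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1, $2 }'
}

# Print the node recorded in the volume.kubernetes.io/selected-node annotation.
# Dots inside the annotation key are escaped for kubectl's JSONPath parser.
pvc_selected_node() {
  kubectl -n k8s-monitoring get pvc "$1" \
    -o jsonpath='{.metadata.annotations.volume\.kubernetes\.io/selected-node}'
}

# Usage (requires cluster access):
#   kubectl -n k8s-monitoring get pvc --no-headers | unbound_pvcs
#   pvc_selected_node <pvc-name>
```

An empty result from `pvc_selected_node` means the annotation is not set, which is expected until the scheduler has picked a node for a claim that uses delayed binding.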

Receiver endpoint is exposed but clients cannot connect

kubectl -n k8s-monitoring get ingressroute,middleware,secret
kubectl -n traefik logs deploy/traefik --tail=200 | grep -Ei 'otlp|jaeger|zipkin|k8s-monitoring'
kubectl -n k8s-monitoring get svc k8s-monitoring-alloy-receiver -o yaml

Check:

receiver is enabled in the active overlay
alloy-receivers-ingressroute.yaml is actually included in the kustomization
the receiver auth secret still exists and the Traefik middleware chain is valid
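Whether a resource file is actually listed in a kustomization, and not merely commented out, can be checked with a small helper. This sketch scans list entries line by line instead of parsing YAML, so it is only a rough check; `kustomization_includes` is a hypothetical name.

```shell
# Succeed if file $2 appears as an active list entry in kustomization file $1.
# Rough line-based check: commented-out entries are ignored.
kustomization_includes() {
  awk -v want="$2" -v q="'" '
    /^[[:space:]]*#/ { next }                       # skip commented-out lines
    {
      line = $0
      sub(/[[:space:]]*#.*$/, "", line)             # drop trailing comments
      sub(/^[[:space:]]*-[[:space:]]*/, "", line)   # drop the list marker
      gsub(/"/, "", line); gsub(q, "", line)        # drop quotes
      sub(/^\.\//, "", line)                        # drop a leading ./
      if (line == want) found = 1
    }
    END { exit found ? 0 : 1 }
  ' "$1"
}

# Usage (run from the repository root):
#   kustomization_includes k8s-monitoring/overlays/<cluster>/kustomization.yaml \
#     alloy-receivers-ingressroute.yaml && echo included
```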

Overlay drift or cluster-specific render problem

kubectl kustomize k8s-monitoring/overlays/<cluster> --enable-helm > /tmp/k8s-monitoring-rendered.yaml
grep -n 'alloy-receiver' /tmp/k8s-monitoring-rendered.yaml | head

Check:

chart version and values shape are still compatible
the intended collector set actually renders for the target cluster
for the local overlay, verify no architecture-specific nodeSelector blocks DaemonSet pods on mixed-arch nodes
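To confirm that the intended collector set renders, the workload kinds and names can be extracted from the rendered manifest and compared between clusters. The sketch below is a line-based scan, not a YAML parser. It assumes `kind:` precedes `metadata:` in each document, which is how `kubectl kustomize` normally orders its output. `rendered_workloads` is an illustrative name.

```shell
# Print Kind/name for each DaemonSet, StatefulSet, and Deployment in rendered YAML on stdin.
# Assumption: top-level keys appear in the order apiVersion, kind, metadata.
rendered_workloads() {
  awk '
    /^---/          { kind = ""; in_meta = 0; next }   # new YAML document
    /^kind:/        { kind = $2 }
    /^metadata:/    { in_meta = 1; next }
    /^[^[:space:]]/ { in_meta = 0 }                    # any other top-level key ends metadata
    in_meta && /^  name:/ && kind ~ /^(DaemonSet|StatefulSet|Deployment)$/ {
      print kind "/" $2
      in_meta = 0
    }
  '
}

# Usage:
#   rendered_workloads < /tmp/k8s-monitoring-rendered.yaml
```

Running this against two overlays and diffing the output gives a quick view of which collectors differ between clusters.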

3. Disaster Recovery

Preconditions

Confirm whether the priority is restoring telemetry ingestion or preserving short-lived collector state.
Identify the active overlay and chart version for the affected cluster.
Verify destination credentials and receiver basic auth secrets are available.

Stateful workload recovery

Most k8s-monitoring state is disposable. Recovery sequence:

Reapply the affected overlay with Helm enabled.
Recreate receiver auth secrets or ingress resources if exposure is required.
Restore or recreate PVCs and hostPath paths only if WAL continuity is needed.
Validate remote write, log push, and receiver health.
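The first step of the sequence can be sketched as a render-and-apply pipeline, assuming a manual apply is acceptable during recovery and the overlay path matches the one used in the render check. `reapply_overlay` and the `DRY_RUN` variable are hypothetical. When `DRY_RUN` is set, the apply is submitted as a server-side dry run.

```shell
# Render an overlay with Helm enabled and apply it to the current kubectl context.
# Set DRY_RUN to any non-empty value to validate without persisting changes.
reapply_overlay() {
  kubectl kustomize "k8s-monitoring/overlays/$1" --enable-helm |
    kubectl apply -f - ${DRY_RUN:+--dry-run=server}
}

# Usage (run from the repository root, with the target cluster's context selected):
#   DRY_RUN=1 reapply_overlay <cluster>
#   reapply_overlay <cluster>
```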

Cluster rebuild dependency order

Storage classes and node hostPath prerequisites
Traefik if receiver endpoints must be exposed externally
Alloy operator CRD registration
Helm-rendered k8s-monitoring workloads
Remote write and tracing validation

4. Scaling and Resource Management

Preferred path: adjust the overlay values.yaml and redeploy through GitOps.

Use these commands to size the problem before changing resources:

kubectl -n k8s-monitoring top pod
kubectl -n k8s-monitoring get statefulset,daemonset,deploy
kubectl -n k8s-monitoring describe pod <alloy-pod-name>

Record:

which collector is saturated
whether receiver load is driving memory pressure on alloy-receiver
whether the chart currently uses PVC-backed or hostPath-backed local state
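To see which collector is saturated, the `kubectl top pod` output can be ranked by memory. The sketch below reads `kubectl top pod --no-headers` output from stdin and assumes the default columns (NAME, CPU, MEMORY) with memory in Ki, Mi, or Gi. `top_by_memory` is an illustrative name.

```shell
# Rank pods by memory (in Mi), given `kubectl top pod --no-headers` on stdin.
# Optional first argument: number of rows to print (default 5).
top_by_memory() {
  awk '{
    value = $3 + 0                       # numeric prefix of the MEMORY column
    unit = $3; sub(/^[0-9.]+/, "", unit) # remaining unit suffix
    if (unit == "Ki") value /= 1024
    else if (unit == "Gi") value *= 1024
    printf "%d %s\n", value, $1
  }' | sort -rn | head -n "${1:-5}"
}

# Usage (requires metrics-server and cluster access):
#   kubectl -n k8s-monitoring top pod --no-headers | top_by_memory 3
```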

Current guidance: scale collector resources through values.yaml and avoid ad hoc live patches that are not reconciled back into the overlay.

5. Maintenance Procedures

Review destination credentials and move them out of inline values where possible.
Upgrade chart versions cluster by cluster and compare rendered workloads.
Validate receiver exposure and auth middleware after Traefik changes.
Clear or resize WAL storage only after an explicit decision to accept telemetry loss.

For each task, define:

Preconditions: target cluster overlay identified and destination credentials available
Impact window: temporary telemetry loss or receiver unavailability
Rollback path: restore previous values.yaml and kustomization.yaml revision
Validation steps: confirm logs, metrics, and traces are flowing again

6. Rollback Strategy

Document the fastest safe rollback path:

Revert the affected values.yaml or kustomization.yaml revision.
Restore the previous receiver ingress resources if exposure changed.
If WAL corruption is suspected, accept telemetry loss and recreate local state after confirming the service can resume cleanly.

7. Post-Incident Actions

After recovery, always:

Update CHANGELOG.md if credentials, receiver exposure, or chart versions were changed during the incident.
Update the k8s-monitoring service page if overlay versions, receiver status, or backend destinations changed.
Update this runbook if the incident exposed missing render, storage, or auth diagnostics.
Capture follow-up work to externalize inline credentials and confirm the local overlay's node placement for the alloy-metrics StatefulSet.