k8s-monitoring Runbook

Metadata

Field                  Value
Service                k8s-monitoring
Criticality            Tier 1
Owner                  Platform / Observability owner
Namespace              k8s-monitoring
Clusters               homelab, local, jls
Last validated         2026-04-17
Related service page   ../services/k8s-monitoring.md

Trigger Conditions

  • Metrics, logs, or traces disappear from the shared observability backends.
  • Alloy collectors restart repeatedly or accumulate remote write errors.
  • OTLP, Jaeger, or Zipkin receiver endpoints stop accepting traffic.
  • Helm or overlay upgrades change chart behavior between clusters.

1. Health Checks

Use these commands first to establish scope.

kubectl -n k8s-monitoring get deploy,statefulset,daemonset,pod,svc,pvc
kubectl -n k8s-monitoring get pods -o wide
kubectl -n k8s-monitoring get events --sort-by='.lastTimestamp' | tail -n 20
kubectl -n k8s-monitoring logs pod/<pod-name> --all-containers --tail=200
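
To narrow scope quickly, the pod listing can be filtered down to the pods that look unhealthy. The sketch below is one way to do that, assuming the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE). `unhealthy_pods` is a hypothetical helper name, not part of kubectl or the chart.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints pods that
# are not fully ready, not Running, or have restarted at least once.
unhealthy_pods() {
  awk '{
    if ($3 == "Completed") next          # finished Jobs are not failures
    split($2, ready, "/")                # READY column, e.g. "1/2"
    if (ready[1] != ready[2] || $3 != "Running" || $4 + 0 > 0) {
      printf "%s ready=%s status=%s restarts=%s\n", $1, $2, $3, $4
    }
  }'
}

# Usage against the live cluster:
#   kubectl -n k8s-monitoring get pods --no-headers | unhealthy_pods
```

A pod with any historical restart is reported even if it is currently Running, which is deliberate here because repeated collector restarts are one of the trigger conditions.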

Probe verification

Collector workloads are rendered by the Helm chart, so verify readiness on the specific Alloy pod that is showing trouble.

kubectl -n k8s-monitoring describe pod <alloy-pod-name>
kubectl -n k8s-monitoring get svc k8s-monitoring-alloy-receiver

Record:

  • whether the failing component is alloy-metrics, alloy-logs, alloy-singleton, or alloy-receiver
  • whether the failure is storage, auth, destination reachability, or receiver ingress related
  • whether only one cluster overlay is affected because of version skew
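
When several pods are failing at once, it can help to map each pod name back to its collector before recording findings. This is a minimal sketch that assumes pod names contain the collector name (for example a pod whose name includes `alloy-receiver`); `component_of` is a hypothetical helper.

```shell
# Maps a pod name to the collector it belongs to, based on a substring match.
component_of() {
  case "$1" in
    *alloy-metrics*)   echo alloy-metrics ;;
    *alloy-logs*)      echo alloy-logs ;;
    *alloy-singleton*) echo alloy-singleton ;;
    *alloy-receiver*)  echo alloy-receiver ;;
    *)                 echo unknown ;;
  esac
}

# Usage:
#   component_of <alloy-pod-name>
```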

2. Troubleshooting Workflows

Remote write or push failures to Prometheus or Loki

kubectl -n k8s-monitoring logs pod/<alloy-pod-name> --all-containers --tail=400 | grep -Ei '401|403|429|5[0-9]{2}|remote_write|loki'
kubectl -n k8s-monitoring get configmap -o name | grep alloy

Check:

  • destination URL is still reachable
  • authentication values are valid for Prometheus, Loki, and Tempo
  • whether errors are backpressure related (429 or 5xx responses) or auth related (401 or 403 responses)
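
The auth versus backpressure distinction above can be made roughly by counting HTTP status codes in the collector logs. The sketch below is a heuristic under the assumption that status codes appear as standalone three-digit numbers in the log lines; it can miscount when other numbers (such as millisecond timestamps) look like status codes. `classify_push_errors` is a hypothetical helper name.

```shell
# Reads log lines on stdin and counts lines that mention auth failures
# (401/403), throttling (429), or server-side errors (5xx).
classify_push_errors() {
  awk '
    /(^|[^0-9])(401|403)([^0-9]|$)/    { auth++ }
    /(^|[^0-9])429([^0-9]|$)/          { throttled++ }
    /(^|[^0-9])5[0-9][0-9]([^0-9]|$)/  { server++ }
    END { printf "auth=%d throttled=%d server=%d\n", auth, throttled, server }
  '
}

# Usage:
#   kubectl -n k8s-monitoring logs pod/<alloy-pod-name> --all-containers --tail=400 \
#     | classify_push_errors
```

A result dominated by `auth` points at credentials; one dominated by `throttled` or `server` points at the destination or at backpressure.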

WAL or log-position storage failures

kubectl -n k8s-monitoring get pvc
kubectl -n k8s-monitoring describe pvc <pvc-name>
kubectl -n k8s-monitoring describe pod <alloy-pod-name>

Check:

  • storageClass is present on the cluster
  • hostPath /var/alloy-log-storage exists and is writable on the nodes running alloy-logs or alloy-singleton DaemonSet pods
  • WAL volume is not full and its PVC is not stuck in Pending
  • for the local cluster, the alloy-metrics StatefulSet PVC is bound to the correct node, either through the volume.kubernetes.io/selected-node annotation on the PVC or through local-path-provisioner auto-affinity
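
The PVC checks above can be partly scripted. This sketch lists claims that are not Bound, assuming the default `kubectl get pvc --no-headers` layout where STATUS is the second column; `unbound_pvcs` is a hypothetical helper. The commented jsonpath query reads the selected-node annotation, with the dots in the annotation key escaped.

```shell
# Reads `kubectl get pvc --no-headers` output on stdin and prints every
# claim whose STATUS column is anything other than Bound.
unbound_pvcs() {
  awk '$2 != "Bound" { printf "%s status=%s\n", $1, $2 }'
}

# Usage:
#   kubectl -n k8s-monitoring get pvc --no-headers | unbound_pvcs
#
# Node pinning for a single claim (prints nothing if the annotation is unset):
#   kubectl -n k8s-monitoring get pvc <pvc-name> \
#     -o jsonpath='{.metadata.annotations.volume\.kubernetes\.io/selected-node}'
```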

Receiver endpoint is exposed but clients cannot connect

kubectl -n k8s-monitoring get ingressroute,middleware,secret
kubectl -n traefik logs deploy/traefik --tail=200 | grep -Ei 'otlp|jaeger|zipkin|k8s-monitoring'
kubectl -n k8s-monitoring get svc k8s-monitoring-alloy-receiver -o yaml

Check:

  • receiver is enabled in the active overlay
  • alloy-receivers-ingressroute.yaml is actually included in the kustomization
  • the receiver auth secret still exists and the Traefik middleware chain is valid
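
Whether the IngressRoute manifest is really part of the overlay can be checked from the repository without touching the cluster. The sketch below assumes the overlay's kustomization.yaml lists resources as YAML list items and lives under k8s-monitoring/overlays/<cluster>/; `kustomization_includes` is a hypothetical helper.

```shell
# Succeeds if the kustomization file ($1) lists the resource ($2) as an
# active list item. Commented-out entries start with "#" and are ignored.
kustomization_includes() {
  grep -E '^[[:space:]]*-' "$1" | grep -Fq -- "$2"
}

# Usage:
#   kustomization_includes k8s-monitoring/overlays/<cluster>/kustomization.yaml \
#     alloy-receivers-ingressroute.yaml && echo included || echo missing
```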

Overlay drift or cluster-specific render problem

kubectl kustomize k8s-monitoring/overlays/<cluster> --enable-helm > /tmp/k8s-monitoring-rendered.yaml
grep -n 'alloy-receiver' /tmp/k8s-monitoring-rendered.yaml | head

Check:

  • chart version and values shape are still compatible
  • the intended collector set actually renders for the target cluster
  • for the local overlay, verify no architecture-specific nodeSelector blocks DaemonSet pods on mixed-arch nodes
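
To see whether the intended collector set renders, a per-kind count of the rendered manifest gives a quick comparison point between clusters. This is a rough sketch that only counts top-level `kind:` lines and assumes the rendered file from the command above; `kind_summary` is a hypothetical helper.

```shell
# Prints "<Kind>=<count>" for every top-level `kind:` line in a rendered
# manifest file ($1). Indented `kind:` fields (e.g. in roleRef) are ignored.
kind_summary() {
  grep -E '^kind: ' "$1" | sort | uniq -c | awk '{ print $3 "=" $1 }'
}

# Usage: render two overlays and compare their summaries.
#   kubectl kustomize k8s-monitoring/overlays/homelab --enable-helm > /tmp/homelab.yaml
#   kubectl kustomize k8s-monitoring/overlays/local --enable-helm > /tmp/local.yaml
#   kind_summary /tmp/homelab.yaml > /tmp/homelab.kinds
#   kind_summary /tmp/local.yaml > /tmp/local.kinds
#   diff /tmp/homelab.kinds /tmp/local.kinds
#
# Architecture pinning in the local overlay:
#   grep -n -A3 'nodeSelector' /tmp/local.yaml
```

A missing DaemonSet or StatefulSet in one summary is a sign that a collector was disabled or failed to render for that cluster.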

3. Disaster Recovery

Preconditions

  • Confirm whether the priority is restoring telemetry ingestion or preserving short-lived collector state.
  • Identify the active overlay and chart version for the affected cluster.
  • Verify destination credentials and receiver basic auth secrets are available.

Stateful workload recovery

Most k8s-monitoring state is disposable. Recovery sequence:

  1. Reapply the affected overlay with Helm enabled.
  2. Recreate receiver auth secrets or ingress resources if exposure is required.
  3. Restore or recreate PVCs and hostPath paths only if WAL continuity is needed.
  4. Validate remote write, log push, and receiver health.
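
The recovery sequence above can be sketched as a small script. This is an illustration rather than the established procedure: it assumes the k8s-monitoring/overlays/<cluster> repository layout, applies directly with kubectl (which a GitOps reconciler may later overwrite), and leaves steps 2 and 3 as manual decisions. `run` and `recover_overlay` are hypothetical helper names.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_overlay() {
  overlay="k8s-monitoring/overlays/$1"   # assumed repository layout

  # Step 1: reapply the overlay with Helm enabled.
  run sh -c "kubectl kustomize '$overlay' --enable-helm | kubectl apply -f -"

  # Steps 2 and 3 (receiver secrets, ingress resources, PVCs, hostPath paths)
  # are manual and only needed when exposure or WAL continuity is required.

  # Step 4: wait for pods to become Ready, then list them for a final check.
  run kubectl -n k8s-monitoring wait --for=condition=Ready pod --all --timeout=300s
  run kubectl -n k8s-monitoring get pods -o wide
}

# Preview the commands without touching the cluster:
#   DRY_RUN=1 recover_overlay local
```

Pod readiness is only a first signal; remote write, log push, and receiver health still need to be confirmed against the backends.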

Cluster rebuild dependency order

  1. Storage classes and node hostPath prerequisites
  2. Traefik if receiver endpoints must be exposed externally
  3. Alloy operator CRD registration
  4. Helm-rendered k8s-monitoring workloads
  5. Remote write and tracing validation
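
Before rendering the workloads on a rebuilt cluster, the first items in this order can be checked with a short preflight. The sketch below assumes that any CRD with "alloy" in its name satisfies the Alloy operator step and that at least one storage class is enough for step 1; both are simplifications. `preflight` and the `KUBECTL` override are hypothetical.

```shell
# Allow the kubectl binary to be overridden (e.g. for a wrapper or a stub).
KUBECTL=${KUBECTL:-kubectl}

preflight() {
  # 1. At least one storage class must exist.
  if [ -z "$($KUBECTL get storageclass -o name)" ]; then
    echo "missing: storage class"
    return 1
  fi

  # 2. Traefik is only required when receivers are exposed externally:
  #      kubectl -n traefik get deploy traefik

  # 3. An Alloy CRD must be registered before the workloads render.
  if ! $KUBECTL get crd -o name | grep -qi 'alloy'; then
    echo "missing: Alloy CRD"
    return 1
  fi

  echo "prerequisites present"
}

# Usage:
#   preflight && echo "safe to render k8s-monitoring workloads"
```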

4. Scaling and Resource Management

Preferred path: adjust the overlay values.yaml and redeploy through GitOps.

Use these commands to size the problem before changing resources:

kubectl -n k8s-monitoring top pod
kubectl -n k8s-monitoring get statefulset,daemonset,deploy
kubectl -n k8s-monitoring describe pod <alloy-pod-name>

Record:

  • which collector is saturated
  • whether receiver load is driving memory pressure on alloy-receiver
  • whether the chart currently uses PVC-backed or hostPath-backed local state
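
To identify the saturated collector, the `kubectl top pod` output can be filtered against a memory threshold. This sketch assumes the default column layout (NAME, CPU, MEMORY) and memory values reported in Mi or Gi; lines with other units are skipped. `over_memory` is a hypothetical helper and the threshold in the usage line is only an example.

```shell
# Reads `kubectl top pod --no-headers` output on stdin and prints pods whose
# memory usage exceeds the threshold ($1, in Mi).
over_memory() {
  awk -v limit="$1" '{
    mem = $3
    if (mem ~ /Gi$/) {
      sub(/Gi$/, "", mem); mem = mem * 1024   # convert Gi to Mi
    } else if (mem ~ /Mi$/) {
      sub(/Mi$/, "", mem)
    } else {
      next                                    # unknown unit, skip the line
    }
    if (mem + 0 > limit + 0) {
      printf "%s %dMi\n", $1, mem
    }
  }'
}

# Usage (512 is an example threshold, not a recommended limit):
#   kubectl -n k8s-monitoring top pod --no-headers | over_memory 512
```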

Current guidance: scale collector resources through values.yaml and avoid ad hoc live patches that are not reconciled back into the overlay.

5. Maintenance Procedures

  • Review destination credentials and move them out of inline values where possible.
  • Upgrade chart versions cluster by cluster and compare rendered workloads.
  • Validate receiver exposure and auth middleware after Traefik changes.
  • Clear or resize WAL storage only after an explicit decision to accept telemetry loss.

For each task, define:

  • Preconditions: target cluster overlay identified and destination credentials available
  • Impact window: temporary telemetry loss or receiver unavailability
  • Rollback path: restore previous values.yaml and kustomization.yaml revision
  • Validation steps: confirm logs, metrics, and traces are flowing again

6. Rollback Strategy

Fastest safe rollback path:

  • Revert the affected values.yaml or kustomization.yaml revision.
  • Restore the previous receiver ingress resources if exposure changed.
  • If WAL corruption is suspected, accept telemetry loss and recreate local state after confirming the service can resume cleanly.

7. Post-Incident Actions

After recovery, always:

  1. Update CHANGELOG.md if credentials, receiver exposure, or chart versions were changed during the incident.
  2. Update the k8s-monitoring service page if overlay versions, receiver status, or backend destinations changed.
  3. Update this runbook if the incident exposed missing render, storage, or auth diagnostics.
  4. Capture follow-up work to externalize inline credentials and confirm the local overlay's node placement for the alloy-metrics StatefulSet.