k8s-monitoring Runbook
Metadata
| Field | Value |
|---|---|
| Service | k8s-monitoring |
| Criticality | Tier 1 |
| Owner | Platform / Observability owner |
| Namespace | k8s-monitoring |
| Clusters | homelab, local, jls |
| Last validated | 2026-04-17 |
| Related service page | ../services/k8s-monitoring.md |
Trigger Conditions
- Metrics, logs, or traces disappear from the shared observability backends.
- Alloy collectors restart repeatedly or accumulate remote write errors.
- OTLP, Jaeger, or Zipkin receiver endpoints stop accepting traffic.
- Helm or overlay upgrades change chart behavior between clusters.
1. Health Checks
Use these commands first to establish scope.
kubectl -n k8s-monitoring get deploy,statefulset,daemonset,pod,svc,pvc
kubectl -n k8s-monitoring get pods -o wide
kubectl -n k8s-monitoring get events --sort-by='.lastTimestamp' | tail -n 20
kubectl -n k8s-monitoring logs pod/<pod-name> --all-containers --tail=200
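To narrow the pod listing to the workloads that need attention, a small filter can help. This is a minimal sketch: the function name `unhealthy_pods` is hypothetical, and it assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints pods that are not Running, or are Running but not fully ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column looks like "1/2"
    if ($3 == "Completed") next      # finished jobs are not a problem
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}
```

Possible usage: `kubectl -n k8s-monitoring get pods --no-headers | unhealthy_pods`. An empty result suggests the problem is not at the pod-readiness level.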
Probe verification
Collector workloads are rendered by the Helm chart, so verify readiness on the specific Alloy pod that is failing.
kubectl -n k8s-monitoring describe pod <alloy-pod-name>
kubectl -n k8s-monitoring get svc k8s-monitoring-alloy-receiver
Record:
- whether the failing component is alloy-metrics, alloy-logs, alloy-singleton, or alloy-receiver
- whether the failure is storage, auth, destination reachability, or receiver ingress related
- whether only one cluster overlay is affected because of version skew
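When recording which component is failing, the collector can usually be read off the pod name. The sketch below assumes pod names contain the collector name (for example `k8s-monitoring-alloy-metrics-0`); the function name `alloy_component` is hypothetical.

```shell
# Hypothetical helper: map a pod name to the collector it belongs to.
# Assumes the collector name appears as a substring of the pod name.
alloy_component() {
  case "$1" in
    *alloy-metrics*)   echo alloy-metrics ;;
    *alloy-logs*)      echo alloy-logs ;;
    *alloy-singleton*) echo alloy-singleton ;;
    *alloy-receiver*)  echo alloy-receiver ;;
    *)                 echo unknown ;;
  esac
}
```

For example, `alloy_component k8s-monitoring-alloy-logs-x7k2p` would report `alloy-logs`, which can go straight into the incident notes.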
2. Troubleshooting Workflows
Remote write or push failures to Prometheus or Loki
kubectl -n k8s-monitoring logs pod/<alloy-pod-name> --all-containers --tail=400 | grep -Ei '401|403|429|5[0-9]{2}|remote_write|loki'
kubectl -n k8s-monitoring get configmap -o name | grep alloy
Check:
- destination URL is still reachable
- authentication values are valid for Prometheus, Loki, and Tempo
- whether errors are backpressure related (429 or 5xx) rather than auth related (401 or 403)
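To separate auth failures from backpressure quickly, the log output can be bucketed by HTTP status. This is a sketch under an assumption: it expects error lines to contain the text `HTTP status NNN`, which is a common form for remote write and push errors but may differ in your Alloy version. The function name `classify_push_errors` is hypothetical.

```shell
# Hypothetical helper: count push errors by class from log lines on stdin.
# Assumes errors contain "HTTP status NNN"; adjust the pattern if they do not.
classify_push_errors() {
  awk '
    match($0, /HTTP status [0-9][0-9][0-9]/) {
      code = substr($0, RSTART + 12, 3)      # the three digits after "HTTP status "
      if (code == "401" || code == "403") auth++
      else if (code == "429") backpressure++
      else if (code ~ /^5/) server++
    }
    END { printf "auth=%d backpressure=%d server=%d\n", auth, backpressure, server }
  '
}
```

Possible usage: `kubectl -n k8s-monitoring logs pod/<alloy-pod-name> --all-containers --tail=400 | classify_push_errors`. A non-zero `auth` count points at credentials; `backpressure` or `server` counts point at the destination.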
WAL or log-position storage failures
kubectl -n k8s-monitoring get pvc
kubectl -n k8s-monitoring describe pvc <pvc-name>
kubectl -n k8s-monitoring describe pod <alloy-pod-name>
Check:
- storageClass is present on the cluster
- hostPath /var/alloy-log-storage exists and is writable on the nodes running alloy-logs or alloy-singleton DaemonSet pods
- the WAL volume is not full and its PVC is not stuck in Pending
- for the local cluster, confirm that the alloy-metrics StatefulSet PVC is node-affined to the correct node, either via the volume.kubernetes.io/selected-node annotation or by local-path-provisioner auto-affinity
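The storage checks above can be partly scripted. The sketch below shows two hypothetical helpers: `unbound_pvcs` filters `kubectl get pvc --no-headers` output (assuming STATUS is the second column), and `check_writable_dir` is meant to be run on the node itself, for example against /var/alloy-log-storage.

```shell
# Hypothetical helper: print PVCs whose STATUS is not Bound.
# Reads `kubectl get pvc --no-headers` output on stdin.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1, $2 }'
}

# Hypothetical helper: verify a directory exists and is writable
# by the current user. Run this on the node, not from a workstation.
check_writable_dir() {
  dir=$1
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
  echo "ok: $dir"
}
```

Possible usage: `kubectl -n k8s-monitoring get pvc --no-headers | unbound_pvcs`, and on the affected node `check_writable_dir /var/alloy-log-storage`. The directory check reflects the permissions of the user running it, which may differ from the container's user.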
Receiver endpoint is exposed but clients cannot connect
kubectl -n k8s-monitoring get ingressroute,middleware,secret
kubectl -n traefik logs deploy/traefik --tail=200 | grep -Ei 'otlp|jaeger|zipkin|k8s-monitoring'
kubectl -n k8s-monitoring get svc k8s-monitoring-alloy-receiver -o yaml
Check:
- receiver is enabled in the active overlay
- alloy-receivers-ingressroute.yaml is actually included in the kustomization
- authsecret still exists and the Traefik middleware chain is valid
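Whether alloy-receivers-ingressroute.yaml is actually included can be checked directly against the overlay's kustomization file. This is a minimal sketch; the function name `kustomization_includes` is hypothetical, and it assumes the resource is listed by file name in the kustomization.yaml you point it at.

```shell
# Hypothetical helper: succeed if the kustomization file references the
# given resource on a line that is not commented out.
# $1: path to kustomization.yaml   $2: resource file name
kustomization_includes() {
  grep -v '^[[:space:]]*#' "$1" | grep -Fq -- "$2"
}
```

Possible usage: `kustomization_includes k8s-monitoring/overlays/<cluster>/kustomization.yaml alloy-receivers-ingressroute.yaml && echo included`. The check is textual, so a resource pulled in through a base or component would not be detected.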
Overlay drift or cluster-specific render problem
kubectl kustomize k8s-monitoring/overlays/<cluster> --enable-helm > /tmp/k8s-monitoring-rendered.yaml
grep -n 'alloy-receiver' /tmp/k8s-monitoring-rendered.yaml | head
Check:
- chart version and values shape are still compatible
- the intended collector set actually renders for the target cluster
- for the local overlay, verify no architecture-specific nodeSelector blocks DaemonSet pods on mixed-arch nodes
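To confirm that the intended collector set renders, it can help to summarize the rendered manifest by resource kind and compare clusters. The sketch below uses a hypothetical function `rendered_kinds` and only counts top-level `kind:` lines, so it is a coarse check rather than a full diff.

```shell
# Hypothetical helper: count top-level resource kinds in rendered YAML
# read from stdin. Indented "kind:" fields are ignored.
rendered_kinds() {
  awk '/^kind:/ { count[$2]++ } END { for (k in count) print k, count[k] }' | sort
}
```

Possible usage: `rendered_kinds < /tmp/k8s-monitoring-rendered.yaml`. Saving the summary for two overlays and running `diff` on the results shows quickly whether, say, a DaemonSet or StatefulSet is missing from one cluster.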
3. Disaster Recovery
Preconditions
- Confirm whether the priority is restoring telemetry ingestion or preserving short-lived collector state.
- Identify the active overlay and chart version for the affected cluster.
- Verify destination credentials and receiver basic auth secrets are available.
Stateful workload recovery
Most k8s-monitoring state is disposable. Recovery sequence:
- Reapply the affected overlay with Helm enabled.
- Recreate receiver auth secrets or ingress resources if exposure is required.
- Restore or recreate PVCs and hostPath paths only if WAL continuity is needed.
- Validate remote write, log push, and receiver health.
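Before reapplying an overlay during recovery, it is worth guarding against a mistyped cluster name. The sketch below assumes the overlay directories are named after the clusters listed in the metadata table (homelab, local, jls) under k8s-monitoring/overlays/; the function name `overlay_path` is hypothetical.

```shell
# Hypothetical helper: resolve a known cluster name to its overlay path.
# Assumes overlay directories match the cluster names in this runbook.
overlay_path() {
  case "$1" in
    homelab|local|jls) echo "k8s-monitoring/overlays/$1" ;;
    *) echo "unknown cluster: $1" >&2; return 1 ;;
  esac
}
```

Possible usage: `kubectl kustomize "$(overlay_path homelab)" --enable-helm | kubectl apply --dry-run=server -f -`. Whether a direct apply is appropriate, or the GitOps controller should reconcile instead, depends on how the cluster is managed.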
Cluster rebuild dependency order
- Storage classes and node hostPath prerequisites
- Traefik if receiver endpoints must be exposed externally
- Alloy operator CRD registration
- Helm-rendered k8s-monitoring workloads
- Remote write and tracing validation
4. Scaling and Resource Management
Preferred path: adjust the overlay values.yaml and redeploy through GitOps.
Use these commands to size the problem before changing resources:
kubectl -n k8s-monitoring top pod
kubectl -n k8s-monitoring get statefulset,daemonset,deploy
kubectl -n k8s-monitoring describe pod <alloy-pod-name>
Record:
- which collector is saturated
- whether receiver load is driving memory pressure on alloy-receiver
- whether the chart currently uses PVC-backed or hostPath-backed local state
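To identify which collector is saturated, the `kubectl top pod` output can be filtered by a memory threshold. This is a sketch only: the function name `pods_over_memory` and the threshold are hypothetical, and it assumes the NAME, CPU, MEMORY column order of `kubectl top pod --no-headers` with Ki, Mi, or Gi suffixes.

```shell
# Hypothetical helper: print pods using more than the given memory (in Mi).
# Reads `kubectl top pod --no-headers` output on stdin.
pods_over_memory() {
  awk -v limit="$1" '
    {
      mem = $3
      if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mib = mem * 1024 }
      else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem); mib = mem + 0 }
      else if (mem ~ /Ki$/) { sub(/Ki$/, "", mem); mib = mem / 1024 }
      else next                       # skip lines with an unrecognized unit
      if (mib > limit) print $1, mib "Mi"
    }
  '
}
```

Possible usage: `kubectl -n k8s-monitoring top pod --no-headers | pods_over_memory 512`. Compare the result against the limits in the overlay values.yaml before changing resources.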
Current guidance: scale collector resources through values.yaml and avoid ad hoc live patches that are not reconciled back into the overlay.
5. Maintenance Procedures
- Review destination credentials and move them out of inline values where possible.
- Upgrade chart versions cluster by cluster and compare rendered workloads.
- Validate receiver exposure and auth middleware after Traefik changes.
- Clear or resize WAL storage only after an explicit decision to accept telemetry loss.
For each task, define:
- Preconditions: target cluster overlay identified and destination credentials available
- Impact window: temporary telemetry loss or receiver unavailability
- Rollback path: restore previous values.yaml and kustomization.yaml revision
- Validation steps: confirm logs, metrics, and traces are flowing again
6. Rollback Strategy
The fastest safe rollback path is:
- Revert the affected values.yaml or kustomization.yaml revision.
- Restore the previous receiver ingress resources if exposure changed.
- If WAL corruption is suspected, accept telemetry loss and recreate local state after confirming the service can resume cleanly.
7. Post-Incident Actions
After recovery, always:
- Update CHANGELOG.md if credentials, receiver exposure, or chart versions were changed during the incident.
- Update the k8s-monitoring service page if overlay versions, receiver status, or backend destinations changed.
- Update this runbook if the incident exposed missing render, storage, or auth diagnostics.
- Capture follow-up work to externalize inline credentials and confirm the local overlay's node placement for the alloy-metrics StatefulSet.