K8s Homelab Knowledge Base

rancher Runbook

Metadata

Field	Value
Service	rancher
Criticality	Tier 1
Owner	Platform / Cluster management owner
Namespace	cattle-system
Clusters	prod, oci
Last validated	2026-05-20
Related service page	../services/rancher.md

Trigger Conditions

Rancher UI or API is unavailable.
Managed clusters show disconnected.
Fleet views or cluster inventories are stale.
TLS or ingress issues block operator access.

1. Health Checks

kubectl -n cattle-system get pods,svc,ingressroute
kubectl -n cattle-system logs deploy/rancher --tail=200
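The pod listing above can be narrowed to the pods that need attention. The following is a minimal sketch, not part of the original runbook: `not_ready_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: print pods whose READY column (for example 0/1)
# shows fewer ready containers than expected.
# Completed pods are skipped because finished jobs legitimately report 0/1.
not_ready_pods() {
  awk '$3 != "Completed" { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage against the management cluster:
#   kubectl -n cattle-system get pods --no-headers | not_ready_pods
#   kubectl -n cattle-system rollout status deploy/rancher --timeout=60s
```

An empty result from the first command means every listed pod reports all containers ready.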

2. Troubleshooting Workflows

Check Rancher availability, certificates, and cluster-agent health.

kubectl -n cattle-system describe deploy rancher
kubectl get cluster.management.cattle.io
kubectl get pods -A | grep cattle

Focus on expired certificates, broken ingress, and degraded management-cluster connectivity.
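Certificate expiry can be checked directly from the ingress TLS secret. This is a sketch under stated assumptions: `cert_expires_within` is a hypothetical helper, and the secret name `tls-rancher-ingress` in the usage comment is an assumption that may not match this deployment.

```shell
# Hypothetical helper: succeed (exit 0) when the PEM certificate on stdin
# expires within the given number of days.
# Unreadable input also returns 0, so a broken secret is flagged rather
# than silently ignored.
cert_expires_within() {
  ! openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null 2>&1
}

# Usage (secret name is an assumption; adjust to the actual TLS secret):
#   kubectl -n cattle-system get secret tls-rancher-ingress \
#     -o jsonpath='{.data.tls\.crt}' | base64 -d \
#     | cert_expires_within 30 && echo "certificate expires within 30 days"
```

The helper relies on `openssl x509 -checkend`, which reports whether a certificate is still valid a given number of seconds from now.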

3. Disaster Recovery

Restore TLS and bootstrap admin secrets.
Restore management-cluster backup if required.
Reconcile the active Rancher overlay.
Validate login, managed cluster connectivity, and Fleet views.
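The final validation step can be partly scripted. The sketch below assumes that each `clusters.management.cattle.io` object exposes a `Ready` condition in its status; `clusters_not_ready` is a hypothetical helper name, and `<rancher-host>` is a placeholder.

```shell
# Hypothetical helper: read "<cluster-name> <Ready status>" rows on stdin
# and print clusters whose Ready condition is anything other than True,
# including clusters with no Ready condition at all.
clusters_not_ready() {
  awk 'NF && $2 != "True" { print $1 }'
}

# Usage against the management cluster:
#   kubectl get clusters.management.cattle.io \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
#     | clusters_not_ready
#
# A basic reachability probe (Rancher is expected to answer "pong"; verify
# for the deployed version):
#   curl -fsS https://<rancher-host>/ping
```

An empty result suggests all managed clusters have reconnected; login and Fleet views still need a manual check in the UI.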

4. Scaling and Resource Management

kubectl -n cattle-system top pod

Increase the Rancher deployment's CPU and memory allocation in Git when UI/API latency rises or large cluster inventories strain the pods.
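To decide whether a resource change is warranted, the `kubectl top` output can be filtered against a memory threshold. This is an illustrative sketch: `pods_over_memory` is a hypothetical helper, the 1500 Mi threshold is an arbitrary example value, and the parsing assumes memory is reported in Mi.

```shell
# Hypothetical helper: read `kubectl top pod --no-headers` rows
# (NAME CPU MEMORY) on stdin and print pods whose memory usage exceeds
# the threshold given in Mi. Rows not reported in Mi are ignored.
pods_over_memory() {
  awk -v limit="$1" '$3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit + 0) print $1, $3
  }'
}

# Usage (threshold is an example value):
#   kubectl -n cattle-system top pod --no-headers | pods_over_memory 1500
```

Pods that repeatedly appear in this output are candidates for a higher memory allocation in the Git-managed values.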

5. Maintenance Procedures

Rotate admin and TLS materials.
Revalidate ingress and DNS before certificate renewals.
Plan chart upgrades during operator availability windows.

6. Rollback Strategy

Revert the active overlay or chart values.
Restore the previous management-plane backup if the upgrade corrupted state.

7. Post-Incident Actions

Add a changelog fragment for recovery work.
Update the service page if hostnames or deployment targets changed.
Extend this runbook with the exact failure mode.