rancher Runbook
Metadata
| Field | Value |
|---|---|
| Service | rancher |
| Criticality | Tier 1 |
| Owner | Platform / Cluster management owner |
| Namespace | cattle-system |
| Clusters | prod, oci |
| Last validated | 2026-05-20 |
| Related service page | ../services/rancher.md |
Trigger Conditions
- Rancher UI or API is unavailable.
- Managed clusters show as disconnected in Rancher.
- Fleet views or cluster inventories are stale.
- TLS or ingress issues block operator access.
1. Health Checks
kubectl -n cattle-system get pods,svc,ingressroute
kubectl -n cattle-system logs deploy/rancher --tail=200
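The checks above can be turned into pass/fail probes. The following is a minimal sketch, not a validated procedure: the function names are hypothetical, the URL passed to `rancher_ping_ok` is a placeholder for the hostname on the service page, and it assumes the Rancher server answers `pong` on `/ping`.

```shell
# Sketch only: function names are illustrative, not part of any existing tooling.

rancher_rollout_ok() {
  # Succeeds only if the rancher deployment finishes rolling out within the
  # timeout (default 60s); otherwise kubectl exits non-zero.
  kubectl -n cattle-system rollout status deploy/rancher --timeout="${1:-60s}"
}

rancher_ping_ok() {
  # $1 is the Rancher base URL (placeholder; take it from the service page).
  # Assumes the server replies with the literal body "pong" on /ping.
  local url="$1" body
  body="$(curl -fsS --max-time 10 "${url%/}/ping")" || return 1
  [ "$body" = "pong" ]
}
```

A possible invocation is `rancher_rollout_ok 120s && rancher_ping_ok "https://<rancher-hostname>"`. A failure of the first probe points at the pods; a failure of only the second points at ingress, DNS, or TLS.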
2. Troubleshooting Workflows
Check Rancher availability, certificates, and cluster-agent health.
kubectl -n cattle-system describe deploy rancher
kubectl get cluster.management.cattle.io
kubectl get pods -A | grep cattle
Focus on expired certificates, broken ingress, and degraded management-cluster connectivity.
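To narrow the pod listing to the entries that need attention, the output can be filtered. This is a sketch under stated assumptions: `unhealthy_cattle_pods` is a hypothetical helper, and it expects the default column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status).

```shell
unhealthy_cattle_pods() {
  # Reads `kubectl get pods -A --no-headers` output on stdin.
  # Keeps lines mentioning "cattle", skips Completed pods, and prints any pod
  # that is not Running or whose ready count is below its container count.
  awk '/cattle/ {
    if ($4 == "Completed") next
    split($3, ready, "/")
    if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2 " " $4 " " $3
  }'
}
```

For example, `kubectl get pods -A --no-headers | unhealthy_cattle_pods` prints one line per suspect pod as `namespace/name status ready`; empty output suggests the problem lies outside the pods, for instance in certificates or ingress.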
3. Disaster Recovery
- Restore TLS and bootstrap admin secrets.
- Restore management-cluster backup if required.
- Reconcile the active Rancher overlay.
- Validate login, managed cluster connectivity, and Fleet views.
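Before reconciling the overlay, it can help to confirm that the restored secrets are actually present. The sketch below uses a hypothetical helper, `missing_secrets`; the secret names in the usage example are assumed Rancher Helm chart defaults and may differ in this deployment, particularly for the TLS secret referenced by the IngressRoute.

```shell
missing_secrets() {
  # Prints each named secret that cannot be read from cattle-system.
  # Returns 1 if at least one is missing, 0 otherwise.
  local rc=0 name
  for name in "$@"; do
    if ! kubectl -n cattle-system get secret "$name" >/dev/null 2>&1; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}
```

An invocation might look like `missing_secrets tls-rancher-ingress bootstrap-secret`. Any name it prints should be restored before moving on to the overlay reconcile and the login validation.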
4. Scaling and Resource Management
Raise the Rancher CPU and memory requests and limits in Git when UI/API latency or large cluster inventories strain the deployment.
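One way to gather evidence before changing values in Git is to look for pods that were killed for exceeding memory. This is a sketch only: both function names are hypothetical, the `app=rancher` label selector is an assumption about how the pods are labeled, and only the first container of each pod is inspected.

```shell
rancher_restart_report() {
  # One tab-separated line per pod: name, restart count, last termination reason.
  # The app=rancher selector is an assumption; adjust it to the real pod labels.
  kubectl -n cattle-system get pods -l app=rancher \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
}

oom_killed_pods() {
  # Reads the report on stdin and prints pods whose last termination was OOMKilled.
  awk -F '\t' '$3 == "OOMKilled" { print $1 " restarts=" $2 }'
}
```

Running `rancher_restart_report | oom_killed_pods` lists pods whose last restart was an out-of-memory kill, which would support raising the memory limit rather than CPU.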
5. Maintenance Procedures
- Rotate admin and TLS materials.
- Revalidate ingress and DNS before certificate renewals.
- Plan chart upgrades during operator availability windows.
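To decide when TLS rotation is due, the remaining lifetime of the serving certificate can be checked with `openssl x509 -checkend`. The sketch below assumes `openssl` is installed on the operator machine; `cert_expires_within_days` is a hypothetical helper, and the secret name in the usage example is an assumed chart default.

```shell
cert_expires_within_days() {
  # Reads one PEM certificate on stdin.
  # Returns 0 if it expires within $1 days, 1 if it remains valid beyond that.
  # Unparseable input also returns 0, so a broken secret is flagged too.
  local days="$1"
  ! openssl x509 -noout -checkend "$((days * 86400))" >/dev/null 2>&1
}
```

A possible use, with `tls-rancher-ingress` as a placeholder secret name:
`kubectl -n cattle-system get secret tls-rancher-ingress -o jsonpath='{.data.tls\.crt}' | base64 -d | cert_expires_within_days 30 && echo "rotate soon"`.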
6. Rollback Strategy
- Revert the active overlay or chart values.
- Restore the previous management-plane backup if the upgrade corrupted state.
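After a revert has been reconciled, it is worth confirming that the deployment is really running the earlier release. A minimal sketch, with hypothetical function names; it assumes the image is referenced by tag rather than by digest and that Rancher is the first container in the pod template.

```shell
rancher_image() {
  # Prints the image of the first container in the rancher deployment.
  kubectl -n cattle-system get deploy rancher \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

assert_rancher_version() {
  # $1 is the image tag expected after the revert.
  # Compares it with everything after the last ":" in the running image.
  local want="$1" image
  image="$(rancher_image)" || return 1
  [ "${image##*:}" = "$want" ]
}
```

For instance, `assert_rancher_version "<previous-tag>" || echo "rollback not applied yet"`. If the tag matches but the UI still misbehaves, the state itself is likely affected and the backup restore becomes the next step.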
7. Post-Incident Actions
- Add a changelog fragment for recovery work.
- Update the service page if hostnames or deployment targets changed.
- Extend this runbook with the exact failure mode.