
rancher Runbook

Metadata

Field                 Value
Service               rancher
Criticality           Tier 1
Owner                 Platform / Cluster management owner
Namespace             cattle-system
Clusters              prod, oci
Last validated        2026-05-20
Related service page  ../services/rancher.md

Trigger Conditions

  • Rancher UI or API is unavailable.
  • Managed clusters show as disconnected.
  • Fleet views or cluster inventories are stale.
  • TLS or ingress issues block operator access.

1. Health Checks

kubectl -n cattle-system get pods,svc,ingressroute
kubectl -n cattle-system logs deploy/rancher --tail=200
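
The pod listing above can be narrowed to only the pods that need attention. The following is a minimal sketch, not part of the original procedure: the function name `unhealthy_rancher_pods` is hypothetical, and it assumes `kubectl` already targets the management cluster and that the default `get pods` column order (NAME, READY, STATUS) applies.

```shell
# Sketch: list cattle-system pods that are not healthy.
# A pod is reported when its STATUS is neither Running nor Completed,
# or when it is Running but not all of its containers are ready.
unhealthy_rancher_pods() {
  kubectl -n cattle-system get pods --no-headers \
    | awk '{
        split($2, ready, "/")   # READY column, e.g. "0/1"
        if ($3 != "Completed" && ($3 != "Running" || ready[1] != ready[2]))
          print $1, $2, $3
      }'
}
```

Calling `unhealthy_rancher_pods` with no arguments prints one line per problem pod; empty output suggests every pod is either fully ready or completed.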

2. Troubleshooting Workflows

Check Rancher availability, certificates, and cluster-agent health.

kubectl -n cattle-system describe deploy rancher
kubectl get cluster.management.cattle.io
kubectl get pods -A | grep cattle

Focus on expired certificates, broken ingress, and degraded management-cluster connectivity.
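
To confirm or rule out an expired certificate, the certificate served by the ingress can be inspected with `openssl`. This is a sketch under assumptions: the function name `cert_expiry_status` is hypothetical, the hostname is passed in as an argument because this runbook does not state the Rancher hostname, and port 443 is assumed.

```shell
# Sketch: report whether the certificate served at HOST:443 expires within DAYS days.
# HOST is a placeholder argument; DAYS defaults to 30.
cert_expiry_status() {
  host="$1"
  days="${2:-30}"
  # Fetch the leaf certificate presented for this hostname (SNI) and keep it as PEM
  pem="$(openssl s_client -connect "${host}:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 2>/dev/null)" || { echo "unreachable"; return 2; }
  # -checkend exits non-zero when the certificate expires within the given seconds
  if printf '%s\n' "$pem" | openssl x509 -noout -checkend "$((days * 86400))" >/dev/null; then
    echo "ok"
  else
    echo "expiring"
    return 1
  fi
}
```

For example, `cert_expiry_status rancher.example.test 14` (hostname hypothetical) prints `ok`, `expiring`, or `unreachable`. The same check is useful before the certificate renewals listed under Maintenance Procedures.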

3. Disaster Recovery

  1. Restore TLS and bootstrap admin secrets.
  2. Restore management-cluster backup if required.
  3. Reconcile the active Rancher overlay.
  4. Validate login, managed cluster connectivity, and Fleet views.
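
Part of the validation in step 4 can be scripted. The sketch below covers only API reachability and managed-cluster readiness; login and Fleet views still need a manual check. The function names are hypothetical, the URL is a placeholder argument, and the script assumes the Rancher server answers `GET /ping` with `pong` and that each `clusters.management.cattle.io` object exposes a `Ready` condition.

```shell
# Sketch: succeed only if the Rancher server answers /ping with "pong".
# The base URL is a placeholder argument, e.g. https://rancher.example.test
rancher_ping_ok() {
  body="$(curl -fsS --max-time 10 "$1/ping")" || return 1
  [ "$body" = "pong" ]
}

# Sketch: print id, display name, and Ready status for clusters that are not Ready.
# A cluster with no Ready condition at all is reported as Unknown.
clusters_not_ready() {
  kubectl get clusters.management.cattle.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.displayName}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | awk -F '\t' 'NF && $3 != "True" { print $1, $2, ($3 == "" ? "Unknown" : $3) }'
}
```

A recovery can be treated as provisionally healthy when `rancher_ping_ok` succeeds and `clusters_not_ready` prints nothing. If the ingress serves a certificate the client does not trust, `curl` fails here, which is itself a useful signal for step 1.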

4. Scaling and Resource Management

kubectl -n cattle-system top pod

Increase Rancher resources in Git when UI/API latency or large-cluster inventories strain the deployment.
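
Before changing values in Git, it helps to compare live usage from `kubectl top` with what the deployment currently requests. A minimal sketch, assuming the Rancher container is the first container in the pod template; the function name `rancher_resources` is hypothetical.

```shell
# Sketch: print the requests/limits currently set on the rancher deployment.
# Assumes the rancher container is containers[0] in the pod template.
rancher_resources() {
  kubectl -n cattle-system get deploy rancher \
    -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
}
```

An empty object in the output indicates that no requests or limits are set, in which case the figures from `kubectl top` are the only baseline for sizing.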

5. Maintenance Procedures

  • Rotate admin and TLS materials.
  • Revalidate ingress and DNS before certificate renewals.
  • Plan chart upgrades during operator availability windows.

6. Rollback Strategy

  • Revert the active overlay or chart values.
  • Restore the previous management-plane backup if the upgrade corrupted state.

7. Post-Incident Actions

  1. Add a changelog fragment for recovery work.
  2. Update the service page if hostnames or deployment targets changed.
  3. Extend this runbook with the exact failure mode.