fleet Runbook

Metadata

Field                  Value
Service                fleet
Criticality            Tier 0
Owner                  Platform / GitOps owner
Namespace              fleet-default
Clusters               homelab, local, jls, ozirke01
Last validated         2026-04-22
Related service page   ../services/fleet.md

Trigger Conditions

  • GitRepo reports NotReady or stalled bundle generation.
  • Expected manifest changes merged to main do not reach a target cluster.
  • BundleDeployment objects remain failed after a repository change.
  • Emergency drift reconciliation is required after a manual intervention.

1. Health Checks

Use these commands first to establish scope.

kubectl -n fleet-default get gitrepos,bundles,bundledeployments
kubectl -n fleet-default describe gitrepo k8s-apps
kubectl -n fleet-default describe gitrepo ozilab-apps
kubectl -n cattle-fleet-system logs deploy/fleet-controller --tail=200

Probe verification

Fleet bootstrap in this repository is expressed as GitRepo resources rather than a service with probes. Validate controller health indirectly by checking:

  • GitRepo Ready conditions
  • Recent Bundle and BundleDeployment updates
  • Fleet controller logs for git authentication, path, or cluster-target errors
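The Ready check in the list above can be scripted. This is a minimal sketch, assuming the GitRepo names used elsewhere in this runbook and that each GitRepo reports a Ready entry under .status.conditions; gitrepo_ready and check_gitrepos are hypothetical helper names, not part of Fleet or kubectl.

```shell
# Hypothetical helper: print the Ready condition status of one GitRepo.
# Assumes the fleet-default namespace used throughout this runbook.
gitrepo_ready() {
  kubectl -n fleet-default get gitrepo "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Report every GitRepo passed as an argument.
# Returns non-zero if any of them is not Ready.
check_gitrepos() {
  local rc=0 name state
  for name in "$@"; do
    state=$(gitrepo_ready "$name")
    echo "$name Ready=${state:-unknown}"
    [ "$state" = "True" ] || rc=1
  done
  return "$rc"
}
```

Example call: check_gitrepos k8s-apps ozilab-apps. A non-zero exit code only narrows the scope; the describe output and controller logs still carry the actual error message.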

2. Troubleshooting Workflows

GitRepo NotReady or authentication failures

kubectl -n fleet-default describe gitrepo k8s-apps
kubectl -n fleet-default describe gitrepo ozilab-apps
kubectl -n fleet-default describe gitrepo jls-apps 2>/dev/null || echo 'no separate jls gitrepo'
kubectl -n fleet-default get secret gitrepo-auth -o yaml

Check:

  • gitrepo-auth exists in the correct namespace (fleet-default)
  • repository URL and branch are still valid
  • target cluster selector or clusterName still matches Fleet inventory (homelab, local, jls)
  • special bootstraps such as ozilab still reference the intended repository, branch, and clientSecretName
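The first, second, and fourth checks can be done without printing the secret contents. The sketch below reads spec.repo, spec.branch, and spec.clientSecretName from the GitRepo and confirms only that the referenced secret exists; the helper names are hypothetical.

```shell
# Hypothetical helper: confirm a secret exists without printing its contents.
secret_exists() {
  kubectl -n fleet-default get secret "$1" -o name >/dev/null 2>&1
}

# Print repository URL, branch, and referenced secret name, one per line.
gitrepo_source() {
  kubectl -n fleet-default get gitrepo "$1" \
    -o jsonpath='{.spec.repo}{"\n"}{.spec.branch}{"\n"}{.spec.clientSecretName}{"\n"}'
}

# Verify that the secret a GitRepo references is present in fleet-default.
check_gitrepo_auth() {
  local secret
  secret=$(kubectl -n fleet-default get gitrepo "$1" \
    -o jsonpath='{.spec.clientSecretName}')
  if [ -z "$secret" ]; then
    echo "$1: no clientSecretName set"
    return 1
  fi
  if secret_exists "$secret"; then
    echo "$1: secret $secret present"
  else
    echo "$1: secret $secret MISSING in fleet-default"
    return 1
  fi
}
```

Example call: check_gitrepo_auth ozilab-apps. This confirms presence only; whether the stored credentials are still accepted by the Git host has to be read from the GitRepo status or the controller logs.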

Bundles are stale after a repository change

kubectl get bundledeployments -A | grep -i error
kubectl -n fleet-default describe gitrepo k8s-apps
kubectl -n fleet-default describe gitrepo ozilab-apps

Check:

  • the changed overlay path still exists in the repository
  • Fleet path entries still match the standardized overlay layout
  • no migration renamed a path without updating the corresponding GitRepo manifest
  • for ozilab, the active path list still contains traefik and tailscale-operator and no longer depends on cloudflared
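The path checks above can be compared against a local checkout. This sketch assumes every spec.paths entry is a plain directory path (no glob patterns) and that the checkout passed in is on the branch the GitRepo tracks; gitrepo_paths and missing_paths are hypothetical helper names.

```shell
# Hypothetical helper: list the paths a GitRepo tracks, one per line.
gitrepo_paths() {
  kubectl -n fleet-default get gitrepo "$1" \
    -o jsonpath='{range .spec.paths[*]}{@}{"\n"}{end}'
}

# Report every tracked path that is missing from a local checkout.
# $1 = GitRepo name, $2 = root of a checkout of the tracked branch.
missing_paths() {
  local rc=0 entry
  while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    if [ ! -d "$2/$entry" ]; then
      echo "missing: $entry"
      rc=1
    fi
  done <<EOF
$(gitrepo_paths "$1")
EOF
  return "$rc"
}
```

Example call: missing_paths ozilab-apps /path/to/checkout. Any line of output points at a renamed or removed overlay that the GitRepo manifest still references.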

Cluster receives no updates even though GitRepo is healthy

kubectl get clusters.fleet.cattle.io -A
kubectl -n fleet-default get gitrepo k8s-apps -o yaml

Check:

  • target cluster name matches the value visible in Fleet (homelab, local, or jls)
  • cluster registration is still healthy in Rancher
  • the GitRepo polling interval (spec.pollingInterval) is short enough for the expected rollout speed
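The first check above can be automated by comparing the clusterName values in a GitRepo against the Cluster objects Fleet knows about. This is a sketch under the assumption that targets are addressed by clusterName; targets that use a selector instead produce no name and are skipped. All helper names are hypothetical.

```shell
# Hypothetical helper: print clusterName values referenced by a GitRepo's targets.
gitrepo_target_clusters() {
  kubectl -n fleet-default get gitrepo "$1" \
    -o jsonpath='{range .spec.targets[*]}{.clusterName}{"\n"}{end}'
}

# Hypothetical helper: print the names of Fleet Cluster objects in all namespaces.
registered_clusters() {
  kubectl get clusters.fleet.cattle.io -A \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# List targets that name a cluster Fleet does not know about.
unknown_targets() {
  local known target rc=0
  known=$(registered_clusters)
  for target in $(gitrepo_target_clusters "$1"); do
    if ! printf '%s\n' "$known" | grep -qxF -- "$target"; then
      echo "unknown cluster target: $target"
      rc=1
    fi
  done
  return "$rc"
}
```

Example call: unknown_targets k8s-apps. An empty result means every named target exists in Fleet; it does not prove the cluster agent is connected, so the Rancher registration check still applies.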

3. Disaster Recovery

Preconditions

  • Confirm whether the outage is limited to a single GitRepo, a single cluster target, or the entire Fleet control plane.
  • Verify access to Rancher and kubectl for the management cluster.
  • Confirm valid Git credentials are available for recreating gitrepo-auth.

Stateful workload recovery

Fleet bootstrap itself holds no state beyond the GitRepo manifests in the repository and the gitrepo-auth secret, so recovery means recreating both. Recovery sequence:

  1. Recreate gitrepo-auth in fleet-default.
  2. Reapply the required GitRepo manifests from the repository.
  3. Confirm GitRepo Ready conditions and bundle regeneration.
  4. Validate that target clusters resume reconciliation.
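The four steps above could be scripted roughly as follows. This is a sketch, not the validated procedure: it assumes gitrepo-auth holds HTTPS basic-auth credentials (an SSH key would need a different secret type), that the credentials are supplied in the hypothetical environment variables GIT_USERNAME and GIT_PASSWORD, and that a 10 minute wait is acceptable.

```shell
# Sketch of the recovery sequence. $@ = GitRepo manifest files to reapply.
recover_fleet_bootstrap() {
  local manifest

  # 1. Recreate gitrepo-auth. Rendering with --dry-run and applying the result
  #    makes the step safe to repeat if the secret already exists.
  kubectl -n fleet-default create secret generic gitrepo-auth \
    --type=kubernetes.io/basic-auth \
    --from-literal=username="$GIT_USERNAME" \
    --from-literal=password="$GIT_PASSWORD" \
    --dry-run=client -o yaml | kubectl apply -f - || return 1

  # 2. Reapply the GitRepo manifests passed as arguments.
  for manifest in "$@"; do
    kubectl apply -f "$manifest" || return 1
  done

  # 3. Wait for every GitRepo in the namespace to report Ready.
  kubectl -n fleet-default wait gitrepo --all \
    --for=condition=Ready --timeout=10m || return 1

  # 4. Show bundle state so target-cluster reconciliation can be confirmed.
  kubectl -n fleet-default get bundles
}
```

Example call from the repository root: recover_fleet_bootstrap fleet/*/gitrepo.yaml. The function stops at the first failing step, which keeps a half-applied recovery easy to spot.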

Cluster rebuild dependency order

  1. Rancher and Fleet control plane
  2. Registered target cluster connectivity
  3. gitrepo-auth secrets
  4. GitRepo bootstrap manifests
  5. Downstream workload reconciliation

4. Scaling and Resource Management

Preferred path: scale or tune the Fleet controller through the Rancher or Fleet installation source, not through ad hoc changes in this repository.

Use these commands to size the problem before changing controller resources:

kubectl -n cattle-fleet-system top pod
kubectl -n cattle-fleet-system get deploy
kubectl -n cattle-fleet-system describe deploy fleet-controller

Record:

  • controller CPU and memory pressure
  • queue backlog symptoms in logs
  • whether a scale issue is real or the failure is just git authentication or path drift

5. Maintenance Procedures

  • Rotate gitrepo-auth credentials in every namespace that holds a copy of the secret (at minimum fleet-default).
  • Review GitRepo path entries after repository layout migrations.
  • Validate clusterName values against the Rancher Fleet cluster inventory.

For each task, define:

  • Preconditions: access to management cluster and valid Git credentials
  • Impact window: bundle refresh pause or delayed reconciliation
  • Rollback path: restore previous secret or previous GitRepo manifest revision
  • Validation steps: confirm GitRepo Ready and successful bundle generation
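For the credential rotation task, the rollback path and the change can be combined in one step. The sketch below assumes HTTPS basic-auth credentials supplied in the hypothetical environment variables GIT_USERNAME and GIT_PASSWORD; rotate_gitrepo_auth is a hypothetical helper name.

```shell
# Sketch of a credential rotation that keeps a rollback copy.
# $1 = namespace, $2 = existing directory for the backup file.
# The backup contains the old credentials, so it is written owner-readable only.
rotate_gitrepo_auth() {
  local ns="$1" backup="$2/gitrepo-auth.$1.yaml"

  # Save the current secret first; this file is the rollback path.
  ( umask 077; kubectl -n "$ns" get secret gitrepo-auth -o yaml > "$backup" ) || return 1

  # Replace the secret in place with the new credentials.
  kubectl -n "$ns" create secret generic gitrepo-auth \
    --type=kubernetes.io/basic-auth \
    --from-literal=username="$GIT_USERNAME" \
    --from-literal=password="$GIT_PASSWORD" \
    --dry-run=client -o yaml | kubectl apply -f - || return 1

  echo "rotated gitrepo-auth in $ns; previous secret saved to $backup"
}
```

Example call: rotate_gitrepo_auth fleet-default /root/secret-backups. The saved file includes server-set metadata such as resourceVersion and uid, which should be removed before it is reapplied during a rollback. Validation afterwards is the same as for any change: confirm GitRepo Ready and successful bundle generation.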

6. Rollback Strategy

Document the fastest safe rollback path:

  • Reapply the previous fleet/*/gitrepo.yaml revision.
  • Restore the previous gitrepo-auth secret if credentials were changed.
  • Revert any path migration that removed or renamed an overlay still referenced by Fleet.
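The first rollback item can be done straight from Git history without changing the working tree. A minimal sketch, assuming it runs from the repository root and that the manifest path is given relative to that root; rollback_gitrepo_manifest is a hypothetical helper name.

```shell
# Sketch: reapply a GitRepo manifest as it existed at an earlier Git revision.
# $1 = revision (commit, tag, or e.g. HEAD~1)
# $2 = manifest path relative to the repository root
rollback_gitrepo_manifest() {
  local rev="$1" manifest="$2"

  # Show what differs between the old revision and HEAD before touching the cluster.
  git diff "$rev" HEAD -- "$manifest"

  # Apply the old revision from history; the checkout itself stays unchanged.
  git show "$rev:$manifest" | kubectl apply -f -
}
```

Example call: rollback_gitrepo_manifest HEAD~1 fleet/<name>/gitrepo.yaml, where <name> stands for the directory of the affected GitRepo. This only changes the GitRepo object in the cluster; a revert commit on main is still needed so that the repository and the cluster agree again.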

7. Post-Incident Actions

After recovery, always:

  1. Update CHANGELOG.md if recovery required a manual intervention or GitRepo rollback.
  2. Update the fleet service page if targets, polling intervals, or bootstrap assumptions changed.
  3. Update this runbook if the failure exposed a gap in GitRepo recovery or cluster-target validation.
  4. Capture follow-up work for path standardization when legacy layer7, oci, or oci-free overlay references caused the incident.