fleet Runbook

Metadata

Field	Value
Service	fleet
Criticality	Tier 0
Owner	Platform / GitOps owner
Namespace	fleet-default
Clusters	homelab, local, jls, ozirke01
Last validated	2026-04-22
Related service page	../services/fleet.md

Trigger Conditions

GitRepo reports NotReady or stalled bundle generation.
Expected manifest changes merged to main do not reach a target cluster.
BundleDeployment objects remain failed after a repository change.
Emergency drift reconciliation is required after a manual intervention.

1. Health Checks

Use these commands first to establish scope.

kubectl -n fleet-default get gitrepos,bundles,bundledeployments
kubectl -n fleet-default describe gitrepo k8s-apps
kubectl -n fleet-default describe gitrepo ozilab-apps
kubectl -n cattle-fleet-system logs deploy/fleet-controller --tail=200

Probe verification

Fleet bootstrap in this repository is expressed as GitRepo resources rather than a service with probes. Validate controller health indirectly by checking:

GitRepo Ready conditions
Recent Bundle and BundleDeployment updates
Fleet controller logs for git authentication, path, or cluster-target errors
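
The first check can be scripted. The following is a minimal sketch, assuming each GitRepo exposes a status condition of type Ready; the function names are illustrative and not part of this repository.

```shell
# Print "<name><TAB><Ready status>" for every GitRepo in fleet-default.
# Assumption: GitRepo status carries a condition whose type is "Ready".
gitrepo_ready_table() {
  kubectl -n fleet-default get gitrepos \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
}

# Read that table on stdin and print the names whose status is not "True".
# A missing condition (empty second column) is treated as not ready.
not_ready_gitrepos() {
  awk -F '\t' '$1 != "" && $2 != "True" { print $1 }'
}
```

Run `gitrepo_ready_table | not_ready_gitrepos`; empty output suggests every GitRepo currently reports Ready.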

2. Troubleshooting Workflows

GitRepo NotReady or authentication failures

kubectl -n fleet-default describe gitrepo k8s-apps
kubectl -n fleet-default describe gitrepo ozilab-apps
kubectl -n fleet-default describe gitrepo jls-apps 2>/dev/null || echo 'no separate jls gitrepo'
kubectl -n fleet-default get secret gitrepo-auth -o yaml

Check:

gitrepo-auth exists in the correct namespace (fleet-default)
repository URL and branch are still valid
target cluster selector or clusterName still matches Fleet inventory (homelab, local, jls)
special bootstraps such as ozilab still reference the intended repository, branch, and clientSecretName
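
These checks can be narrowed to the relevant fields without printing the secret contents. A sketch, assuming the GitRepo spec uses the repo, branch, and clientSecretName fields; the function names are hypothetical.

```shell
# Print repo URL, branch, and clientSecretName for one GitRepo, tab-separated.
# $1 = GitRepo name
gitrepo_source() {
  kubectl -n fleet-default get gitrepo "$1" \
    -o jsonpath='{.spec.repo}{"\t"}{.spec.branch}{"\t"}{.spec.clientSecretName}{"\n"}'
}

# Read one such line on stdin; succeed only if the third field equals $1.
uses_secret() {
  awk -F '\t' -v want="$1" '{ found = ($3 == want) } END { exit found ? 0 : 1 }'
}

# Confirm a secret exists in fleet-default without printing its data.
# $1 = secret name
secret_exists() {
  kubectl -n fleet-default get secret "$1" -o name >/dev/null 2>&1
}
```

For example, `gitrepo_source ozilab-apps | uses_secret gitrepo-auth && secret_exists gitrepo-auth` exits non-zero if the GitRepo points at a different secret or the secret is missing.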

Bundles are stale after a repository change

kubectl get bundledeployments -A | grep -i error
kubectl -n fleet-default describe gitrepo k8s-apps
kubectl -n fleet-default describe gitrepo ozilab-apps

Check:

the changed overlay path still exists in the repository
Fleet path entries still match the standardized overlay layout
no migration renamed a path without updating the corresponding GitRepo manifest
for ozilab, the active path list still contains traefik and tailscale-operator and no longer depends on cloudflared
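
The path checks can be run against a local checkout of the repository. A sketch, assuming the GitRepo path entries are literal directories (no glob patterns) relative to the repository root; the function names are illustrative.

```shell
# Print the path entries of one GitRepo, one per line.
# $1 = GitRepo name
gitrepo_paths() {
  kubectl -n fleet-default get gitrepo "$1" \
    -o jsonpath='{range .spec.paths[*]}{@}{"\n"}{end}'
}

# Read paths on stdin; print each one that is not a directory under the
# checkout root given as $1.
missing_paths() {
  while IFS= read -r p; do
    [ -n "$p" ] || continue
    [ -d "$1/$p" ] || printf '%s\n' "$p"
  done
}
```

From the repository root, `gitrepo_paths ozilab-apps | missing_paths .` lists entries that no longer exist on disk, and `gitrepo_paths ozilab-apps | grep -c cloudflared` shows whether any active path still refers to cloudflared.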

Cluster receives no updates even though GitRepo is healthy

kubectl get clusters.fleet.cattle.io -A
kubectl -n fleet-default get gitrepo k8s-apps -o yaml

Check:

target cluster name matches the value visible in Fleet (homelab, local, or jls)
cluster registration is still healthy in Rancher
controller polling interval is acceptable for the expected rollout speed
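
The name comparison can be done mechanically. A sketch, assuming the GitRepo selects clusters through clusterName entries under spec.targets; targets that use a selector instead yield empty lines and are skipped. The function names are hypothetical.

```shell
# Names of clusters registered with Fleet, one per line.
fleet_cluster_names() {
  kubectl get clusters.fleet.cattle.io -A \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# clusterName values targeted by one GitRepo, one per line.
# $1 = GitRepo name
gitrepo_target_names() {
  kubectl -n fleet-default get gitrepo "$1" \
    -o jsonpath='{range .spec.targets[*]}{.clusterName}{"\n"}{end}'
}

# Read target names on stdin; print those that do not appear as a whole
# line in the inventory file given as $1.
unmatched_targets() {
  grep -v '^$' | grep -Fxv -f "$1"
}
```

Save the inventory with `fleet_cluster_names > /tmp/inventory.txt`, then run `gitrepo_target_names k8s-apps | unmatched_targets /tmp/inventory.txt`. Any name printed is a target Fleet does not know about. The configured polling interval, if set, can be read with `kubectl -n fleet-default get gitrepo k8s-apps -o jsonpath='{.spec.pollingInterval}'`.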

3. Disaster Recovery

Preconditions

Confirm whether the outage is limited to a single GitRepo, a single cluster target, or the entire Fleet control plane.
Verify access to Rancher and kubectl for the management cluster.
Confirm valid Git credentials are available for recreating gitrepo-auth.

Stateful workload recovery

Fleet bootstrap holds no state beyond what is defined in the repository, so recovery consists of recreating credentials and reapplying manifests. Recovery sequence:

Recreate gitrepo-auth in fleet-default.
Reapply the required GitRepo manifests from the repository.
Confirm GitRepo Ready conditions and bundle regeneration.
Validate that target clusters resume reconciliation.
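
The sequence above could be scripted roughly as follows. This is a sketch under stated assumptions: gitrepo-auth is taken to be an HTTP basic-auth secret (an SSH deploy key would need the kubernetes.io/ssh-auth type and different keys), GIT_USERNAME and GIT_PASSWORD are hypothetical environment variables, and the 300s timeout is an arbitrary choice.

```shell
# Set KUBECTL='echo kubectl' to preview commands instead of running them.
# Note: the preview also prints the credential values.
KUBECTL="${KUBECTL:-kubectl}"

# Step 1: recreate gitrepo-auth in fleet-default.
# Assumption: basic auth, with GIT_USERNAME and GIT_PASSWORD set by the operator.
recreate_gitrepo_auth() {
  $KUBECTL -n fleet-default create secret generic gitrepo-auth \
    --type=kubernetes.io/basic-auth \
    --from-literal=username="$GIT_USERNAME" \
    --from-literal=password="$GIT_PASSWORD"
}

# Steps 2 and 3: reapply one GitRepo manifest, then wait for Ready.
# $1 = manifest path, $2 = GitRepo name
reapply_gitrepo() {
  $KUBECTL apply -f "$1" &&
    $KUBECTL -n fleet-default wait --for=condition=Ready "gitrepo/$2" --timeout=300s
}

# Step 4: list BundleDeployments in all namespaces to confirm that target
# clusters are reconciling again.
show_reconciliation() {
  $KUBECTL get bundledeployments -A
}
```

Call `recreate_gitrepo_auth` once, then `reapply_gitrepo <manifest> <name>` for each GitRepo, and finish with `show_reconciliation`.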

Cluster rebuild dependency order

Rancher and Fleet control plane
Registered target cluster connectivity
gitrepo-auth secrets
GitRepo bootstrap manifests
Downstream workload reconciliation

4. Scaling and Resource Management

Preferred path: scale or tune the Fleet controller through the Rancher or Fleet installation source, not through ad hoc changes in this repository.

Use these commands to size the problem before changing controller resources:

kubectl -n cattle-fleet-system top pod
kubectl -n cattle-fleet-system get deploy
kubectl -n cattle-fleet-system describe deploy fleet-controller

Record:

controller CPU and memory pressure
queue backlog symptoms in logs
whether the problem is genuinely one of scale, or is actually a git authentication failure or path drift
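
To make the memory observation repeatable, the `kubectl top pod` output can be filtered against a threshold. A sketch, assuming the MEMORY column is reported in Mi; the function name and any threshold passed to it are illustrative, not tuned values.

```shell
# Read `kubectl top pod` output on stdin; print "<pod> <memory>" for each
# pod whose memory usage exceeds the threshold in MiB given as $1.
# Rows whose memory value does not end in "Mi" are ignored.
pods_over_memory() {
  awk -v limit="$1" 'NR > 1 && $3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit + 0) print $1, $3
  }'
}
```

For example, `kubectl -n cattle-fleet-system top pod | pods_over_memory 512`. If nothing is printed and the logs show authentication or path errors, the failure is probably not a scaling problem.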

5. Maintenance Procedures

Rotate gitrepo-auth credentials in every namespace that holds a copy of the secret (at minimum fleet-default).
Review GitRepo path entries after repository layout migrations.
Validate clusterName values against the Rancher Fleet cluster inventory.

For each task, define:

Preconditions: access to management cluster and valid Git credentials
Impact window: bundle refresh pause or delayed reconciliation
Rollback path: restore previous secret or previous GitRepo manifest revision
Validation steps: confirm GitRepo Ready and successful bundle generation

6. Rollback Strategy

The fastest safe rollback path:

Reapply the previous fleet/*/gitrepo.yaml revision.
Restore the previous gitrepo-auth secret if credentials were changed.
Revert any path migration that removed or renamed an overlay still referenced by Fleet.
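
The first rollback step can be done without checking out an older commit. A sketch, assuming it is run from inside a checkout of this repository and that the manifest sets its own namespace; the function name is hypothetical.

```shell
# Override these to preview or test; by default the real tools are used.
KUBECTL="${KUBECTL:-kubectl}"
GIT="${GIT:-git}"

# Apply a GitRepo manifest as it existed at an earlier commit.
# $1 = git revision (for example HEAD~1)
# $2 = manifest path relative to the repository root
reapply_gitrepo_at() {
  $GIT show "$1:$2" | $KUBECTL apply -f -
}
```

Usage follows the pattern `reapply_gitrepo_at <revision> fleet/<cluster>/gitrepo.yaml`, where the revision is the last commit known to reconcile cleanly. Follow it with the GitRepo Ready checks from section 1.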

7. Post-Incident Actions

After recovery, always:

Update CHANGELOG.md if recovery required a manual intervention or GitRepo rollback.
Update the fleet service page if targets, polling intervals, or bootstrap assumptions changed.
Update this runbook if the failure exposed a gap in GitRepo recovery or cluster-target validation.
Capture follow-up work for path standardization when legacy layer7, oci, or oci-free overlay references caused the incident.