renovate Runbook
Metadata
| Field | Value |
|---|---|
| Service | renovate |
| Criticality | Tier 2 |
| Owner | Platform owner |
| Namespace | renovate |
| Clusters | local |
| Last validated | 2026-06-30 |
| Related service page | ../services/renovate.md |
Trigger Conditions
- No dependency-update PRs have appeared on the `dev` branch for longer than expected.
- The Renovate Dependency Dashboard issue in GitHub has stopped updating.
- Recent Jobs under the `renovate` CronJob are failing.
- A token rotation, expiry, or suspected GitHub PAT compromise occurred.
1. Health Checks
```bash
kubectl -n renovate get cronjob,job,pod
kubectl -n renovate get events --sort-by='.lastTimestamp' | tail -n 20
kubectl -n renovate logs -l app.kubernetes.io/name=renovate --tail=200
```
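Confirm the last scheduled run produced a successful Job. In the `get cronjob,job,pod` output, the newest Job should show all of its completions reached (for example `1/1` under `COMPLETIONS`).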
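
The check can be scripted as a pair of shell helpers. This is a minimal sketch. It assumes the CronJob is named `renovate`, as in the manual-run command in section 2, and that GNU `date` is available. The function names are hypothetical. `lastSuccessfulTime` is a CronJob status field in `batch/v1`, and what counts as "too old" depends on your schedule.

```bash
# Hypothetical helpers; source them, then run renovate_last_success.

# Whole hours between two RFC 3339 timestamps; the second defaults to now.
# Relies on GNU date -d parsing (on macOS, use gdate from coreutils).
hours_between() {
  local from_s to_s
  from_s=$(date -u -d "$1" +%s) || return 1
  to_s=$(date -u -d "${2:-now}" +%s) || return 1
  echo $(( (to_s - from_s) / 3600 ))
}

# List Jobs oldest-first, then report when the CronJob last succeeded.
renovate_last_success() {
  local last
  kubectl -n renovate get jobs --sort-by=.metadata.creationTimestamp || return 1
  last=$(kubectl -n renovate get cronjob renovate \
    -o jsonpath='{.status.lastSuccessfulTime}') || return 1
  if [ -z "$last" ]; then
    echo "no successful run recorded in the CronJob status" >&2
    return 1
  fi
  echo "last successful run: $last ($(hours_between "$last") h ago)"
}
```

If the reported age is well past the schedule interval, treat it as the "stopped updating" trigger.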
2. Troubleshooting Workflows
Check the token secret and config before changing manifests.
Common causes:
- No PRs opening: `RENOVATE_TOKEN` is missing, expired, or lacks the `repo` scope. Recreate `overlays/local/.renovate.env` from `.renovate.env.example` with a fresh GitHub PAT, then re-apply the overlay or let Fleet reconcile. Do not print the token.
- Jobs failing with auth errors: the GitHub token is invalid or revoked. Rotate it.
- Jobs failing with rate-limit messages: lower `prConcurrentLimit` to space out PR creation, or reduce the scan frequency in `config.json`, after reviewing the Renovate docs.
- Invalid config: a malformed `config.json` prevents startup. Validate the JSON locally and rebuild the overlay.
- Image pull failure: the pinned tag in `overlays/local/kustomization.yaml` is unavailable; pick a valid `renovate/renovate` tag.
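
For the two token-related causes, the token can be checked without printing it. This is a minimal sketch. It assumes the env file holds an unquoted `RENOVATE_TOKEN=...` line and that the token is a classic PAT. GitHub reports a classic PAT's scopes in the `X-OAuth-Scopes` response header. Fine-grained tokens use per-repository permissions and may not report scopes this way. The function names are hypothetical.

```bash
# Hypothetical helpers for checking the Renovate PAT without echoing it.

# Succeed if a comma-separated scope list contains exactly "repo".
has_repo_scope() {
  local scope IFS=','
  for scope in $1; do
    scope="${scope//[[:space:]]/}"   # drop spaces around each entry
    [ "$scope" = "repo" ] && return 0
  done
  return 1
}

# Call GET /user with the token from the env file; print status and scopes only.
check_renovate_token() {
  local env_file="${1:-overlays/local/.renovate.env}" token headers status scopes
  token=$(sed -n 's/^RENOVATE_TOKEN=//p' "$env_file" 2>/dev/null | head -n 1)
  if [ -z "$token" ]; then
    echo "RENOVATE_TOKEN is missing from $env_file" >&2
    return 1
  fi
  # Feed the header on stdin (-H @-) so the token is not in curl's argv.
  headers=$(printf 'Authorization: Bearer %s\n' "$token" \
    | curl -sS -o /dev/null -D - -H @- https://api.github.com/user) || return 1
  headers=$(printf '%s\n' "$headers" | tr -d '\r')
  status=$(printf '%s\n' "$headers" | awk 'NR == 1 { print $2 }')
  scopes=$(printf '%s\n' "$headers" \
    | awk -F': *' 'tolower($1) == "x-oauth-scopes" { print $2 }')
  echo "HTTP ${status:-?}; scopes: ${scopes:-none reported}"
  [ "$status" = "200" ] && has_repo_scope "$scopes"
}
```

Run `check_renovate_token` from the `renovate/` directory. A `401` status usually means the token is invalid or revoked. A `200` without `repo` in the scopes points to the missing-scope case.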
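Force a manual run to reproduce a failure:
```bash
# Create a one-off Job from the CronJob's template, then follow its logs.
JOB="manual-$(date +%s)"
kubectl -n renovate create job --from=cronjob/renovate "$JOB"
# Targeting job/NAME lets kubectl wait up to the timeout for the Job's pod to appear.
kubectl -n renovate logs -f "job/$JOB" --pod-running-timeout=2m
```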
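
`kubectl logs -f` ends when the container exits, but it does not say whether the Job succeeded. The sketch below adds that follow-up check. It uses hypothetical helper names and assumes the `$JOB` variable from the manual-run commands. `Complete` and `Failed` are the terminal Job condition types. Newer Kubernetes versions may also report intermediate types such as `SuccessCriteriaMet` or `FailureTarget`, which the classifier ignores.

```bash
# Hypothetical helpers for reporting a manual Renovate Job's outcome.

# Classify a Job from the space-separated types of its True conditions.
job_outcome() {
  case " $1 " in
    *" Complete "*) echo succeeded ;;
    *" Failed "*)   echo failed ;;
    *)              echo running ;;
  esac
}

# Poll a Job in the renovate namespace until it finishes or the timeout passes.
wait_for_renovate_job() {
  local job="$1" timeout_s="${2:-1800}" waited=0 types outcome
  while [ "$waited" -lt "$timeout_s" ]; do
    types=$(kubectl -n renovate get job "$job" \
      -o jsonpath='{.status.conditions[?(@.status=="True")].type}') || return 1
    outcome=$(job_outcome "$types")
    if [ "$outcome" != running ]; then
      echo "$job: $outcome"
      [ "$outcome" = succeeded ]
      return
    fi
    sleep 10
    waited=$((waited + 10))
  done
  echo "$job: still running after ${timeout_s}s" >&2
  return 1
}
```

For example, `wait_for_renovate_job "$JOB"` exits non-zero if the run failed, so the check can gate a scripted recovery.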
3. Disaster Recovery
- Confirm whether the issue is token-only, config-only, or cluster-wide.
- Recreate `renovate-env` with a valid GitHub PAT.
- Reconcile `renovate/overlays/local` (or let Fleet reconcile).
- Trigger a manual run and confirm a PR appears on `dev` when an update exists.

Renovate is stateless, so recovery means recreating the Secret and re-running the Job.
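
Recreating the Secret can be scripted so the new PAT is never echoed to the terminal. This is a minimal sketch with several assumptions. It runs from the `renovate/` directory, and `overlays/local/.renovate.env` has already been created from `.renovate.env.example`. It also assumes the overlay turns that file into the `renovate-env` Secret when applied with `kubectl apply -k`. The function names are hypothetical.

```bash
# Hypothetical helpers for rotating RENOVATE_TOKEN in the overlay's env file.

# Set KEY=VALUE in an env file, dropping any existing KEY= line.
# Writes a temp file next to the target, then renames it into place.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  {
    grep -v "^${key}=" "$file" 2>/dev/null || true
    printf '%s=%s\n' "$key" "$value"
  } > "$tmp"
  chmod 600 "$tmp" && mv "$tmp" "$file"
}

# Prompt silently for the new PAT, store it, and re-apply the overlay.
rotate_renovate_token() {
  local env_file="${1:-overlays/local/.renovate.env}" token
  if [ ! -f "$env_file" ]; then
    echo "create $env_file from .renovate.env.example first" >&2
    return 1
  fi
  read -rsp "New GitHub PAT: " token && echo
  if [ -z "$token" ]; then
    echo "empty token; nothing changed" >&2
    return 1
  fi
  set_env_var "$env_file" RENOVATE_TOKEN "$token" || return 1
  kubectl apply -k "$(dirname "$env_file")"
}
```

Running pods keep the old value, so trigger a manual run afterwards to confirm the new token works.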
4. Scaling and Resource Management
Resource requests and limits are set on the CronJob's pod template. If a run is OOM-killed, raise them in Git. `concurrencyPolicy: Forbid` ensures that only one run executes at a time.
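
Before raising limits, confirm that the failure really was an OOM kill. This is a minimal sketch. It assumes the Job pods carry the `app.kubernetes.io/name=renovate` label used in the health checks. The kubelet records such terminations with the reason `OOMKilled`. The helper names are hypothetical.

```bash
# Hypothetical helpers for checking Renovate runs for OOM kills.

# Count "pod<TAB>reasons" lines on stdin whose reasons include OOMKilled.
count_oomkilled_pods() {
  awk -F'\t' '$2 ~ /OOMKilled/ { n++ } END { print n + 0 }'
}

# Show each pod's current and previous termination reasons, the number of
# OOM-killed pods, and the CronJob's concurrencyPolicy (expected: Forbid).
renovate_oom_report() {
  local lines policy
  lines=$(kubectl -n renovate get pods -l app.kubernetes.io/name=renovate \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.terminated.reason} {.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}') \
    || return 1
  printf '%s\n' "$lines"
  echo "OOM-killed pods: $(printf '%s\n' "$lines" | count_oomkilled_pods)"
  policy=$(kubectl -n renovate get cronjob renovate -o jsonpath='{.spec.concurrencyPolicy}')
  echo "concurrencyPolicy: ${policy:-unset}"
}
```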
5. Maintenance Procedures
- Rotate the GitHub PAT on a schedule or after any suspected exposure. After rotation, recreate `renovate-env`.
- Review major-update PRs manually; major updates are disabled by default in this setup and require a config change to enable.
- When changing `config.json`, rebuild the overlay locally before pushing so Fleet reconciles a known-good state.
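
A pre-push check can be scripted as well. This is a minimal sketch. It assumes `config.json` and `overlays/local` are paths relative to the `renovate/` directory (adjust them to your layout) and that `python3` is on the PATH. The function names are hypothetical. It catches only JSON syntax errors and overlays that fail to render. Renovate also ships a `renovate-config-validator` command for checking option names and values.

```bash
# Hypothetical pre-push helpers; run from the renovate/ directory.

# Fail (non-zero) if the file is not syntactically valid JSON.
validate_json() {
  python3 -m json.tool "$1" > /dev/null
}

# Check config.json syntax, then render the overlay without applying it.
preflight_renovate() {
  local config="${1:-config.json}" overlay="${2:-overlays/local}"
  if ! validate_json "$config"; then
    echo "invalid JSON: $config" >&2
    return 1
  fi
  if ! kubectl kustomize "$overlay" > /dev/null; then
    echo "overlay failed to render: $overlay" >&2
    return 1
  fi
  echo "ok: $config parses and $overlay renders"
}
```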
6. Rollback Strategy
- Revert the `renovate/` Git changes to restore the previous image tag, schedule, and config.
- Recreate `renovate-env` if a token rotation happened in the same window.
- Because the workload is stateless, rollback is a redeploy from Git.
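
The revert step can be wrapped in a guard. This is a minimal sketch. It assumes a local clone of the Fleet-watched repository with the manifests under `renovate/`. The function name is hypothetical. The helper refuses any commit that also touches files outside `renovate/`, so a rollback cannot silently undo unrelated changes. To find candidate commits, run `git log --oneline -n 10 -- renovate/`.

```bash
# Hypothetical helper: revert one commit only if all its changes are under renovate/.
revert_renovate_commit() {
  local sha="$1" outside
  if [ -z "$sha" ]; then
    echo "usage: revert_renovate_commit <commit-sha>" >&2
    return 1
  fi
  outside=$(git diff-tree --no-commit-id --name-only -r "$sha" | grep -v '^renovate/' || true)
  if [ -n "$outside" ]; then
    echo "commit $sha also changes files outside renovate/:" >&2
    printf '%s\n' "$outside" >&2
    return 1
  fi
  # Creates a new commit that undoes $sha; push it and let Fleet reconcile.
  git revert --no-edit "$sha"
}
```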
7. Post-Incident Actions
- Add a changelog fragment for any recovery or security-relevant intervention (for example, a token rotation).
- Update the service page if scope, schedule, or secret handling changed.
- Update this runbook with commands or checks learned during the incident.
- Record results in `context/progress-tracker.md`.