tailscale-operator Runbook
Metadata
| Field | Value |
|---|---|
| Service | tailscale-operator |
| Criticality | Tier 1 |
| Owner | Platform / Networking owner |
| Namespace | tailscale |
| Clusters | ozilab |
| Last validated | 2026-04-22 |
| Related service page | ../services/tailscale-operator.md |
Trigger Conditions
- Public ozilab routes stop responding after the Cloudflare to Tailscale cutover.
- The generated ts.net hostname does not resolve or does not answer on 443.
- The operator fails to join the tailnet.
- Funnel is enabled in manifests but not actually advertised publicly.
1. Health Checks
Use these commands first to establish scope.
kubectl -n tailscale get deploy,pod,svc,secret
kubectl -n tailscale get secret operator-oauth
kubectl -n tailscale logs deploy/operator --tail=200
kubectl -n traefik get ingress traefik-tailscale-funnel -o wide
kubectl -n traefik describe ingress traefik-tailscale-funnel
kubectl get ingressclass tailscale
Probe verification
The operator itself is a standard deployment. The generated ingress proxy pods are created dynamically in the tailscale namespace.
Record:
- whether the operator is healthy but the ingress proxy is missing
- whether the ingress has a hostname but no advertised port 443 yet
- whether the issue is cluster-side or Cloudflare-side
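
The observations recorded above can be reduced to a small decision helper. This is only a sketch: `classify_scope` is a hypothetical function name, and its four inputs are assumed to be gathered with the health-check commands at the top of this section.

```shell
#!/bin/sh
# Hypothetical helper: classify where the outage most likely sits.
#   $1 = ready operator replicas (empty or 0 when unhealthy)
#   $2 = number of generated ingress proxy pods
#   $3 = hostname reported in the ingress status (may be empty)
#   $4 = port reported in the ingress status (may be empty)
classify_scope() {
  ready="${1:-0}"
  proxies="${2:-0}"
  host="$3"
  port="$4"

  if [ "$ready" -lt 1 ]; then
    echo "operator-unhealthy"
  elif [ "$proxies" -lt 1 ]; then
    echo "operator-healthy-proxy-missing"
  elif [ -z "$host" ]; then
    echo "proxy-present-no-hostname"
  elif [ "$port" != "443" ]; then
    echo "hostname-without-443"
  else
    # Cluster side looks complete; continue with the Cloudflare checks.
    echo "cluster-side-ok-check-cloudflare"
  fi
}

# Example wiring (assumes kubectl access to the ozilab cluster):
# ready=$(kubectl -n tailscale get deploy operator -o 'jsonpath={.status.readyReplicas}')
# proxies=$(kubectl -n tailscale get pod -l tailscale.com/parent-resource=traefik-tailscale-funnel -o name | wc -l | tr -d ' ')
# host=$(kubectl -n traefik get ingress traefik-tailscale-funnel -o 'jsonpath={.status.loadBalancer.ingress[0].hostname}')
# port=$(kubectl -n traefik get ingress traefik-tailscale-funnel -o 'jsonpath={.status.loadBalancer.ingress[0].ports[0].port}')
# classify_scope "$ready" "$proxies" "$host" "$port"
```

The result labels are illustrative; they map onto the troubleshooting workflows in the next section.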
2. Troubleshooting Workflows
Operator fails to authenticate or create proxies
kubectl -n tailscale logs deploy/operator --tail=400
kubectl -n tailscale get secret operator-oauth -o yaml
kubectl -n tailscale describe deploy operator
Check:
- operator-oauth exists in namespace tailscale
- operator-oauth contains client_id and client_secret keys
- OAuth client has Devices Core, Auth Keys, and Services write scopes
- tag ownership in the tailnet policy allows tag:k8s-operator and tag:k8s
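
The first two checks can be scripted without printing the credential values. The sketch below is an assumption-laden illustration: `secret_has_key` and `check_operator_oauth` are hypothetical helper names, and only the presence of each key is tested. OAuth scopes and tag ownership still have to be confirmed in the Tailscale admin console.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the named key exists and is non-empty
# in the operator-oauth secret. The value itself is never printed.
secret_has_key() {
  value=$(kubectl -n tailscale get secret operator-oauth \
    -o "jsonpath={.data.$1}" 2>/dev/null)
  [ -n "$value" ]
}

# Report on both keys the operator expects; return non-zero if any is missing.
check_operator_oauth() {
  missing=0
  for key in client_id client_secret; do
    if secret_has_key "$key"; then
      echo "ok: $key present"
    else
      echo "MISSING: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (assumes kubectl access to the cluster):
# check_operator_oauth
```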
Ingress exists but the proxy does not expose traffic
kubectl -n traefik get ingress traefik-tailscale-funnel -o yaml
kubectl -n tailscale get pod --selector=tailscale.com/parent-resource-type=ingress,tailscale.com/parent-resource=traefik-tailscale-funnel,tailscale.com/parent-resource-ns=traefik
kubectl -n tailscale logs <proxy-pod-name> --tail=200
Check:
- ingressClassName is tailscale
- tailscale.com/funnel annotation is present
- ingress status contains the generated hostname and port 443
- proxy logs do not show certificate provisioning or policy errors
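
The first three checks can be read straight from the ingress object with JSONPath instead of scanning the YAML by eye. This is a minimal sketch: `ingress_field` and `check_funnel_ingress` are hypothetical helper names, and it assumes the operator reports the hostname and port under `.status.loadBalancer.ingress[0]`. Proxy logs still need to be reviewed manually.

```shell
#!/bin/sh
# Hypothetical helper: read one field of the Funnel ingress via JSONPath.
ingress_field() {
  kubectl -n traefik get ingress traefik-tailscale-funnel \
    -o "jsonpath=$1" 2>/dev/null
}

# Verify class, Funnel annotation, hostname, and port; return non-zero on any failure.
check_funnel_ingress() {
  failed=0
  class=$(ingress_field '{.spec.ingressClassName}')
  funnel=$(ingress_field '{.metadata.annotations.tailscale\.com/funnel}')
  host=$(ingress_field '{.status.loadBalancer.ingress[0].hostname}')
  port=$(ingress_field '{.status.loadBalancer.ingress[0].ports[0].port}')

  [ "$class" = "tailscale" ] || { echo "FAIL: ingressClassName is '$class'"; failed=1; }
  [ -n "$funnel" ] || { echo "FAIL: tailscale.com/funnel annotation missing"; failed=1; }
  [ -n "$host" ] || { echo "FAIL: no hostname in ingress status"; failed=1; }
  [ "$port" = "443" ] || { echo "FAIL: port 443 not advertised (got '$port')"; failed=1; }

  [ "$failed" -eq 0 ] && echo "OK: $host:443"
  return "$failed"
}

# Usage (assumes kubectl access to the cluster):
# check_funnel_ingress
```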
Proxy cannot reach Traefik backend
kubectl -n traefik get svc traefik -o yaml
kubectl -n tailscale exec -it <proxy-pod-name> -- sh
# Run inside the proxy pod shell opened by the previous command:
curl -k https://traefik.traefik.svc.cluster.local:443/ping
Check:
- traefik service still exposes port 443
- no NetworkPolicy blocks traffic from tailscale namespace to traefik namespace
- Traefik itself is healthy and answers on /ping through port 443
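
The port check can be automated rather than read from the service YAML. The sketch below assumes the service is named `traefik` in the `traefik` namespace, as in the commands above; `traefik_exposes_443` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the traefik service lists port 443.
traefik_exposes_443() {
  # jsonpath [*] prints the ports space-separated, for example "80 443".
  ports=$(kubectl -n traefik get svc traefik \
    -o 'jsonpath={.spec.ports[*].port}' 2>/dev/null)
  for p in $ports; do
    [ "$p" = "443" ] && return 0
  done
  return 1
}

# Usage (assumes kubectl access to the cluster):
# traefik_exposes_443 || echo "traefik service does not expose 443"
# List policies that could block tailscale -> traefik traffic:
# kubectl -n traefik get networkpolicy
```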
Funnel enabled but not reachable from the public internet
kubectl -n tailscale logs <proxy-pod-name> --tail=200
kubectl -n traefik get ingress traefik-tailscale-funnel -o wide
Check:
- tailnet nodeAttrs grant funnel to tag:k8s
- HTTPS and MagicDNS are enabled on the tailnet
- the generated ts.net hostname is the one used by Cloudflare upstream configuration
Cloudflare hostname still fails after Funnel is healthy
Check outside the cluster:
- the public DNS record is a proxied CNAME to `ozilab-edge.<tailnet>.ts.net`
- Cloudflare SSL mode is Full (strict)
- the upstream TLS SNI is `ozilab-edge.<tailnet>.ts.net` when your plan supports SNI override
- the upstream HTTP Host header stays on the public hostname so Traefik routing still matches
Interpret symptoms carefully:
- 526 usually means Cloudflare is validating the wrong hostname or receiving the wrong certificate chain from the origin path.
- a Traefik 404 or default backend usually means Cloudflare reached the Funnel proxy but sent the wrong Host header upstream.
- 522 or 523 usually means the ts.net target is wrong, not yet ready, or not reachable from Cloudflare.
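
The symptom list above can be kept at hand as a lookup. This sketch only restates the three interpretations from this runbook; `explain_edge_status` is a hypothetical helper name, and any other status code falls through to a generic hint.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status seen at the public hostname
# to the most likely cause described in this runbook.
explain_edge_status() {
  case "$1" in
    526)
      echo "certificate validation failed: check upstream SNI and SSL mode" ;;
    522|523)
      echo "origin unreachable: check the ts.net target and Funnel readiness" ;;
    404)
      echo "wrong Host header upstream: check Traefik route matcher and origin rule" ;;
    *)
      echo "not covered here: inspect proxy and Traefik logs" ;;
  esac
}

# Usage:
# explain_edge_status 526
```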
Request mapping to preserve:
| Layer | Expected value |
|---|---|
| Public hostname | app.example.com |
| Cloudflare proxied CNAME target | `ozilab-edge.<tailnet>.ts.net` |
| Upstream TLS SNI | `ozilab-edge.<tailnet>.ts.net` |
| Upstream HTTP Host header | app.example.com |
| Traefik route matcher | app.example.com |
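
The mapping in the table can be reproduced from a workstation to test the origin path without Cloudflare in between. This is a sketch under assumptions: `probe_origin` is a hypothetical helper, and it relies on curl taking the TLS SNI from the URL host while the `Host` header is overridden separately. Whether the Funnel proxy accepts this request exactly as it accepts Cloudflare's depends on your setup.

```shell
#!/bin/sh
# Hypothetical helper: request the Funnel endpoint the way the table describes.
#   $1 = generated ts.net hostname (used for DNS, TCP connect, and TLS SNI)
#   $2 = public hostname (sent as the HTTP Host header)
# Prints only the HTTP status code.
probe_origin() {
  ts_host="$1"
  public_host="$2"
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 \
    -H "Host: ${public_host}" "https://${ts_host}/"
}

# Usage (substitute the real generated hostname):
# probe_origin "$TS_HOST" app.example.com
```

A Traefik 404 from this probe points at the route matcher, while a TLS or connection error points at the Funnel side.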
3. Disaster Recovery
Preconditions
- Identify whether the outage is Tailscale auth, Funnel exposure, Traefik backend reachability, or Cloudflare origin routing.
- Verify that the previous cloudflared manifests are still available for rollback.
- Verify the Tailscale admin console still has the operator and proxy devices registered.
Stateful workload recovery
This service is stateless. Recovery is configuration-focused:
- Restore valid OAuth credentials.
- Restore tailnet tag and Funnel policy.
- Reapply the ozilab overlay.
- Validate the generated ts.net hostname directly.
- Re-enable Cloudflare origin routing to the Tailscale endpoint without changing the application Host header.
Cluster rebuild dependency order
- Namespace tailscale with privileged PSA labels
- Tailscale operator deployment and CRDs
- Traefik service availability on port 443
- Funnel ingress resource in the traefik namespace
- Cloudflare DNS and origin configuration
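
The cluster-side part of this order can be walked with a script that stops at the first missing dependency. This is a sketch, not a full validation: `require` and `check_rebuild_order` are hypothetical helper names, the checks only confirm that each object exists, and the Cloudflare step has to be verified outside the cluster.

```shell
#!/bin/sh
# Hypothetical helper: run a command quietly and report whether it succeeded.
#   $1 = description, remaining arguments = command to run
require() {
  desc="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok: $desc"
  else
    echo "missing: $desc"
    return 1
  fi
}

# Walk the dependency order; the && chain stops at the first failure.
check_rebuild_order() {
  require "namespace tailscale" kubectl get namespace tailscale &&
  require "operator deployment" kubectl -n tailscale get deploy operator &&
  require "tailscale ingress class" kubectl get ingressclass tailscale &&
  require "traefik service" kubectl -n traefik get svc traefik &&
  require "funnel ingress" kubectl -n traefik get ingress traefik-tailscale-funnel
}

# Usage (assumes kubectl access to the rebuilt cluster):
# check_rebuild_order
```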
4. Scaling and Resource Management
Preferred path: change the overlay values in Git and reconcile through Fleet.
Use these commands to size the problem before changing resources:
kubectl -n tailscale top pod
kubectl -n tailscale get deploy operator -o yaml
kubectl -n tailscale describe deploy operator
Record:
- whether the operator is CPU or memory constrained
- whether additional ingress redundancy is needed via ProxyGroup later
- whether the corporate network path is forcing DERP relay and reducing throughput
5. Maintenance Procedures
- Rotate the Tailscale OAuth client.
- Update the local tailscale-operator/overlays/ozilab/.operator-oauth.env file after rotation.
- Review tailnet ACLs and Funnel nodeAttrs.
- Verify the Cloudflare CNAME and origin rule still point to the active ts.net hostname.
- Test direct access to the Funnel hostname and proxied access through Cloudflare after upgrades.
For each task, define:
- Preconditions: valid tailnet admin access and Git access
- Impact window: short ingress interruption if the proxy device is recreated
- Rollback path: revert the manifest change or temporarily switch back to cloudflared
- Validation steps: direct ts.net access, then Cloudflare-hosted domain access
6. Rollback Strategy
Rollback path for a failed migration:
- Re-add cloudflared to fleet/layer7/gitrepo-ozilab.yaml.
- Remove tailscale-operator from the same GitRepo path list.
- Delete the traefik-tailscale-funnel ingress after confirming cloudflared is healthy.
- Repoint Cloudflare upstream to the previous cloudflared path.
7. Post-Incident Actions
After recovery, always:
- Update CHANGELOG.md if the edge path was manually switched or credentials were rotated.
- Update the service page if Cloudflare integration details or Tailscale policy requirements changed.
- Update this runbook if the failure mode was not covered.
- Capture follow-up work if a ProxyGroup-based HA design becomes necessary.