tailscale-operator Runbook

Metadata

Field	Value
Service	tailscale-operator
Criticality	Tier 1
Owner	Platform / Networking owner
Namespace	tailscale
Clusters	ozilab
Last validated	2026-04-22
Related service page	../services/tailscale-operator.md

Trigger Conditions

Public ozilab routes stop responding after the Cloudflare-to-Tailscale cutover.
The generated ts.net hostname does not resolve or does not answer on 443.
The operator fails to join the tailnet.
Funnel is enabled in manifests but not actually advertised publicly.

1. Health Checks

Use these commands first to establish scope.

kubectl -n tailscale get deploy,pod,svc,secret
kubectl -n tailscale get secret operator-oauth
kubectl -n tailscale logs deploy/operator --tail=200
kubectl -n traefik get ingress traefik-tailscale-funnel -o wide
kubectl -n traefik describe ingress traefik-tailscale-funnel
kubectl get ingressclass tailscale

Probe verification

The operator itself is a standard deployment. The generated ingress proxy pods are created dynamically in the tailscale namespace.

kubectl -n tailscale get pods -o wide
kubectl -n tailscale describe pod <operator-or-proxy-pod>

Record:

whether the operator is healthy but the ingress proxy is missing
whether the ingress has a hostname but no advertised port 443 yet
whether the issue is cluster-side or Cloudflare-side

2. Troubleshooting Workflows

Operator fails to authenticate or create proxies

kubectl -n tailscale logs deploy/operator --tail=400
kubectl -n tailscale get secret operator-oauth -o yaml
kubectl -n tailscale describe deploy operator

Check:

operator-oauth exists in namespace tailscale
operator-oauth contains client_id and client_secret keys
OAuth client has Devices Core, Auth Keys, and Services write scopes
tag ownership in the tailnet policy allows tag:k8s-operator and tag:k8s
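
The key-presence check above can be done without printing the credential values, which `-o yaml` would expose in base64. The following is a minimal sketch, assuming a POSIX shell; `secret_keys` and `check_oauth_keys` are hypothetical helper names, not part of kubectl or the operator.

```shell
# List only the key names stored in a secret, never the values.
# $1 = namespace, $2 = secret name
secret_keys() {
  kubectl -n "$1" get secret "$2" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}

# Return 0 if operator-oauth holds both required keys, 1 otherwise.
check_oauth_keys() {
  keys=$(secret_keys tailscale operator-oauth) || return 1
  for required in client_id client_secret; do
    # -x matches the whole line so "client_id_old" would not count
    printf '%s\n' "$keys" | grep -qx "$required" || {
      echo "missing key: $required" >&2
      return 1
    }
  done
  echo "operator-oauth has client_id and client_secret"
}
```

Run `check_oauth_keys` after sourcing the snippet. A non-zero exit status means a key is absent or the secret could not be read; it says nothing about whether the credentials are still valid on the Tailscale side.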

Ingress exists but the proxy does not expose traffic

kubectl -n traefik get ingress traefik-tailscale-funnel -o yaml
kubectl -n tailscale get pod --selector=tailscale.com/parent-resource-type=ingress,tailscale.com/parent-resource=traefik-tailscale-funnel,tailscale.com/parent-resource-ns=traefik
kubectl -n tailscale logs <proxy-pod-name> --tail=200

Check:

ingressClassName is tailscale
tailscale.com/funnel annotation is present
ingress status contains the generated hostname and port 443
proxy logs do not show certificate provisioning or policy errors
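
The first three checks can be read straight from the ingress object instead of scanning the full YAML. This is a sketch only: the JSONPath expressions follow the networking.k8s.io/v1 Ingress schema, and `ingress_field` and `funnel_ingress_summary` are hypothetical helper names.

```shell
# Read one field from the Funnel ingress using a JSONPath expression.
ingress_field() {
  kubectl -n traefik get ingress traefik-tailscale-funnel -o "jsonpath=$1"
}

# Print the fields the checklist asks about, one per line.
funnel_ingress_summary() {
  echo "class=$(ingress_field '{.spec.ingressClassName}')"
  # Dots inside the annotation key must be escaped for JSONPath.
  echo "funnel=$(ingress_field '{.metadata.annotations.tailscale\.com/funnel}')"
  echo "hostname=$(ingress_field '{.status.loadBalancer.ingress[0].hostname}')"
  echo "ports=$(ingress_field '{.status.loadBalancer.ingress[0].ports[*].port}')"
}
```

An empty `hostname=` or `ports=` line points at the proxy not being provisioned yet, which is the cue to move on to the proxy pod logs.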

Proxy cannot reach Traefik backend

kubectl -n traefik get svc traefik -o yaml
kubectl -n tailscale exec -it <proxy-pod-name> -- sh
# inside the proxy pod shell:
curl -k https://traefik.traefik.svc.cluster.local:443/ping

Check:

traefik service still exposes port 443
no NetworkPolicy blocks traffic from tailscale namespace to traefik namespace
Traefik itself is healthy and answers on /ping through port 443
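
The same probe can be run non-interactively so the result is easy to paste into an incident log. This sketch assumes `curl` is available in the proxy image, as the interactive steps above already do; `probe_traefik_from_pod` is a hypothetical helper name.

```shell
# Probe Traefik's /ping endpoint from inside a proxy pod.
# $1 = proxy pod name in the tailscale namespace
probe_traefik_from_pod() {
  pod=$1
  # -w prints only the HTTP status code; any exec or curl failure maps to 000
  code=$(kubectl -n tailscale exec "$pod" -- \
    curl -k -s -o /dev/null -w '%{http_code}' --max-time 5 \
    https://traefik.traefik.svc.cluster.local:443/ping) || code=000
  if [ "$code" = "200" ]; then
    echo "traefik reachable from $pod"
  else
    echo "traefik probe from $pod failed (http_code=$code)" >&2
    return 1
  fi
}
```

A result of `000` means no HTTP response came back at all, which fits a NetworkPolicy block or a missing service port better than an unhealthy Traefik.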

Funnel enabled but not reachable from the public internet

kubectl -n tailscale logs <proxy-pod-name> --tail=200
kubectl -n traefik get ingress traefik-tailscale-funnel -o wide

Check:

tailnet nodeAttrs grant funnel to tag:k8s
HTTPS and MagicDNS are enabled on the tailnet
the generated ts.net hostname is the one used by Cloudflare upstream configuration

Cloudflare hostname still fails after Funnel is healthy

Check outside the cluster:

the public DNS record is a proxied CNAME to ozilab-edge..ts.net
Cloudflare SSL mode is Full (strict)
the upstream TLS SNI is ozilab-edge..ts.net when your plan supports SNI override
the upstream HTTP Host header stays on the public hostname so Traefik routing still matches

Interpret symptoms carefully:

A 526 usually means Cloudflare is validating the wrong hostname or receiving the wrong certificate chain from the origin path.
A Traefik 404 or default backend response usually means Cloudflare reached the Funnel proxy but sent the wrong Host header upstream.
A 522 or 523 usually means the ts.net target is wrong, not yet ready, or not reachable from Cloudflare.
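
For quick triage, the symptom list above can be encoded as a lookup. This is only a restatement of this runbook's guidance as a sketch; `interpret_edge_status` is a hypothetical helper name and the messages are not official Cloudflare definitions.

```shell
# Map an HTTP status seen at the public hostname to this runbook's likely cause.
# $1 = HTTP status code
interpret_edge_status() {
  case "$1" in
    526)
      echo "526: Cloudflare is validating the wrong hostname or got the wrong certificate chain" ;;
    522|523)
      echo "$1: ts.net target is wrong, not yet ready, or not reachable from Cloudflare" ;;
    404)
      echo "404: Funnel proxy was reached; check the Host header sent upstream" ;;
    *)
      echo "$1: not covered by this runbook" ;;
  esac
}
```

For example, `interpret_edge_status 523` points at the CNAME target, while `interpret_edge_status 404` points at the Host header.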

Request mapping to preserve:

Layer	Expected value
Public hostname	app.example.com
Cloudflare proxied CNAME target	ozilab-edge..ts.net
Upstream TLS SNI	ozilab-edge..ts.net
Upstream HTTP Host header	app.example.com
Traefik route matcher	app.example.com
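
The mapping in the table can be reproduced from a workstation, which helps separate a Funnel or Traefik problem from a Cloudflare configuration problem. The sketch below assumes `curl` is installed locally; `probe_funnel_with_host` is a hypothetical helper name, and both hostnames are passed in by the caller rather than hard-coded.

```shell
# Send a request the way the table describes: TLS SNI set to the ts.net name,
# HTTP Host header set to the public hostname.
# $1 = generated ts.net hostname, $2 = public hostname (for example app.example.com)
probe_funnel_with_host() {
  ts_host=$1
  public_host=$2
  # curl takes SNI from the URL host; -H overrides only the Host header.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Host: $public_host" "https://$ts_host/") || code=000
  echo "sni=$ts_host host=$public_host status=$code"
}
```

If this direct probe returns the application's normal status but the public hostname still fails, the remaining suspects are on the Cloudflare side (CNAME target, SSL mode, SNI override, Host header).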

3. Disaster Recovery

Preconditions

Identify whether the outage is Tailscale auth, Funnel exposure, Traefik backend reachability, or Cloudflare origin routing.
Verify that the previous cloudflared manifests are still available for rollback.
Verify the Tailscale admin console still has the operator and proxy devices registered.

Stateful workload recovery

This service is stateless. Recovery is configuration-focused:

Restore valid OAuth credentials.
Restore tailnet tag and Funnel policy.
Reapply the ozilab overlay.
Validate the generated ts.net hostname directly.
Re-enable Cloudflare origin routing to the Tailscale endpoint without changing the application Host header.

Cluster rebuild dependency order

Namespace tailscale with privileged PSA labels
Tailscale operator deployment and CRDs
Traefik service availability on port 443
Funnel ingress resource in the traefik namespace
Cloudflare DNS and origin configuration
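
During a rebuild, the cluster-side part of this order can be walked with a short script that stops at the first missing dependency. This is a sketch with hypothetical helper names (`require`, `check_rebuild_order`). It checks existence only: it does not verify PSA labels, CRDs, the Traefik port, or anything in Cloudflare.

```shell
# Run one kubectl existence check; report the result and fail if it is missing.
# $1 = human-readable label, remaining args = kubectl arguments
require() {
  label=$1
  shift
  if kubectl "$@" >/dev/null 2>&1; then
    echo "ok: $label"
  else
    echo "missing: $label" >&2
    return 1
  fi
}

# Walk the dependency order; && stops the chain at the first missing item.
check_rebuild_order() {
  require "namespace tailscale" get namespace tailscale &&
  require "operator deployment" -n tailscale get deploy operator &&
  require "tailscale IngressClass" get ingressclass tailscale &&
  require "traefik service" -n traefik get svc traefik &&
  require "funnel ingress" -n traefik get ingress traefik-tailscale-funnel
}
```

The last `ok:` line printed shows how far the rebuild has progressed, and the `missing:` line on stderr names the next item to restore.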

4. Scaling and Resource Management

Preferred path: change the overlay values in Git and reconcile through Fleet.

Use these commands to size the problem before changing resources:

kubectl -n tailscale top pod
kubectl -n tailscale get deploy operator -o yaml
kubectl -n tailscale describe deploy operator

Record:

whether the operator is CPU or memory constrained
whether additional ingress redundancy is needed via ProxyGroup later
whether the corporate network path is forcing DERP relay and reducing throughput

5. Maintenance Procedures

Rotate the Tailscale OAuth client.
Update the local tailscale-operator/overlays/ozilab/.operator-oauth.env file after rotation.
Review tailnet ACLs and Funnel nodeAttrs.
Verify the Cloudflare CNAME and origin rule still point to the active ts.net hostname.
Test direct access to the Funnel hostname and proxied access through Cloudflare after upgrades.

For each task, define:

Preconditions: valid tailnet admin access and Git access
Impact window: short ingress interruption if the proxy device is recreated
Rollback path: revert the manifest change or temporarily switch back to cloudflared
Validation steps: direct ts.net access, then Cloudflare-hosted domain access

6. Rollback Strategy

Rollback path for a failed migration:

Re-add cloudflared to fleet/layer7/gitrepo-ozilab.yaml.
Remove tailscale-operator from the same GitRepo path list.
Delete the traefik-tailscale-funnel ingress after confirming cloudflared is healthy.
Repoint Cloudflare upstream to the previous cloudflared path.

7. Post-Incident Actions

After recovery, always:

Update CHANGELOG.md if the edge path was manually switched or credentials were rotated.
Update the service page if Cloudflare integration details or Tailscale policy requirements changed.
Update this runbook if the failure mode was not covered.
Capture follow-up work if a ProxyGroup-based HA design becomes necessary.