tailscale-operator Runbook

Metadata

Field                 Value
Service               tailscale-operator
Criticality           Tier 1
Owner                 Platform / Networking owner
Namespace             tailscale
Clusters              ozilab
Last validated        2026-04-22
Related service page  ../services/tailscale-operator.md

Trigger Conditions

  • Public ozilab routes stop responding after the cutover from cloudflared to Tailscale Funnel.
  • The generated ts.net hostname does not resolve or does not answer on 443.
  • The operator fails to join the tailnet.
  • Funnel is enabled in manifests but not actually advertised publicly.

1. Health Checks

Use these commands first to establish scope.

kubectl -n tailscale get deploy,pod,svc,secret
kubectl -n tailscale get secret operator-oauth
kubectl -n tailscale logs deploy/operator --tail=200
kubectl -n traefik get ingress traefik-tailscale-funnel -o wide
kubectl -n traefik describe ingress traefik-tailscale-funnel
kubectl get ingressclass tailscale

Probe verification

The operator itself is a standard deployment. The generated ingress proxy pods are created dynamically in the tailscale namespace.

kubectl -n tailscale get pods -o wide
kubectl -n tailscale describe pod <operator-or-proxy-pod>

Record:

  • whether the operator is healthy but the ingress proxy is missing
  • whether the ingress has a hostname but no advertised port 443 yet
  • whether the issue is cluster-side or Cloudflare-side
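
The second item can be read directly from the ingress status. The following is a minimal sketch, assuming the operator publishes the hostname and ports under `.status.loadBalancer.ingress[0]` (the standard Kubernetes Ingress status shape). `classify_ingress_status` and `funnel_ingress_status` are hypothetical helper names, not part of kubectl or the operator.

```shell
# Hypothetical helper: classify the Funnel ingress status.
# $1 = hostname from the ingress status (may be empty)
# $2 = space-separated list of advertised ports (may be empty)
classify_ingress_status() {
  host="$1"
  ports="$2"
  if [ -z "$host" ]; then
    echo "no-hostname"
    return 0
  fi
  # Word splitting on $ports is intentional: one port per iteration.
  for p in $ports; do
    if [ "$p" = "443" ]; then
      echo "ready"
      return 0
    fi
  done
  echo "hostname-without-443"
}

# Reads the live values with kubectl and classifies them.
# Assumption: status lives at .status.loadBalancer.ingress[0].
funnel_ingress_status() {
  host=$(kubectl -n traefik get ingress traefik-tailscale-funnel \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null)
  ports=$(kubectl -n traefik get ingress traefik-tailscale-funnel \
    -o jsonpath='{.status.loadBalancer.ingress[0].ports[*].port}' 2>/dev/null)
  classify_ingress_status "$host" "$ports"
}
```

Calling `funnel_ingress_status` against the cluster prints one of `no-hostname`, `hostname-without-443`, or `ready`, which maps onto the first two items to record.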

2. Troubleshooting Workflows

Operator fails to authenticate or create proxies

kubectl -n tailscale logs deploy/operator --tail=400
kubectl -n tailscale get secret operator-oauth -o yaml
kubectl -n tailscale describe deploy operator

Check:

  • operator-oauth exists in namespace tailscale
  • operator-oauth contains client_id and client_secret keys
  • OAuth client has Devices Core, Auth Keys, and Services write scopes
  • tag ownership in the tailnet policy allows tag:k8s-operator and tag:k8s
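
The first two checks can be made without printing the secret values to the terminal. This is a sketch under that assumption only; `missing_oauth_keys` and `list_oauth_keys` are hypothetical helper names introduced for illustration.

```shell
# Hypothetical helper: reads secret key names on stdin (one per line) and
# prints each required key that is absent. Prints nothing when both exist.
missing_oauth_keys() {
  keys=$(cat)
  for required in client_id client_secret; do
    printf '%s\n' "$keys" | grep -qx "$required" || echo "$required"
  done
}

# Lists only the key names stored in operator-oauth, never the values.
list_oauth_keys() {
  kubectl -n tailscale get secret operator-oauth \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}
```

Running `list_oauth_keys | missing_oauth_keys` should print nothing when the secret is complete. The OAuth scopes and tag ownership still have to be verified in the Tailscale admin console.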

Ingress exists but the proxy does not expose traffic

kubectl -n traefik get ingress traefik-tailscale-funnel -o yaml
kubectl -n tailscale get pod --selector=tailscale.com/parent-resource-type=ingress,tailscale.com/parent-resource=traefik-tailscale-funnel,tailscale.com/parent-resource-ns=traefik
kubectl -n tailscale logs <proxy-pod-name> --tail=200

Check:

  • ingressClassName is tailscale
  • tailscale.com/funnel annotation is present
  • ingress status contains the generated hostname and port 443
  • proxy logs do not show certificate provisioning or policy errors
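
The first two checks can be scripted against the ingress definition. This is a rough sketch: `funnel_spec_problems` and `funnel_spec_check` are hypothetical helper names, and the check only confirms that the annotation is present, not that its value is correct.

```shell
# Hypothetical helper: report problems in the Funnel ingress definition.
# $1 = value of spec.ingressClassName
# $2 = value of the tailscale.com/funnel annotation (empty if absent)
funnel_spec_problems() {
  [ "$1" = "tailscale" ] || echo "ingressClassName is '$1', expected 'tailscale'"
  [ -n "$2" ] || echo "tailscale.com/funnel annotation is missing"
  return 0
}

# Reads both fields from the live ingress. The dot in the annotation key is
# escaped so kubectl's JSONPath treats it as part of the key name.
funnel_spec_check() {
  class=$(kubectl -n traefik get ingress traefik-tailscale-funnel \
    -o jsonpath='{.spec.ingressClassName}')
  funnel=$(kubectl -n traefik get ingress traefik-tailscale-funnel \
    -o jsonpath='{.metadata.annotations.tailscale\.com/funnel}')
  funnel_spec_problems "$class" "$funnel"
}
```

An empty result from `funnel_spec_check` means the definition looks right, so the next places to look are the ingress status and the proxy logs.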

Proxy cannot reach Traefik backend

kubectl -n traefik get svc traefik -o yaml
kubectl -n tailscale exec -it <proxy-pod-name> -- sh
# Run the next command inside the proxy pod shell:
curl -k https://traefik.traefik.svc.cluster.local:443/ping

Check:

  • traefik service still exposes port 443
  • no NetworkPolicy blocks traffic from tailscale namespace to traefik namespace
  • Traefik itself is healthy and answers on /ping through port 443

Funnel enabled but not reachable from the public internet

kubectl -n tailscale logs <proxy-pod-name> --tail=200
kubectl -n traefik get ingress traefik-tailscale-funnel -o wide

Check:

  • tailnet nodeAttrs grant funnel to tag:k8s
  • HTTPS and MagicDNS are enabled on the tailnet
  • the generated ts.net hostname is the one used by Cloudflare upstream configuration

Cloudflare hostname still fails after Funnel is healthy

Check outside the cluster:

  • the public DNS record is a proxied CNAME to ozilab-edge..ts.net
  • Cloudflare SSL mode is Full (strict)
  • the upstream TLS SNI is ozilab-edge..ts.net when your plan supports SNI override
  • the upstream HTTP Host header stays on the public hostname so Traefik routing still matches

Interpret symptoms carefully:

  • A 526 usually means Cloudflare is validating the wrong hostname or receiving the wrong certificate chain from the origin path.
  • A Traefik 404 or default-backend response usually means Cloudflare reached the Funnel proxy but sent the wrong Host header upstream.
  • A 522 or 523 usually means the ts.net target is wrong, not yet ready, or not reachable from Cloudflare.
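
The symptom list above can be kept as a small lookup for triage notes. This is only a sketch of that mapping: `likely_fault` is a hypothetical helper name, and each result is the usual cause, not a confirmed diagnosis.

```shell
# Hypothetical helper: map an HTTP status seen at the public hostname to the
# usual fault domain described in this runbook.
# $1 = HTTP status code
likely_fault() {
  case "$1" in
    526)     echo "tls: Cloudflare is validating the wrong hostname or certificate chain" ;;
    404)     echo "host-header: request reached the Funnel proxy with the wrong Host header" ;;
    522|523) echo "origin: ts.net target is wrong, not ready, or unreachable from Cloudflare" ;;
    *)       echo "unclassified: not covered by this runbook" ;;
  esac
}
```

For example, `likely_fault 526` points at the SNI and SSL mode settings, while `likely_fault 404` points at the upstream Host header.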

Request mapping to preserve:

Layer                            Expected value
Public hostname                  app.example.com
Cloudflare proxied CNAME target  ozilab-edge..ts.net
Upstream TLS SNI                 ozilab-edge..ts.net
Upstream HTTP Host header        app.example.com
Traefik route matcher            app.example.com
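
The mapping above can be reproduced from a workstation to test the Funnel path without Cloudflare in between. In this sketch curl takes the TLS SNI from the URL hostname while the `Host` header is overridden separately. `probe_funnel_origin` is a hypothetical helper name, and both arguments are values you supply.

```shell
# Hypothetical helper: request the ts.net endpoint directly while keeping the
# public Host header, and print only the HTTP status code.
# $1 = generated ts.net hostname (used for DNS resolution and TLS SNI)
# $2 = public hostname that the Traefik route matcher expects
probe_funnel_origin() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    -H "Host: $2" \
    "https://$1/"
}
```

If the application's expected status comes back here but the public hostname still fails, the fault is more likely in the Cloudflare configuration than in the cluster.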

3. Disaster Recovery

Preconditions

  • Identify whether the outage is Tailscale auth, Funnel exposure, Traefik backend reachability, or Cloudflare origin routing.
  • Verify that the previous cloudflared manifests are still available for rollback.
  • Verify the Tailscale admin console still has the operator and proxy devices registered.

Stateful workload recovery

This service is stateless. Recovery is configuration-focused:

  1. Restore valid OAuth credentials.
  2. Restore tailnet tag and Funnel policy.
  3. Reapply the ozilab overlay.
  4. Validate the generated ts.net hostname directly.
  5. Re-enable Cloudflare origin routing to the Tailscale endpoint without changing the application Host header.

Cluster rebuild dependency order

  1. Namespace tailscale with privileged PSA labels
  2. Tailscale operator deployment and CRDs
  3. Traefik service availability on port 443
  4. Funnel ingress resource in the traefik namespace
  5. Cloudflare DNS and origin configuration
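
The cluster-side part of this order can be walked with a short existence check. This is a sketch only: `check_rebuild_order` is a hypothetical helper name, and it checks only that the objects exist. It does not verify the PSA labels, the CRDs, or that port 443 is exposed.

```shell
# Hypothetical helper: walk the cluster-side dependencies in rebuild order and
# stop at the first object that is missing.
check_rebuild_order() {
  kubectl get namespace tailscale >/dev/null 2>&1 \
    || { echo "missing: namespace tailscale"; return 1; }
  kubectl -n tailscale get deploy operator >/dev/null 2>&1 \
    || { echo "missing: deployment tailscale/operator"; return 1; }
  kubectl -n traefik get svc traefik >/dev/null 2>&1 \
    || { echo "missing: service traefik/traefik"; return 1; }
  kubectl -n traefik get ingress traefik-tailscale-funnel >/dev/null 2>&1 \
    || { echo "missing: ingress traefik/traefik-tailscale-funnel"; return 1; }
  echo "cluster-side objects present; check Cloudflare DNS and origin configuration next"
}
```

The first `missing:` line shows where in the dependency order to resume the rebuild.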

4. Scaling and Resource Management

Preferred path: change the overlay values in Git and reconcile through Fleet.

Use these commands to size the problem before changing resources:

kubectl -n tailscale top pod
kubectl -n tailscale get deploy operator -o yaml
kubectl -n tailscale describe deploy operator

Record:

  • whether the operator is CPU or memory constrained
  • whether ingress redundancy through a ProxyGroup will be needed later
  • whether the corporate network path is forcing DERP relay and reducing throughput

5. Maintenance Procedures

  • Rotate the Tailscale OAuth client.
  • Update the local tailscale-operator/overlays/ozilab/.operator-oauth.env file after rotation.
  • Review tailnet ACLs and Funnel nodeAttrs.
  • Verify the Cloudflare CNAME and origin rule still point to the active ts.net hostname.
  • Test direct access to the Funnel hostname and proxied access through Cloudflare after upgrades.

For each task, define:

  • Preconditions: valid tailnet admin access and Git access
  • Impact window: short ingress interruption if the proxy device is recreated
  • Rollback path: revert the manifest change or temporarily switch back to cloudflared
  • Validation steps: direct ts.net access, then Cloudflare-hosted domain access

6. Rollback Strategy

Rollback path for a failed migration:

  1. Re-add cloudflared to fleet/layer7/gitrepo-ozilab.yaml.
  2. Remove tailscale-operator from the same GitRepo path list.
  3. Delete the traefik-tailscale-funnel ingress after confirming cloudflared is healthy.
  4. Repoint Cloudflare upstream to the previous cloudflared path.

7. Post-Incident Actions

After recovery, always:

  1. Update CHANGELOG.md if the edge path was manually switched or credentials were rotated.
  2. Update the service page if Cloudflare integration details or Tailscale policy requirements changed.
  3. Update this runbook if the failure mode was not covered.
  4. Capture follow-up work if a ProxyGroup-based HA design becomes necessary.