metallb Runbook

Metadata

Field                 Value
Service               metallb
Criticality           Tier 0
Owner                 Platform / Networking owner
Namespace             metallb-system
Clusters              homelab, local, jls
Last validated        2026-04-17
Related service page  ../services/metallb.md

Trigger Conditions

  • Services of type LoadBalancer remain pending without an external IP.
  • A previously reachable LoadBalancer IP stops responding after a rollout or node event.
  • Traefik or another edge service loses its advertised address.

1. Health Checks

Use these commands first to establish scope.

kubectl -n metallb-system get pods -o wide
kubectl -n metallb-system get ipaddresspools,l2advertisements
kubectl get svc -A | grep LoadBalancer
kubectl -n metallb-system logs deploy/controller --tail=200
kubectl -n metallb-system logs daemonset/speaker --tail=200

Probe verification

MetalLB health is determined by controller and speaker readiness plus successful address assignment.

kubectl -n metallb-system describe pod -l app.kubernetes.io/name=metallb
kubectl get svc -A | grep LoadBalancer

Record:

  • whether the controller or speaker pods are failing
  • whether the service event stream shows allocation failures
  • whether the issue affects one pool, one service, or the whole cluster
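
To answer the scope question quickly, it helps to list only the LoadBalancer services that still lack an address. The following is a minimal sketch, assuming the default column layout of kubectl get svc -A (TYPE in column 3, EXTERNAL-IP in column 5). The function name pending_lb_services is hypothetical and not part of any existing tooling.

```shell
# List LoadBalancer services that have no external IP yet.
# Reads `kubectl get svc -A` output on stdin and prints namespace/name.
# Assumes the default columns: NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP ...
pending_lb_services() {
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}

# Usage (needs cluster access):
#   kubectl get svc -A | pending_lb_services
```

An empty result while traffic is still failing points away from allocation and toward the announcement path covered in the troubleshooting workflows.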

2. Troubleshooting Workflows

LoadBalancer service is stuck in Pending

kubectl -n <namespace> describe svc <service-name>
kubectl -n metallb-system get ipaddresspool metallb-pool -o yaml
kubectl -n metallb-system logs deploy/controller --tail=200

Check:

  • pool contains a free address
  • requested loadBalancerIP falls inside the address range of the active pool
  • service annotations or requested address do not conflict with the overlay pool
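
The check that a requested address belongs to the pool can be done by hand, but a small helper avoids off-by-one mistakes. This is a sketch under stated assumptions: the pool is written as a start-end IPv4 range (a CIDR entry would need converting to a range first), octets have no leading zeros, and the function names and addresses shown are illustrative only.

```shell
# Convert a dotted IPv4 address to an integer (octets without leading zeros).
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if address $1 lies inside the "start-end" range $2 (inclusive).
ip_in_range() (
  ip=$(ip_to_int "$1")
  start=$(ip_to_int "${2%-*}")
  end=$(ip_to_int "${2#*-}")
  [ "$ip" -ge "$start" ] && [ "$ip" -le "$end" ]
)

# Usage with hypothetical addresses:
#   ip_in_range 192.168.1.245 192.168.1.240-192.168.1.250 && echo "inside pool"
```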

Address is assigned but traffic does not arrive

kubectl -n metallb-system logs daemonset/speaker --tail=200
kubectl get nodes -o wide
kubectl -n <namespace> get endpoints <service-name>

Check:

  • speaker pods are present on the nodes expected to announce the address
  • the advertised address still belongs to the L2 network segment that the announcing nodes are attached to
  • backend endpoints are healthy and the problem is not in the consumer service itself
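
The speaker placement check can be made mechanical by comparing the node list with the nodes that actually run a speaker pod. A minimal sketch, assuming speaker pods are named with the speaker- prefix (as implied by daemonset/speaker) and that both inputs were captured with custom-columns so each line has a fixed shape. The function and file names are hypothetical.

```shell
# Print nodes that have no speaker pod scheduled on them.
# $1: file with "pod-name node-name" lines
# $2: file with one node name per line
nodes_without_speaker() {
  awk 'FILENAME == ARGV[1] { if ($1 ~ /^speaker-/) has[$2] = 1; next }
       !($1 in has) { print $1 }' "$1" "$2"
}

# Usage (needs cluster access):
#   kubectl -n metallb-system get pods --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName > pods.txt
#   kubectl get nodes --no-headers \
#     -o custom-columns=NAME:.metadata.name > nodes.txt
#   nodes_without_speaker pods.txt nodes.txt
```

A node in the output is not automatically a fault: taints or node selectors may exclude it on purpose. It only matters if that node was expected to announce the address.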

Overlay change introduced the wrong pool or version skew

kubectl -n metallb-system get ipaddresspools,l2advertisements -o yaml
git diff HEAD~1 -- metallb/

Check:

  • the active cluster overlay still points at the intended config.yaml
  • prod legacy overlay did not accidentally diverge further from the shared base
  • no overlapping or invalid CIDR or range was introduced
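
Overlap between two pools is easy to miss when reading a diff. The sketch below tests two start-end IPv4 ranges for a shared address. It assumes both pools are written as ranges (convert CIDR entries first) and uses illustrative addresses. The helper names are hypothetical.

```shell
# Convert a dotted IPv4 address to an integer (octets without leading zeros).
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if the two "start-end" ranges share at least one address.
ranges_overlap() (
  a_start=$(ip_to_int "${1%-*}"); a_end=$(ip_to_int "${1#*-}")
  b_start=$(ip_to_int "${2%-*}"); b_end=$(ip_to_int "${2#*-}")
  [ "$a_start" -le "$b_end" ] && [ "$b_start" -le "$a_end" ]
)

# Usage with hypothetical ranges:
#   ranges_overlap 192.168.1.240-192.168.1.250 192.168.1.250-192.168.1.254 \
#     && echo "pools overlap"
```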

3. Disaster Recovery

Preconditions

  • Confirm whether the incident is configuration-only or a broader node or network failure.
  • Verify which overlay owns the active cluster pool.
  • Confirm that the address block is still routable on the target network.

Stateful workload recovery

MetalLB is effectively stateless from the repository perspective: its configuration lives in the overlays, so recovery is a reapply rather than a data restore.

  1. Reapply the correct overlay.
  2. Confirm controller and speaker pods become Ready.
  3. Verify IPAddressPool and L2Advertisement objects exist.
  4. Re-check affected LoadBalancer services for external IP assignment.
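
The four steps above can be sketched as one function. This assumes the overlays are Kustomize directories applied with kubectl apply -k, which this runbook does not state explicitly. The overlay path is passed in by the operator, and the 120s timeout is an arbitrary starting value.

```shell
# Reapply an overlay and wait for MetalLB to come back.
# $1: path to the overlay directory (supply your own; not defined here).
# Assumes a Kustomize overlay; swap `apply -k` for `apply -f` otherwise.
recover_metallb() {
  [ -n "$1" ] || { echo "usage: recover_metallb <overlay-dir>" >&2; return 2; }
  kubectl apply -k "$1" &&
  kubectl -n metallb-system rollout status deploy/controller --timeout=120s &&
  kubectl -n metallb-system rollout status daemonset/speaker --timeout=120s &&
  kubectl -n metallb-system get ipaddresspools,l2advertisements &&
  kubectl get svc -A | grep LoadBalancer
}

# Usage (hypothetical path):
#   recover_metallb metallb/overlays/example
```

The chain stops at the first failing step, so the last command printed shows where recovery stalled.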

Cluster rebuild dependency order

  1. Node networking and L2 connectivity
  2. MetalLB controller and speaker
  3. IPAddressPool and L2Advertisement resources
  4. Consumer LoadBalancer services such as Traefik

4. Scaling and Resource Management

Preferred path: use upstream controller defaults unless sustained scale symptoms appear.

Use these commands to size the problem before changing resources:

kubectl -n metallb-system top pod
kubectl -n metallb-system get deploy,daemonset
kubectl -n metallb-system describe deploy controller

Record:

  • whether the controller is CPU bound
  • whether speaker scheduling is failing because of node constraints
  • whether scale changes are necessary or the real issue is just bad pool configuration
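
To judge whether a pod is CPU bound, the kubectl top output can be filtered against a threshold. A minimal sketch, assuming metrics-server is installed (kubectl top depends on it) and that CPU is reported in millicores such as 450m. The function name and the threshold value are illustrative, not recommended limits.

```shell
# Print pods whose CPU usage is at or above a millicore threshold.
# Reads `kubectl top pod --no-headers` output on stdin; $1 is the threshold.
pods_over_cpu() {
  awk -v limit="$1" '{
    cpu = $2
    # "450m" is millicores; a bare number is treated as whole cores.
    if (cpu ~ /m$/) { sub(/m$/, "", cpu) } else { cpu = cpu * 1000 }
    if (cpu + 0 >= limit + 0) print $1, $2
  }'
}

# Usage (needs cluster access and metrics-server):
#   kubectl -n metallb-system top pod --no-headers | pods_over_cpu 400
```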

5. Maintenance Procedures

  • Review and document every pool change before rollout.
  • Validate that single-address pools still match the intended public or LAN IP.
  • Reconcile the legacy prod overlay version with the shared base during future maintenance.

For each task, define:

  • Preconditions: identify the target cluster and current pool owner
  • Impact window: LoadBalancer services may flap while pools change
  • Rollback path: restore previous config.yaml and reapply overlay
  • Validation steps: affected LoadBalancer services receive the expected IP again
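
The validation step can be scripted so that each affected service is compared against the address it is supposed to hold. This sketch reads the first ingress IP from the service status. The function name, service, and address in the usage line are hypothetical.

```shell
# Succeed if service $2 in namespace $1 currently holds external IP $3.
svc_has_ip() {
  actual=$(kubectl -n "$1" get svc "$2" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  if [ "$actual" = "$3" ]; then
    echo "ok: $1/$2 has $3"
  else
    echo "mismatch: $1/$2 has '${actual}', expected $3"
    return 1
  fi
}

# Usage with hypothetical values:
#   svc_has_ip traefik traefik 192.168.1.240
```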

6. Rollback Strategy

Document the fastest safe rollback path:

  • Revert the affected metallb/* overlay revision.
  • Reapply the previous config.yaml.
  • Restart or recheck the affected consumer services only if they keep stale loadBalancerIP expectations.

7. Post-Incident Actions

After recovery, always:

  1. Update CHANGELOG.md if addresses were reallocated manually or an emergency pool change was applied.
  2. Update the MetalLB service page if pool ranges, cluster mappings, or version baselines changed.
  3. Update this runbook if the incident exposed missing L2, pool, or service-allocation checks.
  4. Capture follow-up work to remove legacy prod version skew.