metallb Runbook

Metadata

Field	Value
Service	metallb
Criticality	Tier 0
Owner	Platform / Networking owner
Namespace	metallb-system
Clusters	homelab, local, jls
Last validated	2026-04-17
Related service page	../services/metallb.md

Trigger Conditions

Services of type LoadBalancer remain pending without an external IP.
A previously reachable LoadBalancer IP stops responding after a rollout or node event.
Traefik or another edge service loses its advertised address.

1. Health Checks

Use these commands first to establish scope.

kubectl -n metallb-system get pods -o wide
kubectl -n metallb-system get ipaddresspools,l2advertisements
kubectl get svc -A | grep LoadBalancer
kubectl -n metallb-system logs deploy/controller --tail=200
kubectl -n metallb-system logs daemonset/speaker --tail=200

Probe verification

MetalLB health is determined by controller and speaker readiness plus successful address assignment.

kubectl -n metallb-system describe pod -l app.kubernetes.io/name=metallb
kubectl get svc -A | grep LoadBalancer

Record:

whether the controller or speaker pods are failing
whether the service event stream shows allocation failures
whether the issue affects one pool, one service, or the whole cluster
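To narrow the `grep LoadBalancer` output down to services that are still waiting for an address, a small filter can help. This is a minimal sketch, assuming the default column layout of `kubectl get svc -A --no-headers` (namespace, name, type, cluster IP, external IP). `pending_lb_services` is a hypothetical helper name, not part of kubectl or MetalLB.

```shell
# Hypothetical helper: reads `kubectl get svc -A --no-headers` output on stdin
# and prints namespace/name for LoadBalancer services without an external IP.
# Assumes the default columns: NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP ...
pending_lb_services() {
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}

# Example against a live cluster:
#   kubectl get svc -A --no-headers | pending_lb_services
```

An empty result while clients still cannot connect suggests that allocation works and the problem sits in announcement or in the backend instead.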

2. Troubleshooting Workflows

LoadBalancer service is stuck in Pending

kubectl -n <namespace> describe svc <service-name>
kubectl -n metallb-system get ipaddresspool metallb-pool -o yaml
kubectl -n metallb-system logs deploy/controller --tail=200

Check:

the pool still has a free address
any requested loadBalancerIP falls within the active pool's range
service annotations or the requested address do not conflict with the pool defined by the overlay
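The range check above can be done by hand, but a short sketch may be quicker when a pool is written as a `start-end` range. It assumes plain dotted-quad IPv4 addresses without leading zeros. The helper names and the example addresses are hypothetical, and pools expressed as CIDR blocks are not covered.

```shell
# Convert an IPv4 address such as 192.168.1.240 to a single integer.
# The body runs in a subshell so the IFS change does not leak to the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed when address $1 lies between $2 (range start) and $3 (range end).
ip_in_range() {
  ip=$(ip_to_int "$1")
  lo=$(ip_to_int "$2")
  hi=$(ip_to_int "$3")
  [ "$ip" -ge "$lo" ] && [ "$ip" -le "$hi" ]
}

# Example with made-up addresses:
#   ip_in_range 192.168.1.245 192.168.1.240 192.168.1.250 && echo "inside pool"
```

Take the range boundaries from the `spec.addresses` field shown by the `get ipaddresspool ... -o yaml` command above.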

Address is assigned but traffic does not arrive

kubectl -n metallb-system logs daemonset/speaker --tail=200
kubectl get nodes -o wide
kubectl -n <namespace> get endpoints <service-name>

Check:

speaker pods are present on the nodes expected to announce the address
the advertised address still belongs to the L2 network segment the announcing nodes are attached to
backend endpoints are healthy and the problem is not in the consumer service itself
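To compare the node list with the nodes that actually host a speaker pod, something like the following could be used. It is a sketch under assumptions: `nodes_without_speaker` is a hypothetical helper, and the label selector in the usage comment may differ depending on how MetalLB was installed.

```shell
# Hypothetical helper: print nodes that do not host a speaker pod.
# $1: space-separated list of all node names
# $2: space-separated list of node names that run a speaker pod
nodes_without_speaker() {
  speaker_nodes=" $2 "
  for node in $1; do
    case "$speaker_nodes" in
      *" $node "*) ;;        # a speaker is scheduled on this node
      *) echo "$node" ;;     # no speaker found for this node
    esac
  done
}

# Example against a live cluster (the label selector is an assumption;
# adjust it to the labels your installation puts on speaker pods):
#   all=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
#   spk=$(kubectl -n metallb-system get pods -l app.kubernetes.io/component=speaker \
#           -o jsonpath='{.items[*].spec.nodeName}')
#   nodes_without_speaker "$all" "$spk"
```

A node in the output is not necessarily a fault: speakers can be kept off nodes on purpose through taints or node selectors, so compare the result with the nodes expected to announce.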

Overlay change introduced the wrong pool or version skew

kubectl -n metallb-system get ipaddresspools,l2advertisements -o yaml
git diff HEAD~1 -- metallb/

Check:

the active cluster overlay still points at the intended config.yaml
the legacy prod overlay has not accidentally diverged further from the shared base
no overlapping or invalid CIDR or address range was introduced
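For the overlap check, a small sketch can compare two pool ranges taken from the YAML output. It assumes both pools are written as `start-end` IPv4 ranges without leading zeros. The helper names are hypothetical and CIDR notation is not handled.

```shell
# Convert a dotted-quad IPv4 address to an integer. Runs in a subshell so
# the IFS change does not leak. Assumes no leading zeros in the octets.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed when two "start-end" ranges share at least one address.
# $1, $2: ranges such as 192.168.1.240-192.168.1.250
ranges_overlap() {
  a_lo=$(ip_to_int "${1%-*}"); a_hi=$(ip_to_int "${1#*-}")
  b_lo=$(ip_to_int "${2%-*}"); b_hi=$(ip_to_int "${2#*-}")
  [ "$a_lo" -le "$b_hi" ] && [ "$b_lo" -le "$a_hi" ]
}

# Example with made-up ranges:
#   ranges_overlap 192.168.1.240-192.168.1.250 192.168.1.250-192.168.1.254 \
#     && echo "pools overlap"
```

Two ranges overlap exactly when each one starts no later than the other ends, which is the comparison the last line of `ranges_overlap` performs.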

3. Disaster Recovery

Preconditions

Confirm whether the incident is configuration-only or a broader node or network failure.
Verify which overlay owns the active cluster pool.
Confirm that the address block is still routable on the target network.

Stateful workload recovery

MetalLB is effectively stateless: its configuration is defined in the repository, so recovery is a matter of reapplying it.

Reapply the correct overlay.
Confirm controller and speaker pods become Ready.
Verify IPAddressPool and L2Advertisement objects exist.
Re-check affected LoadBalancer services for external IP assignment.
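The four steps above can be sketched as one function. This assumes the overlay is a kustomize directory applied with `kubectl apply -k`. The function name and the overlay path are hypothetical, while the workload names match the commands used elsewhere in this runbook.

```shell
# Hypothetical helper: reapply an overlay and wait for MetalLB to settle.
# $1: path to the overlay directory (assumed to be a kustomize overlay)
recover_metallb() {
  overlay=$1
  kubectl apply -k "$overlay" || return 1
  # Block until controller and speaker report a completed rollout.
  kubectl -n metallb-system rollout status deploy/controller --timeout=120s || return 1
  kubectl -n metallb-system rollout status daemonset/speaker --timeout=120s || return 1
  # Confirm the pool and advertisement objects exist.
  kubectl -n metallb-system get ipaddresspools,l2advertisements || return 1
  # List LoadBalancer services so external IP assignment can be re-checked.
  kubectl get svc -A | grep LoadBalancer
}

# Example (the path is a placeholder for the cluster's real overlay):
#   recover_metallb metallb/overlays/<cluster>
```

If the cluster is reconciled by a GitOps controller, the apply step may be unnecessary and only the verification commands apply.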

Cluster rebuild dependency order

1. Node networking and L2 connectivity
2. MetalLB controller and speaker
3. IPAddressPool and L2Advertisement resources
4. Consumer LoadBalancer services such as Traefik

4. Scaling and Resource Management

Preferred path: keep the upstream controller defaults unless scale-related symptoms persist.

Use these commands to size the problem before changing resources:

kubectl -n metallb-system top pod
kubectl -n metallb-system get deploy,daemonset
kubectl -n metallb-system describe deploy controller

Record:

whether the controller is CPU bound
whether speaker scheduling is failing because of node constraints
whether scale changes are necessary or the real issue is just bad pool configuration
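To put a number on speaker availability before touching resources, the DaemonSet status can be compared directly. This is a sketch that assumes the DaemonSet is named `speaker`, as in the commands above. `speaker_shortfall` is a hypothetical helper name.

```shell
# Hypothetical helper: print how many speaker pods are desired but not Ready.
speaker_shortfall() {
  # Read "desired ready" from the DaemonSet status into $1 and $2.
  set -- $(kubectl -n metallb-system get daemonset speaker \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')
  [ $# -eq 2 ] || return 1   # kubectl failed or returned unexpected output
  echo $(( $1 - $2 ))
}

# Example against a live cluster:
#   speaker_shortfall
```

A non-zero result points at scheduling or readiness problems on specific nodes rather than at a need for more controller resources. Follow up with `kubectl -n metallb-system describe daemonset speaker`.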

5. Maintenance Procedures

Review and document every pool change before rollout.
Validate that single-address pools still match the intended public or LAN IP.
Reconcile the legacy prod overlay version with the shared base during future maintenance.

For each task, define:

Preconditions: identify the target cluster and current pool owner
Impact window: LoadBalancer services may flap while pools change
Rollback path: restore previous config.yaml and reapply overlay
Validation steps: affected LoadBalancer services receive the expected IP again
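Before changing a pool, it can be useful to capture the live objects so that the rollback and validation steps have something to compare against. This is a minimal sketch. The function name and the snapshot directory are hypothetical, and the repository's config.yaml remains the source of truth.

```shell
# Hypothetical helper: save the live pool and advertisement objects
# before a change so they can be compared or restored afterwards.
# $1: directory to write the snapshot into
snapshot_metallb_config() {
  mkdir -p "$1" || return 1
  out="$1/metallb-$(date +%Y%m%d-%H%M%S).yaml"
  kubectl -n metallb-system get ipaddresspools,l2advertisements -o yaml > "$out" || return 1
  echo "$out"   # print the snapshot path for later reference
}

# Example:
#   snapshot_metallb_config /tmp/metallb-snapshots
```

The export includes server-populated metadata, so treat it as a reference for comparison rather than a file to commit.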

6. Rollback Strategy

Fastest safe rollback path:

Revert the affected metallb/* overlay revision.
Reapply the previous config.yaml.
Restart or recheck the affected consumer services only if they keep stale loadBalancerIP expectations.
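The first two rollback steps can be sketched as follows. This assumes the overlay change landed as a single git commit and that the overlay is a kustomize directory. The function name, commit reference and path are hypothetical.

```shell
# Hypothetical helper: revert one overlay commit, preview the effect, reapply.
# $1: commit that introduced the bad change
# $2: path to the overlay directory (assumed to be a kustomize overlay)
rollback_metallb() {
  commit=$1
  overlay=$2
  git revert --no-edit "$commit" || return 1
  # kubectl diff exits 0 for no changes, 1 for changes, >1 on error.
  rc=0
  kubectl diff -k "$overlay" || rc=$?
  [ "$rc" -le 1 ] || return 1
  kubectl apply -k "$overlay"
}

# Example (both arguments are placeholders):
#   rollback_metallb <commit> metallb/overlays/<cluster>
```

Previewing with `kubectl diff` before applying helps confirm that the revert restores the previous pool rather than introducing a third state.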

7. Post-Incident Actions

After recovery, always:

Update CHANGELOG.md if addresses were reallocated manually or an emergency pool change was applied.
Update the MetalLB service page if pool ranges, cluster mappings, or version baselines changed.
Update this runbook if the incident exposed missing L2, pool, or service-allocation checks.
Capture follow-up work to remove legacy prod version skew.