# metallb Runbook
## Metadata
| Field | Value |
|---|---|
| Service | metallb |
| Criticality | Tier 0 |
| Owner | Platform / Networking owner |
| Namespace | metallb-system |
| Clusters | homelab, local, jls |
| Last validated | 2026-04-17 |
| Related service page | ../services/metallb.md |
## Trigger Conditions
- Services of type LoadBalancer remain pending without an external IP.
- A previously reachable LoadBalancer IP stops responding after a rollout or node event.
- Traefik or another edge service loses its advertised address.
## 1. Health Checks

Use these commands first to establish scope.

    kubectl -n metallb-system get pods -o wide
    kubectl -n metallb-system get ipaddresspools,l2advertisements
    kubectl get svc -A | grep LoadBalancer
    kubectl -n metallb-system logs deploy/controller --tail=200
    kubectl -n metallb-system logs daemonset/speaker --tail=200
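To narrow the scope quickly, the `grep LoadBalancer` output can be reduced to only the services still waiting for an address. The following is a minimal sketch rather than an established part of this runbook: `pending_lb_services` is a hypothetical helper name, and it assumes the default column order of `kubectl get svc -A --no-headers` (namespace, name, type, cluster IP, external IP).

```shell
# Reads `kubectl get svc -A --no-headers` output on stdin and prints
# namespace/name for every LoadBalancer service without an external IP.
# Assumption: kubectl shows "<pending>" in the EXTERNAL-IP column (field 5).
pending_lb_services() {
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}

# Usage (requires cluster access, so it is not executed here):
#   kubectl get svc -A --no-headers | pending_lb_services
```

An empty result suggests allocation is working and the problem is more likely on the announcement side.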
### Probe verification

MetalLB health is determined by controller and speaker readiness plus successful address assignment.

    kubectl -n metallb-system describe pod -l app.kubernetes.io/name=metallb
    kubectl get svc -A | grep LoadBalancer
Record:
- whether the controller or speaker pods are failing
- whether the service event stream shows allocation failures
- whether the issue affects one pool, one service, or the whole cluster
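The first item in the list above can be answered with a small filter over the pod listing. This is a sketch under stated assumptions: `not_ready_pods` is a hypothetical helper, and it relies on the standard `kubectl get pods --no-headers` layout where field 2 is READY (for example `1/1`) and field 3 is STATUS.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the
# name, READY and STATUS columns of every pod that is not fully ready.
not_ready_pods() {
  awk '{
    split($2, r, "/")   # READY column looks like "1/1"
    if (r[1] != r[2] || $3 != "Running") print $1, $2, $3
  }'
}

# Usage (requires cluster access, so it is not executed here):
#   kubectl -n metallb-system get pods --no-headers | not_ready_pods
```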
## 2. Troubleshooting Workflows
### LoadBalancer service is stuck in Pending

    kubectl -n <namespace> describe svc <service-name>
    kubectl -n metallb-system get ipaddresspool metallb-pool -o yaml
    kubectl -n metallb-system logs deploy/controller --tail=200
Check:
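- the pool contains at least one free address
- any requested address (spec.loadBalancerIP or the metallb.universe.tf/loadBalancerIPs annotation) falls inside the active pool
- service annotations or the requested address do not conflict with the pool defined by the cluster overlay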
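To judge whether the pool can still hold a free address, it helps to know how many addresses it contains and compare that with the number of LoadBalancer services already assigned. The sketch below is illustrative only: `ip_to_int` and `pool_size` are hypothetical helpers, they handle IPv4 only, and they report the raw size of an entry without accounting for any addresses MetalLB may be configured to skip.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print the number of addresses in one pool entry, given either a
# range ("A.B.C.D-A.B.C.E") or a CIDR ("A.B.C.D/NN").
pool_size() {
  case $1 in
    */*)
      prefix=${1#*/}
      echo $(( 1 << (32 - prefix) ))
      ;;
    *-*)
      start=$(ip_to_int "${1%-*}")
      end=$(ip_to_int "${1#*-}")
      echo $(( end - start + 1 ))
      ;;
    *)
      echo "unrecognized pool entry: $1" >&2
      return 1
      ;;
  esac
}

# Usage (the jsonpath query requires cluster access):
#   kubectl -n metallb-system get ipaddresspool metallb-pool \
#     -o jsonpath='{.spec.addresses[*]}'
#   pool_size 192.168.1.240-192.168.1.250
```

The addresses used in the usage comment are placeholders, not values from this environment.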
### Address is assigned but traffic does not arrive

    kubectl -n metallb-system logs daemonset/speaker --tail=200
    kubectl get nodes -o wide
    kubectl -n <namespace> get endpoints <service-name>
Check:
- speaker pods are present on the nodes expected to announce the address
- the advertised address still belongs to the L2 segment (subnet and VLAN) that the announcing nodes are attached to
- backend endpoints are healthy and the problem is not in the consumer service itself
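For the first check above, it can help to find out which node currently announces the address. The sketch below assumes that the speaker records a service event whose message contains `announcing from node "<node>"`, which recent MetalLB releases appear to use; confirm the wording against your own `describe svc` output before relying on it. `announcing_node` is a hypothetical helper name.

```shell
# Reads `kubectl describe svc` output on stdin and prints the node named
# in the most recent "announcing from node" event, if one is present.
# Assumption: the event message format described in the text above.
announcing_node() {
  sed -n 's/.*announcing from node "\([^"]*\)".*/\1/p' | tail -n 1
}

# Usage (requires cluster access, so it is not executed here):
#   kubectl -n <namespace> describe svc <service-name> | announcing_node
```

If a node name is printed, check that a speaker pod is running on that node and that the node sits on the expected network segment.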
### Overlay change introduced the wrong pool or version skew
Check:
- the active cluster overlay still points at the intended config.yaml
- the legacy prod overlay has not diverged further from the shared base
- no overlapping or invalid CIDR or range was introduced
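The overlap check in the last item can be done mechanically for two IPv4 ranges. This is a minimal sketch, assuming both pools are written in `start-end` form; `ranges_overlap` is a hypothetical helper, and CIDR entries would need to be expanded to a range first.

```shell
# Exit 0 if two IPv4 ranges ("A.B.C.D-A.B.C.E") share at least one
# address, exit 1 otherwise.
ranges_overlap() {
  awk -v a="$1" -v b="$2" '
    function ip2int(ip,   o) {
      split(ip, o, ".")
      return ((o[1] * 256 + o[2]) * 256 + o[3]) * 256 + o[4]
    }
    BEGIN {
      split(a, ra, "-")
      split(b, rb, "-")
      # Two closed intervals overlap when each starts before the other ends.
      exit !(ip2int(ra[1]) <= ip2int(rb[2]) && ip2int(rb[1]) <= ip2int(ra[2]))
    }'
}

# Usage with placeholder ranges:
#   ranges_overlap 192.168.1.240-192.168.1.250 192.168.1.250-192.168.1.254 \
#     && echo "pools overlap"
```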
## 3. Disaster Recovery
### Preconditions
- Confirm whether the incident is configuration-only or a broader node or network failure.
- Verify which overlay owns the active cluster pool.
- Confirm that the address block is still routable on the target network.
### Stateful workload recovery
MetalLB is effectively stateless from the repository perspective.
- Reapply the correct overlay.
- Confirm controller and speaker pods become Ready.
- Verify IPAddressPool and L2Advertisement objects exist.
- Re-check affected LoadBalancer services for external IP assignment.
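The verification steps after reapplying the overlay can be chained so that the sequence stops at the first failure. The following is a sketch only: `verify_metallb_recovery` is a hypothetical helper, the 120-second timeout is an arbitrary choice, and the workload names `controller` and `speaker` are taken from the commands used elsewhere in this runbook.

```shell
# Run the post-recovery checks in order and stop at the first failure.
verify_metallb_recovery() {
  ns=metallb-system
  # Wait for the controller and speaker rollouts to report ready.
  kubectl -n "$ns" rollout status deploy/controller --timeout=120s || return 1
  kubectl -n "$ns" rollout status daemonset/speaker --timeout=120s || return 1
  # List the pool objects. An empty list still exits 0, so read the output.
  kubectl -n "$ns" get ipaddresspools,l2advertisements || return 1
  # Show LoadBalancer services so the external IPs can be re-checked.
  kubectl get svc -A | grep LoadBalancer
}

# Usage (requires cluster access, so it is not executed here):
#   verify_metallb_recovery && echo "MetalLB checks passed"
```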
### Cluster rebuild dependency order
1. Node networking and L2 connectivity
2. MetalLB controller and speaker
3. IPAddressPool and L2Advertisement resources
4. Consumer LoadBalancer services such as Traefik
## 4. Scaling and Resource Management

Preferred path: use upstream controller defaults unless sustained scale symptoms appear.
Use these commands to size the problem before changing resources:

    kubectl -n metallb-system top pod
    kubectl -n metallb-system get deploy,daemonset
    kubectl -n metallb-system describe deploy controller
Record:
- whether the controller is CPU bound
- whether speaker scheduling is failing because of node constraints
- whether scale changes are necessary or the real issue is just bad pool configuration
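To make the CPU observation above repeatable, the `top pod` output can be filtered against a threshold. This is a rough sketch: `pods_over_cpu` is a hypothetical helper, the threshold is whatever value you consider suspicious for your cluster, and it assumes the usual `kubectl top pod --no-headers` layout with CPU in field 2 (for example `250m`).

```shell
# Reads `kubectl top pod --no-headers` output on stdin and prints pods
# whose CPU usage exceeds the given threshold in millicores ($1).
pods_over_cpu() {
  awk -v limit="$1" '{
    cpu = $2
    if (cpu ~ /m$/) {
      sub(/m$/, "", cpu)    # "250m" -> 250 millicores
    } else {
      cpu = cpu * 1000      # whole cores -> millicores
    }
    if (cpu + 0 > limit + 0) print $1, $2
  }'
}

# Usage (requires cluster access and a metrics source for `kubectl top`):
#   kubectl -n metallb-system top pod --no-headers | pods_over_cpu 100
```

High usage alone does not prove the controller is CPU bound; compare it with the configured requests and limits shown by `describe deploy controller`.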
## 5. Maintenance Procedures
- Review and document every pool change before rollout.
- Validate that single-address pools still match the intended public or LAN IP.
- Reconcile the legacy prod overlay version with the shared base during future maintenance.
For each task, define:
- Preconditions: identify the target cluster and current pool owner
- Impact window: LoadBalancer services may flap while pools change
- Rollback path: restore previous config.yaml and reapply overlay
- Validation steps: affected LoadBalancer services receive the expected IP again
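The validation step can be made concrete by recording the external IP of every LoadBalancer service before the change and comparing it afterwards. The sketch below is one possible approach, not an established procedure: `lb_ip_snapshot` and the file names `before.txt` and `after.txt` are hypothetical.

```shell
# Reads `kubectl get svc -A --no-headers` output on stdin and prints one
# sorted "namespace/name external-ip" line per LoadBalancer service.
lb_ip_snapshot() {
  awk '$3 == "LoadBalancer" { print $1 "/" $2, $5 }' | sort
}

# Usage (requires cluster access, so it is not executed here):
#   kubectl get svc -A --no-headers | lb_ip_snapshot > before.txt
#   ... apply the pool change ...
#   kubectl get svc -A --no-headers | lb_ip_snapshot > after.txt
#   diff before.txt after.txt && echo "no LoadBalancer IP changed"
```

Any line reported by `diff` identifies a service whose address changed or is still pending after the rollout.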
## 6. Rollback Strategy
Document the fastest safe rollback path:
- Revert the affected metallb/* overlay revision.
- Reapply the previous config.yaml.
- Restart or recheck the affected consumer services only if they still hold or request a stale loadBalancerIP.
## 7. Post-Incident Actions
After recovery, always:
- Update CHANGELOG.md if addresses were reallocated manually or an emergency pool change was applied.
- Update the MetalLB service page if pool ranges, cluster mappings, or version baselines changed.
- Update this runbook if the incident exposed missing L2, pool, or service-allocation checks.
- Capture follow-up work to remove legacy prod version skew.