In k3s with BGP-to-the-host I said I'd create a LoadBalancer implementation to work with k3s on NixOS, with unnumbered interfaces. l3lb is it.
Essentially all l3lb does is assign the /32 addresses defined by LoadBalancer resources to lo on nodes where associated (Ready) pods are running. In turn, frr running on all of these k8s nodes is configured to advertise all connected addresses within the relevant prefix. This is admittedly a bit of a hack, but it makes for an exceedingly simple solution - no additional peerings need to be configured. Convergence time is quite good as well.
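The address-assignment half really is that small. Here is a minimal sketch of what it could look like, assuming pyroute2 - the real daemon might just as well shell out to ip(8), and these helper names only mirror the log messages shown further down:

# Hypothetical helpers; l3lb's actual main.py may differ. Needs NET_ADMIN.
from pyroute2 import IPRoute

LO = "lo"

def assume(address: str) -> None:
    """Add a /32 service address to the loopback interface."""
    with IPRoute() as ipr:
        idx = ipr.link_lookup(ifname=LO)[0]
        ipr.addr("add", index=idx, address=address, prefixlen=32)
        print(f"assuming address {address}")

def forfeit(address: str) -> None:
    """Remove a /32 service address from the loopback interface."""
    with IPRoute() as ipr:
        idx = ipr.link_lookup(ifname=LO)[0]
        ipr.addr("del", index=idx, address=address, prefixlen=32)
        print(f"forfeiting address {address}")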
To dive a little deeper, each node runs an instance of the daemon. The daemon watches for pod events and for service events of type LoadBalancer, and performs a few API calls to figure out whether any pods running locally match any LoadBalancer definition. It applies the appropriate addresses, and any addresses on lo within the configured prefix that do not match anything are garbage collected. This is a bit heavy-handed, because a pod restarting on a node far away will trigger reconciliation on all nodes, but for small clusters I'm fine with this for now. Some skimming tells me MetalLB has the same drawback. The benefit of this simple design is that we don't need to track state anywhere other than what's already assigned, and no additional controller or message queue is required.
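As a rough illustration of that matching step (not the actual main.py - the kubernetes client usage, the hostname-as-node-name shortcut, and every name here are my own assumptions), the per-node "desired" set could be computed like this:

# Sketch only: compute which LoadBalancer IPs should live on this node.
import socket
from kubernetes import client, config

def desired_addresses() -> set[str]:
    config.load_kube_config()  # KUBECONFIG is set by the systemd unit below
    v1 = client.CoreV1Api()
    node = socket.gethostname()  # assumption: node name matches the hostname
    wanted: set[str] = set()
    for svc in v1.list_service_for_all_namespaces().items:
        if svc.spec.type != "LoadBalancer" or not svc.spec.load_balancer_ip:
            continue
        selector = svc.spec.selector or {}
        if not selector:
            continue
        label_selector = ",".join(f"{k}={v}" for k, v in selector.items())
        pods = v1.list_namespaced_pod(svc.metadata.namespace,
                                      label_selector=label_selector).items
        for pod in pods:
            ready = any(c.type == "Ready" and c.status == "True"
                        for c in (pod.status.conditions or []))
            if ready and pod.spec.node_name == node:
                wanted.add(svc.spec.load_balancer_ip)
                break
    return wanted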
Here's what l3lb looks like in use:
[root@k8s-8ae4:~]# systemctl status l3lb
● l3lb.service
Loaded: loaded (/etc/systemd/system/l3lb.service; enabled; preset: ignored)
Active: active (running) since Sat 2025-04-05 05:53:40 UTC; 4 days ago
Invocation: e4f6f1cfaed549528cc635a4f90b3676
Main PID: 1023 (l3lb-start)
IP: 3.6M in, 19.8K out
IO: 171M read, 240.8M written
Tasks: 5 (limit: 74641)
Memory: 263.3M (peak: 467.2M)
CPU: 1min 50.733s
CGroup: /system.slice/l3lb.service
├─1023 /nix/store/8vpg72ik2kgxfj05lc56hkqrdrfl8xi9-bash-5.2p37/bin/bash /nix/store/7nz8pvfiqx52jypbk0smkd6gia3hxfwq-unit-script-l3lb-start/bin/l3lb-start
├─1030 bash /tmp/nix-shell-1030-0/rc
└─1476 python -u main.py
Apr 09 12:37:34 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.5
Apr 09 13:53:19 k8s-8ae4 l3lb-start[1476]: forfeiting address 10.0.100.8
Apr 09 13:53:19 k8s-8ae4 l3lb-start[1476]: forfeiting address 10.0.100.7
Apr 09 13:53:41 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.6
Apr 09 13:53:41 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.8
Apr 09 13:53:42 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.7
Apr 09 13:55:09 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.4
[root@k8s-8ae4:~]# ip a show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet 10.0.10.0/32 scope global lo
valid_lft forever preferred_lft forever
inet 10.0.10.255/32 scope global lo
valid_lft forever preferred_lft forever
inet 10.0.100.2/32 scope global lo
valid_lft forever preferred_lft forever
inet 10.0.100.5/32 scope global lo
valid_lft forever preferred_lft forever
inet 10.0.100.6/32 scope global lo
valid_lft forever preferred_lft forever
inet 10.0.100.8/32 scope global lo
valid_lft forever preferred_lft forever
inet 10.0.100.7/32 scope global lo
valid_lft forever preferred_lft forever
inet 10.0.100.4/32 scope global lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
I’m deploying it with nk3:
systemd.services.l3lb = let
  src = pkgs.fetchgit {
    url = "https://github.com/nihr43/k8s-l3-lb.git";
    rev = "55e77a8";
    sha256 = "9HOyutNu9iDn3dG72tQTt+EpqHBfOkdhaje0uWuG3QQ=";
  };
in {
  after = [ "k3s.service" ];
  requires = [ "k3s.service" ];
  wantedBy = [ "default.target" ];
  serviceConfig = {
    Type = "simple";
    WorkingDirectory = "${src}";
    Restart = "on-failure";
    RestartSec = "10s";
  };
  environment = {
    NIX_PATH = "nixpkgs=${pkgs.path}";
    KUBECONFIG = "/etc/rancher/k3s/k3s.yaml";
    L3LB_PREFIX = "10.0.100.0/24";
  };
  script = ''
    ${pkgs.nix}/bin/nix-shell --run 'python -u main.py'
  '';
};
l3lb simply derives its state through the state of the local lo interface and the configured prefix.
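Concretely, "the state of lo" is just the set of /32s inside L3LB_PREFIX. A sketch of that lookup, again assuming pyroute2 and invented names; reconciliation is then two set differences - add what is desired but missing, remove what is present but no longer desired:

# Sketch: enumerate the /32s on lo that fall inside the managed prefix.
import ipaddress
from socket import AF_INET
from pyroute2 import IPRoute

def current_addresses(prefix: str) -> set[str]:
    net = ipaddress.ip_network(prefix)
    with IPRoute() as ipr:
        idx = ipr.link_lookup(ifname="lo")[0]
        addrs = ipr.get_addr(index=idx, family=AF_INET)
    return {
        a.get_attr("IFA_ADDRESS")
        for a in addrs
        if a["prefixlen"] == 32
        and ipaddress.ip_address(a.get_attr("IFA_ADDRESS")) in net
    }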
external_traffic_policy instructs kube-proxy not to do additional load balancing once traffic has shown up at a node. BGP gets traffic to the right nodes in the first place:
resource "kubernetes_service" "registry" {
metadata {
name = "registry"
}
spec {
selector = {
app = "registry"
}
port {
port = "5000"
target_port = "5000"
}
type = "LoadBalancer"
session_affinity = "ClientIP"
load_balancer_ip = "10.0.100.2"
external_traffic_policy = "Local"
}
}