A BGP loadbalancer for Kubernetes

In k3s with BGP-to-the-host I said I'd create a LoadBalancer implementation to work with k3s on NixOS, with unnumbered interfaces. l3lb is it.

Essentially all l3lb does is assign /32 addresses defined by LoadBalancer resources to lo on nodes where associated (Ready) pods are running. In turn, frr running on all these k8s nodes is configured to advertise all connected routes in the relevant prefix. This is admittedly a bit of a hack, but it makes for an exceedingly simple solution - no additional peerings need to be configured. Convergence time is quite good as well.
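For illustration, the frr side might look roughly like this - the ASN and the prefix-list/route-map names here are placeholders, not my actual config. The key idea is redistributing connected routes, filtered down to /32s inside the loadbalancer prefix:

```
! hypothetical ASN and names for illustration
ip prefix-list L3LB seq 10 permit 10.0.100.0/24 ge 32
!
route-map EXPORT-L3LB permit 10
 match ip address prefix-list L3LB
!
router bgp 65000
 address-family ipv4 unicast
  redistribute connected route-map EXPORT-L3LB
```

The `ge 32` matches only exact /32s within 10.0.100.0/24, so lo's 127.0.0.1/8 and other connected routes never leak into BGP.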

To dive a little deeper, each node runs an instance of the daemon. The daemon watches for pod events and for service events of type LoadBalancer, and performs a few API calls to figure out whether any pods running locally match any LoadBalancer definition. It applies any appropriate addresses, and any addresses on lo that no longer match anything are garbage collected. This is a bit heavy-handed, because a pod restarting on a node far away will trigger reconciliation on all nodes, but for small clusters I'm fine with this for now. Some skimming tells me MetalLB has the same drawback. The benefit of this simple design is we don't need to track state anywhere other than what's already assigned, and no additional controller or message queue is required.
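The reconciliation itself boils down to a set difference. Here's a minimal sketch of that core idea (the names are hypothetical, not the actual l3lb code); the managed prefix acts as the guard so addresses like 127.0.0.1 are never touched:

```python
import ipaddress

def reconcile(desired, current, prefix):
    """Given the set of /32 addresses that should be on lo (desired),
    the addresses currently on lo (current), and the managed prefix,
    return (to_add, to_remove). Addresses outside the managed prefix
    are ignored entirely, so loopback and host addresses are safe."""
    net = ipaddress.ip_network(prefix)
    managed = {a for a in current if ipaddress.ip_address(a) in net}
    to_add = set(desired) - managed
    to_remove = managed - set(desired)
    return to_add, to_remove
```

For example, if lo holds 10.0.100.5 and 10.0.100.8 but only 10.0.100.5 and 10.0.100.6 still match local Ready pods, reconcile assumes 10.0.100.6 and forfeits 10.0.100.8 - exactly the log lines shown below.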

Here's what l3lb looks like in use:

[root@k8s-8ae4:~]# systemctl status l3lb
● l3lb.service
     Loaded: loaded (/etc/systemd/system/l3lb.service; enabled; preset: ignored)
     Active: active (running) since Sat 2025-04-05 05:53:40 UTC; 4 days ago
 Invocation: e4f6f1cfaed549528cc635a4f90b3676
   Main PID: 1023 (l3lb-start)
         IP: 3.6M in, 19.8K out
         IO: 171M read, 240.8M written
      Tasks: 5 (limit: 74641)
     Memory: 263.3M (peak: 467.2M)
        CPU: 1min 50.733s
     CGroup: /system.slice/l3lb.service
             ├─1023 /nix/store/8vpg72ik2kgxfj05lc56hkqrdrfl8xi9-bash-5.2p37/bin/bash /nix/store/7nz8pvfiqx52jypbk0smkd6gia3hxfwq-unit-script-l3lb-start/bin/l3lb-start
             ├─1030 bash /tmp/nix-shell-1030-0/rc
             └─1476 python -u main.py

Apr 09 12:37:34 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.5
Apr 09 13:53:19 k8s-8ae4 l3lb-start[1476]: forfeiting address 10.0.100.8
Apr 09 13:53:19 k8s-8ae4 l3lb-start[1476]: forfeiting address 10.0.100.7
Apr 09 13:53:41 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.6
Apr 09 13:53:41 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.8
Apr 09 13:53:42 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.7
Apr 09 13:55:09 k8s-8ae4 l3lb-start[1476]: assuming address 10.0.100.4
[root@k8s-8ae4:~]# ip a show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.0.10.0/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.10.255/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.100.2/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.100.5/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.100.6/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.100.8/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.100.7/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.100.4/32 scope global lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever

I’m deploying it with nk3:

  systemd.services.l3lb = let
    src = pkgs.fetchgit {
      url = "https://github.com/nihr43/k8s-l3-lb.git";
      rev = "55e77a8";
      sha256 = "9HOyutNu9iDn3dG72tQTt+EpqHBfOkdhaje0uWuG3QQ=";
    };
  in {
    after = [ "k3s.service" ];
    requires = [ "k3s.service" ];
    wantedBy = [ "default.target" ];
    serviceConfig = {
      Type = "simple";
      WorkingDirectory = "${src}";
      Restart = "on-failure";
      RestartSec = "10s";
    };
    environment = {
      NIX_PATH = "nixpkgs=${pkgs.path}";
      KUBECONFIG = "/etc/rancher/k3s/k3s.yaml";
      L3LB_PREFIX = "10.0.100.0/24";
    };
    script = ''
      ${pkgs.nix}/bin/nix-shell --run 'python -u main.py'
    '';
  };

l3lb simply derives its state through the state of the local lo interface and the configured prefix.
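Deriving that state is cheap: `ip -j addr show lo` emits JSON, and filtering its addresses by the managed prefix recovers everything the daemon has assumed so far. A small sketch (hypothetical helper, not the actual l3lb code):

```python
import ipaddress
import json

def addresses_in_prefix(ip_json, prefix):
    """Given JSON output from `ip -j addr show lo` and the managed
    prefix, return the addresses l3lb is currently responsible for.
    This is the daemon's entire notion of persistent state."""
    net = ipaddress.ip_network(prefix)
    addrs = [a["local"]
             for iface in json.loads(ip_json)
             for a in iface.get("addr_info", [])]
    return sorted((a for a in addrs if ipaddress.ip_address(a) in net),
                  key=ipaddress.ip_address)
```

Against the `ip a show lo` output above with prefix 10.0.100.0/24, this would return the 10.0.100.x addresses while ignoring 127.0.0.1 and the 10.0.10.x addresses outside the prefix.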

Setting external_traffic_policy to Local instructs kube-proxy not to do additional load balancing once traffic has shown up at a node; BGP already gets traffic to the right nodes in the first place:

resource "kubernetes_service" "registry" {
  metadata {
    name = "registry"
  }
  spec {
    selector = {
      app = "registry"
    }
    port {
      port        = "5000"
      target_port = "5000"
    }
    type                    = "LoadBalancer"
    session_affinity        = "ClientIP"
    load_balancer_ip        = "10.0.100.2"
    external_traffic_policy = "Local"
  }
}

2025-04-08