LXD Cluster Management with Anycast Routing

I recently found out about LXD’s web interface and had to try it out.

All of my servers have 10.0.200.0/32 configured on lo and advertise it to the network via BGP. This provides simple load balancing and failover for common host services. LXD implements a well-designed, modern API, so anycast should be a good fit for cluster access.

Here’s the netplan config for one of these servers:

root@x10slhf-xeon-920ea:~# cat /etc/netplan/bgp-unnumbered.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    lo:
      addresses:
        - 10.0.200.5/32
        - 10.0.200.0/32
    eno1:
      optional: true
      mtu: 9000
    eno2:
      optional: true
      mtu: 9000
  vlans:
    bgpeno1:
      link: eno1
      id: 10
    bgpeno2:
      link: eno2
      id: 10
  tunnels:
    vxlan100:
      mode: vxlan
      id: 100
      local: 10.0.200.5
      mac-learning: false
      mtu: 8950
  bridges:
    br-vxlan100:
      interfaces: ["vxlan100"]

To serve the LXD frontend on 10.0.200.0, we need to set core.https_address on each server. We also enable the UI here:

ssh [email protected] -- "snap set lxd ui.enable=true && /snap/bin/lxc config set core.https_address=10.0.200.0:8443 && systemctl reload snap.lxd.daemon"
...

And with that, https://10.0.200.0:8443/ui gets us to the web UI. For the most part, things work as expected. Short-lived connections to the API, such as those made by the lxc command-line client, work very well, but console video feeds and real-time shells occasionally have their connections dropped. That is the price of anycast: if intermediate routers send our traffic to host B while the shell we have open belongs to a container on host A, the connection is dropped if host B is bounced, and potentially dropped again when host B comes back online.
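To see why a flow can break both when a host leaves and when it returns, here is a toy model of how a router might spread flows across equal-cost next hops. It assumes plain modulo hashing over the flow 5-tuple. Real routers differ (some use resilient or consistent hashing), and the host names and client address are made up for illustration.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow 5-tuple onto one of the equal-cost next hops by hashing."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]


# Hypothetical client flows towards the anycast address.
flows = [("10.1.1.10", 40000 + i, "10.0.200.0", 8443, "tcp") for i in range(1000)]

all_up = ["A", "B", "C"]   # every host advertises 10.0.200.0/32
b_down = ["A", "C"]        # host B has withdrawn its route

before = {f: pick_next_hop(f, all_up) for f in flows}
during = {f: pick_next_hop(f, b_down) for f in flows}

# Flows that never touched B but still landed on a different host once B left.
moved = [f for f in flows if before[f] != "B" and during[f] != before[f]]
print(f"{len(moved)} of {len(flows)} flows were moved although they were not on B")
```

In this model, changing the size of the next-hop set reshuffles flows that had nothing to do with the host that changed state, and the same reshuffle happens again when the host returns. A TCP session that lands on a different host mid-stream is reset, which matches the dropped consoles and shells.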

These shortcomings aside, this setup delivers a lot of bang for the buck: basic HA for cluster management that also lets our hosts live in different broadcast domains.

If anything, I may add a simple systemd service on each host that withdraws the route when LXD can’t be reached on 127.0.0.1, which would imply that the service itself has failed.
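A minimal sketch of what such a health check could look like, not something I have deployed. It assumes the BGP daemon advertises whatever addresses are configured on lo, so deleting 10.0.200.0/32 from lo withdraws the route. It also assumes LXD answers on 127.0.0.1:8443, so with core.https_address bound to the anycast address the probe target may need adjusting. The function names are hypothetical.

```python
import socket
import subprocess
import time

ANYCAST = "10.0.200.0/32"  # anycast address from the netplan config
CHECK_HOST = "127.0.0.1"   # assumption: LXD is reachable locally on this address
CHECK_PORT = 8443


def lxd_reachable(host=CHECK_HOST, port=CHECK_PORT, timeout=2.0):
    """Return True if a TCP connection to the LXD API port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def next_command(reachable, advertised):
    """Return the ip command to run, or None when nothing needs to change."""
    if reachable and not advertised:
        return ["ip", "addr", "add", ANYCAST, "dev", "lo"]
    if not reachable and advertised:
        return ["ip", "addr", "del", ANYCAST, "dev", "lo"]
    return None


def main(interval=5.0):
    """Poll LXD forever and add or remove the anycast address as needed."""
    advertised = True  # assume netplan configured the address at boot
    while True:
        cmd = next_command(lxd_reachable(), advertised)
        if cmd is not None:
            subprocess.run(cmd, check=True)  # needs root
            advertised = not advertised
        time.sleep(interval)
```

A systemd unit would then simply call main() as root. A TCP connect only proves the port is open, so a stricter check could request /1.0 from the API instead.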


2023-07-25