LXD Cluster Management with Anycast Routing

I recently found out about LXD’s web interface and had to try it out.

All of my servers have 10.0.200.0/32 set on lo and advertise it to the network via BGP. This provides simple load balancing and failover for common host services. LXD implements a well-designed, modern API, so anycast should be a good fit for cluster access.

Here's the netplan config for one of these servers:

root@x10slhf-xeon-920ea:~# cat /etc/netplan/bgp-unnumbered.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    lo:
      addresses:
        - 10.0.200.5/32
        - 10.0.200.0/32
    eno1:
      optional: true
      mtu: 9000
    eno2:
      optional: true
      mtu: 9000
  vlans:
    bgpeno1:
      link: eno1
      id: 10
    bgpeno2:
      link: eno2
      id: 10
  tunnels:
    vxlan100:
      mode: vxlan
      id: 100
      local: 10.0.200.5
      mac-learning: false
      mtu: 8950
  bridges:
    br-vxlan100:
      interfaces: ["vxlan100"]

To serve the LXD frontend on 10.0.200.0, we need to set core.https_address on each server. We also enable the UI here:

ssh [email protected] -- "snap set lxd ui.enable=true && /snap/bin/lxc config set core.https_address=10.0.200.0:8443 && systemctl reload snap.lxd.daemon"
...
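As a quick sanity check after changing core.https_address, something like the following could confirm that the API answers on the anycast address. This is a minimal sketch, not part of the original setup: `parse_server_info` and `probe` are hypothetical helper names, certificate verification is disabled on the assumption that the server uses a self-signed certificate, and the `api_version` and `auth` fields are what I would expect an unauthenticated `GET /1.0` to return.

```python
import json
import ssl
import urllib.request

# Anycast address and port taken from the post.
API_URL = "https://10.0.200.0:8443/1.0"


def parse_server_info(body: bytes) -> dict:
    """Extract a few fields from an LXD `GET /1.0` response body."""
    doc = json.loads(body)
    meta = doc.get("metadata") or {}
    return {
        "api_version": meta.get("api_version"),
        "auth": meta.get("auth"),
    }


def probe(url: str = API_URL, timeout: float = 5.0) -> dict:
    """Fetch the API root over HTTPS and summarise the reply."""
    # Assumption: the server presents a self-signed certificate, so
    # verification is disabled here. Pin the certificate for real use.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
        return parse_server_info(resp.read())
```

Calling `probe()` from a machine that can reach 10.0.200.0 should return a small dict. Which cluster member actually answered depends on routing and is not visible in this reply.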

And with that, https://10.0.200.0:8443/ui gets us to the web UI. For the most part, things work as expected. Short-lived connections to the API, such as those made by the lxc command-line client, work very well, but console video feeds and real-time shells occasionally have their connections dropped. Of course, if intermediate routers decide to send our traffic to host B while the shell we have open belongs to a container on host A, the connection will be dropped if host B is bounced, and potentially dropped again when host B comes back online.
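To illustrate why long-lived sessions are the ones that suffer, here is a toy model of per-flow next-hop selection. It assumes routers pick a next hop by hashing the flow tuple modulo the number of available hops. Real routers use their own hashing schemes, so treat this purely as an illustration. The host names and client addresses are made up.

```python
import hashlib


def pick_next_hop(flow: tuple, next_hops: list) -> str:
    """Choose a next hop by hashing the flow tuple modulo the hop count."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]


def remapped(flows: list, before: list, after: list) -> list:
    """Return the flows whose next hop differs between two hop sets."""
    return [f for f in flows if pick_next_hop(f, before) != pick_next_hop(f, after)]


# Hypothetical flows: (src ip, src port, dst ip, dst port).
flows = [("192.0.2.10", 40000 + i, "10.0.200.0", 8443) for i in range(100)]
all_hosts = ["host-a", "host-b", "host-c"]
without_b = ["host-a", "host-c"]

moved = remapped(flows, all_hosts, without_b)
# Flows that were on host-b must move; with modulo hashing, some flows
# that were on host-a or host-c may move as well.
print(f"{len(moved)} of {len(flows)} flows changed host when host-b left")
```

A short API request finishes before the hop set changes, so it rarely notices. A console or shell session that stays open across the change lands on a host that has no state for that TCP connection, and it is reset.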

These shortcomings aside, this is a high bang-for-the-buck solution for basic HA for cluster management that allows our hosts to exist in different broadcast domains.

If anything, I may add a simple systemd service on each host that withdraws the route when LXD can’t be reached on 127.0.0.1, which would imply the service itself has failed.
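A rough sketch of what such a watchdog might look like follows. Several things here are assumptions rather than anything from the setup above: the route is withdrawn by removing the anycast address from lo (which only works if the BGP daemon advertises connected addresses), the function names are hypothetical, and `CHECK_ADDR` is a placeholder. With core.https_address bound to 10.0.200.0:8443, LXD may only listen on the anycast address, in which case the check target needs adjusting.

```python
import socket
import subprocess
import time

ANYCAST_CIDR = "10.0.200.0/32"  # anycast address from the netplan config
CHECK_ADDR = ("127.0.0.1", 8443)  # assumption: LXD also listens here


def lxd_reachable(addr: tuple, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to addr succeeds."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False


def next_action(healthy: bool, advertised: bool):
    """Return "add", "del", or None for the current state."""
    if healthy and not advertised:
        return "add"
    if advertised and not healthy:
        return "del"
    return None


def apply(action: str) -> None:
    """Add or remove the anycast address on lo (assumed withdraw mechanism)."""
    subprocess.run(["ip", "addr", action, ANYCAST_CIDR, "dev", "lo"], check=False)


def run(interval: float = 5.0) -> None:
    """Poll LXD forever and adjust the anycast address as health changes."""
    advertised = True  # assumption: netplan has already configured the address
    while True:
        action = next_action(lxd_reachable(CHECK_ADDR), advertised)
        if action is not None:
            apply(action)
            advertised = action == "add"
        time.sleep(interval)
```

A systemd unit could simply call `run()` as root. Keeping the decision in `next_action` separate from the side effect in `apply` makes it easy to swap in a different withdraw mechanism, such as telling the BGP daemon directly.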

Nathan Hensel

on caving, mountaineering, networking, computing, electronics


2023-07-25