BGP Unnumbered and L3 to the Host on Linux

These are my notes on implementing rfc7938 and rfc5549 on linux hosts wired up in a spine-leaf network.

BGP unnumbered is fundamentally an implementation of rfc5549 - ‘Advertising IPv4 Network Layer Reachability Information with an IPv6 Next Hop’. In short, this approach significantly lowers the barrier to entry for building a pure l3 network - servers are simply assigned /32 ipv4 loopback addresses, and peer over unnumbered network uplinks. This gives us true multipathing and failover without the need for l2 ‘hacks’ such as lacp, spanning tree, mclag, vrrp, etc. Top-of-rack switches become top-of-rack routers, and servers advertise their presence directly into the network.
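To make the rfc5549 idea concrete, here is a minimal Python sketch of what such a route looks like as data: an ipv4 prefix whose next hop is an ipv6 link-local address, scoped to the interface it was learned on. The class name, the peer loopback 10.0.200.3 and the next hop fe80::1 are made up for illustration; only the interface name comes from the config in these notes.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, IPv6Address


@dataclass(frozen=True)
class UnnumberedRoute:
    """An IPv4 prefix reachable via an IPv6 link-local next hop."""

    prefix: IPv4Network
    next_hop: IPv6Address
    interface: str  # a link-local address only means something per interface

    def __post_init__(self):
        # Unnumbered peering relies on fe80::/10 addresses, so reject anything else.
        if not self.next_hop.is_link_local:
            raise ValueError(f"{self.next_hop} is not in fe80::/10")


# Hypothetical values: another server's /32 loopback learned over one uplink.
route = UnnumberedRoute(
    prefix=IPv4Network("10.0.200.3/32"),
    next_hop=IPv6Address("fe80::1"),
    interface="bgpenp35s0",
)
print(f"{route.prefix} via {route.next_hop}%{route.interface}")
```

The point is that no ipv4 address ever has to be assigned to the link itself; the interface plus the link-local address is enough to forward.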

In the network described here, everything runs Debian and everything runs bgp: spine routers, storage hosts, kubernetes nodes, access routers, etc. Every server is a ‘leaf’, plugged into two Lanner NCA-1515 boxes (10.0.0.10, 10.0.0.11) I scored as ‘for parts not working’ on ebay. All this uniformity means I only need one process to deploy everything, one process to upgrade and bounce everything, and one process to monitor everything.

It should go without saying that some sort of config management is a must when attempting this sort of thing. Once grokked, the day-2 operation of this environment is relatively simple.

I find netplan to be the most expressive, template-friendly, and composable tool for advanced network configuration. To use netplan, we need to ensure /etc/network/interfaces.d/, /etc/systemd/network/ and similar are empty, network-manager is absent, and systemd-networkd is enabled.

Here is /etc/netplan/bgp-unnumbered.yaml for k8s/ceph server 10.0.200.2:

network:
  version: 2
  renderer: networkd
  ethernets:
    lo:
      addresses:
        - 10.0.200.2/32
        - 10.0.200.0/32
    enp35s0:
      optional: true
      mtu: 9000
    enp36s0:
      optional: true
      mtu: 9000
  vlans:
    bgpenp35s0:
      link: enp35s0
      id: 10
    bgpenp36s0:
      link: enp36s0
      id: 10
  tunnels:
    vxlan100:
      mode: vxlan
      id: 100
      local: 10.0.200.2
      mac-learning: false
      mtu: 8950
  bridges:
    br-vxlan100:
      interfaces:
        - vxlan100

Here are the important points:

  • enp35s0 and enp36s0 are brought administratively ‘up’ with no ip
  • vlan subinterfaces bgpenp35s0 and bgpenp36s0 are created on tag 10. this is so the spines can run dhcpd on the untagged broadcast domain for provisioning.
  • br-vxlan100 is brought up on vxlan vni 100. data-plane mac learning is disabled and no remote is defined, so nothing is flooded until bgp evpn provides endpoint discovery.
  • lo has two /32 ips - 10.0.200.2 is the primary address for the server. 10.0.200.0/32 is duplicated across all nodes to enable anycast HA kubectl access.
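Since the only per-host inputs are the loopback address and the nic names, the file above templates nicely. Below is a rough Python sketch of that idea, not the tooling I actually use: build_netplan and its parameters are hypothetical names. It also shows where 8950 comes from. Encapsulating a frame in vxlan over ipv4 adds 50 bytes against the underlay ip mtu, which is consistent with the 9000 and 8950 values in the config.

```python
import json

# Bytes added to each inner IP packet, counted against the underlay IP MTU:
# inner Ethernet 14 + VXLAN 8 + UDP 8 + outer IPv4 20.
VXLAN_OVERHEAD = 14 + 8 + 8 + 20


def build_netplan(loopback, nics, anycast="10.0.200.0", bgp_vlan=10,
                  vni=100, underlay_mtu=9000):
    """Return the netplan document for one host as a plain dict."""
    vxlan = f"vxlan{vni}"
    ethernets = {"lo": {"addresses": [f"{loopback}/32", f"{anycast}/32"]}}
    for nic in nics:
        # Physical uplinks come up with no address of their own.
        ethernets[nic] = {"optional": True, "mtu": underlay_mtu}
    return {
        "network": {
            "version": 2,
            "renderer": "networkd",
            "ethernets": ethernets,
            # One tagged subinterface per uplink for the bgp sessions.
            "vlans": {f"bgp{nic}": {"link": nic, "id": bgp_vlan} for nic in nics},
            "tunnels": {
                vxlan: {
                    "mode": "vxlan",
                    "id": vni,
                    "local": loopback,
                    "mac-learning": False,
                    "mtu": underlay_mtu - VXLAN_OVERHEAD,
                }
            },
            "bridges": {f"br-{vxlan}": {"interfaces": [vxlan]}},
        }
    }


plan = build_netplan("10.0.200.2", ["enp35s0", "enp36s0"])
print(json.dumps(plan, indent=2))
```

The sketch prints json only so it has no dependencies; in practice a yaml library or the config management tool's own templating would write the file.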

net.ipv4.conf.all.forwarding=1 of course is enabled.
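A forgotten sysctl is an easy thing to miss on a freshly provisioned box, so a deploy or monitoring step can check it. This is a small sketch, assuming the usual Linux procfs location for that key; the function name is made up, and the path is a parameter so the check can be pointed at a test file.

```python
from pathlib import Path


def forwarding_enabled(path="/proc/sys/net/ipv4/conf/all/forwarding"):
    """Return True if the kernel reports ipv4 forwarding as enabled."""
    # The kernel exposes the value as "0" or "1" followed by a newline.
    return Path(path).read_text().strip() == "1"
```

Calling forwarding_enabled() with no arguments on the host itself reads the live value.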

Having tried bird and quagga, I find frr to be the most full-featured and easy-to-template option for routing on linux. With the interface configuration in place, here’s /etc/frr/frr.conf:

log syslog
debug bgp
debug zebra vxlan
debug zebra evpn mh es
debug zebra evpn mh mac
debug zebra evpn mh neigh
frr defaults datacenter
zebra nexthop-group keep 1

router bgp 64791
  bgp router-id 10.0.200.2
  bgp fast-convergence
  bgp bestpath compare-routerid
  bgp bestpath as-path multipath-relax
  neighbor bgpenp35s0 interface remote-as external
  neighbor bgpenp36s0 interface remote-as external
  address-family ipv4 unicast
    neighbor bgpenp35s0 route-map default in
    neighbor bgpenp36s0 route-map default in
    redistribute connected
  address-family l2vpn evpn
    neighbor bgpenp35s0 activate
    neighbor bgpenp36s0 activate
    advertise-all-vni

ip prefix-list p1 permit 10.0.0.0/24 ge 32
ip prefix-list p1 permit 192.168.1.0/24 ge 32
ip prefix-list p1 permit 10.0.100.0/24 ge 32
ip prefix-list p1 permit 10.0.200.0/24 ge 32
ip prefix-list p1 permit 172.16.0.0/16 le 26
ip prefix-list p1 permit 172.30.0.0/16 le 27
ip prefix-list p1 permit 0.0.0.0/0

route-map default permit 10
  match ip address prefix-list p1

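Here we’re telling bgp to peer on each vlan interface, advertise all connected routes, and accept only /32s under 10.0.0.0/24, 10.0.100.0/24, and so on, plus prefixes up to /26 and /27 under 172.16.0.0/16 and 172.30.0.0/16 and a default route. I should filter the advertised routes too; I haven’t gotten to it yet.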
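The ge / le semantics are easy to get backwards, so here is a small Python model of how I read prefix-list p1. It is a sketch of the matching rules, not frr's code: an entry with neither ge nor le matches only that exact prefix, ge alone means "this length up to /32", and le alone means "the entry's own length up to le". Anything that matches no entry is dropped by the route-map. The function names are made up.

```python
from ipaddress import ip_network

# (prefix, ge, le) entries mirroring prefix-list p1; None means "not set".
P1 = [
    ("10.0.0.0/24", 32, None),
    ("192.168.1.0/24", 32, None),
    ("10.0.100.0/24", 32, None),
    ("10.0.200.0/24", 32, None),
    ("172.16.0.0/16", None, 26),
    ("172.30.0.0/16", None, 27),
    ("0.0.0.0/0", None, None),
]


def entry_matches(route, prefix, ge, le):
    """Check one route against one prefix-list entry."""
    route, prefix = ip_network(route), ip_network(prefix)
    if not route.subnet_of(prefix):
        return False
    if ge is None and le is None:
        # No length range given: only the exact prefix matches.
        return route.prefixlen == prefix.prefixlen
    low = ge if ge is not None else prefix.prefixlen
    high = le if le is not None else route.max_prefixlen
    return low <= route.prefixlen <= high


def permitted(route, entries=P1):
    """True if any entry permits the route; otherwise it is denied."""
    return any(entry_matches(route, *entry) for entry in entries)


for candidate in ("10.0.200.2/32", "10.0.200.0/24", "172.16.5.0/24",
                  "172.16.0.0/27", "0.0.0.0/0", "8.8.8.0/24"):
    print(candidate, "accept" if permitted(candidate) else "drop")
```

Note that the last entry, 0.0.0.0/0 with no ge or le, permits only the default route itself; it does not act as a catch-all.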

bgp bestpath as-path multipath-relax allows ecmp over dissimilar but equal-length AS paths. address-family l2vpn evpn enables evpn for vxlan endpoint advertisement and learning. zebra nexthop-group keep 1 is a workaround for an apparent ecmp / fib synchronization bug.
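To illustrate what multipath-relax changes, here is a simplified Python sketch. It assumes every other bestpath attribute (local-pref, origin, med, and so on) already ties, and it assumes the two spines use different AS numbers; the ASNs below are made up. Without the knob, a path must carry the same AS path as the best path to share traffic with it; with it, only the length has to match.

```python
def multipath_candidates(best, others, relax):
    """Return the paths eligible for ecmp alongside the best path."""

    def eligible(path):
        if relax:
            # multipath-relax: equal AS path length is enough.
            return len(path) == len(best)
        # Default: the whole AS path must be identical.
        return path == best

    return [best] + [p for p in others if eligible(p)]


# Hypothetical ASNs: the same remote server seen through two different spines.
via_spine_a = (64601, 64792)
via_spine_b = (64602, 64792)

strict = multipath_candidates(via_spine_a, [via_spine_b], relax=False)
relaxed = multipath_candidates(via_spine_a, [via_spine_b], relax=True)
print(len(strict), len(relaxed))  # 1 2
```

With one AS per spine, the strict rule would leave every server with a single active uplink, which defeats the point of wiring up two.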

Nathan Hensel


2023-07-18