These are my notes on implementing rfc7938 and rfc5549 on linux hosts wired up in a spine-leaf network.
BGP-unnumbered is fundamentally an implementation of rfc5549 (since obsoleted by rfc8950) - ‘Advertising IPv4 Network Layer Reachability Information with an IPv6 Next Hop’. In short, this approach significantly lowers the barrier to entry for building a pure l3 network - servers are simply assigned /32 ipv4 loopback addresses, and peer over unnumbered network links that only need to be brought up.
This gives us true multipathing and failover without the need for l2 ‘hacks’ such as lacp, spanning tree, mclag, vrrp, etc. Top-of-rack switches become top-of-rack routers, and servers advertise their presence directly into the network.
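To make ‘unnumbered’ a bit more concrete: each link still carries an IPv6 link-local address, which the peers learn from each other's router advertisements and use as the bgp next hop. The sketch below shows the classic modified EUI-64 derivation of such an address from a MAC. The MAC is made up, and linux can be configured to generate link-locals differently (stable-privacy, for example), so treat this as an illustration rather than what these hosts necessarily do.

```python
import ipaddress

def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
    """Derive the modified EUI-64 link-local address (fe80::/64) for a MAC."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets, got {mac!r}")
    octets[0] ^= 0x02  # flip the universal/local bit
    iid = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    packed = bytes([0xFE, 0x80] + [0] * 6 + iid)  # fe80::/64 prefix + interface id
    return ipaddress.IPv6Address(packed)

# Hypothetical MAC, not taken from the hosts described here.
print(eui64_link_local("52:54:00:12:34:56"))  # fe80::5054:ff:fe12:3456
```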
In the network described here, everything runs Debian and everything runs bgp: spine routers, storage hosts, kubernetes nodes, access routers, etc. Every server is a ‘leaf’, plugged into two Lanner NCA-1515 boxes (10.0.0.10, 10.0.0.11) I scored as ‘for parts not working’ on ebay. All this uniformity means I only need one process to deploy everything, one process to upgrade and bounce everything, and one process to monitor everything.
It should go without saying that some sort of config management is a must when attempting this sort of thing. Once grokked, the day-2 operation of this environment is relatively simple.
I find netplan to be the most expressive, template-friendly, and composable tool for advanced network configuration. To use netplan, we need to ensure /etc/network/interfaces.d/, /etc/systemd/network/ and similar are empty, network-manager is absent, and systemd-networkd is enabled.
Here is /etc/netplan/bgp-unnumbered.yaml for k8s/ceph server 10.0.200.2:
network:
  version: 2
  renderer: networkd
  ethernets:
    lo:
      addresses:
        - 10.0.200.2/32
        - 10.0.200.0/32
    enp35s0:
      optional: true
      mtu: 9000
    enp36s0:
      optional: true
      mtu: 9000
  vlans:
    bgpenp35s0:
      link: enp35s0
      id: 10
    bgpenp36s0:
      link: enp36s0
      id: 10
  tunnels:
    vxlan100:
      mode: vxlan
      id: 100
      local: 10.0.200.2
      mac-learning: false
      mtu: 8950
  bridges:
    br-vxlan100:
      interfaces:
        - vxlan100
Here are the important points:
- enp35s0 and enp36s0 are brought administratively ‘up’ with no ip
- vlan subinterfaces bgpenp35s0 and bgpenp36s0 are created on tag 10. this is so the spines can run dhcpd on the untagged broadcast domain for provisioning.
- br-vxlan100 is brought up on vxlan vni 100. mac learning is disabled (mac-learning: false) and no remote is defined - bgp evpn handles endpoint discovery instead.
- lo has two /32 ips - 10.0.200.2 is the primary address for the server. 10.0.200.0/32 is duplicated across all nodes to enable anycast HA kubectl access.
net.ipv4.conf.all.forwarding=1 is of course enabled.
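One thing the points above don't spell out is why vxlan100 sits at mtu 8950 while the physical ports are at 9000. A quick sketch of the arithmetic, assuming an ipv4 underlay (which the `local: 10.0.200.2` setting suggests) and the standard header sizes; the function name is just something I made up for illustration:

```python
# VXLAN over IPv4 wraps each inner Ethernet frame in new outer headers.
INNER_ETHERNET = 14  # inner frame's Ethernet header, carried as payload
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20      # assumes an IPv4 underlay with no IP options

def vxlan_mtu(underlay_mtu: int) -> int:
    """Largest inner MTU that fits in one underlay packet without fragmenting."""
    overhead = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4
    return underlay_mtu - overhead

print(vxlan_mtu(9000))  # 8950, the mtu set on vxlan100
```

An ipv6 underlay would need another 20 bytes of headroom, since the outer header grows from 20 to 40 bytes.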
Having tried bird and quagga, I find frr the most full-featured and easy-to-template option for routing on linux. With the interface configuration in place, here’s /etc/frr/frr.conf:
log syslog
debug bgp
debug zebra vxlan
debug zebra evpn mh es
debug zebra evpn mh mac
debug zebra evpn mh neigh
frr defaults datacenter
zebra nexthop-group keep 1
router bgp 64791
 bgp router-id 10.0.200.2
 bgp fast-convergence
 bgp bestpath compare-routerid
 bgp bestpath as-path multipath-relax
 neighbor bgpenp35s0 interface remote-as external
 neighbor bgpenp36s0 interface remote-as external
 address-family ipv4 unicast
  neighbor bgpenp35s0 route-map default in
  neighbor bgpenp36s0 route-map default in
  redistribute connected
 address-family l2vpn evpn
  neighbor bgpenp35s0 activate
  neighbor bgpenp36s0 activate
  advertise-all-vni
ip prefix-list p1 permit 10.0.0.0/24 ge 32
ip prefix-list p1 permit 192.168.1.0/24 ge 32
ip prefix-list p1 permit 10.0.100.0/24 ge 32
ip prefix-list p1 permit 10.0.200.0/24 ge 32
ip prefix-list p1 permit 172.16.0.0/16 le 26
ip prefix-list p1 permit 172.30.0.0/16 le 27
ip prefix-list p1 permit 0.0.0.0/0
route-map default permit 10
 match ip address prefix-list p1
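Here we’re telling bgp to peer on each vlan interface, advertise all connected routes, and accept only what prefix-list p1 permits: /32s under 10.0.0.0/24, 192.168.1.0/24, 10.0.100.0/24 and 10.0.200.0/24, prefixes up to /26 under 172.16.0.0/16 and up to /27 under 172.30.0.0/16, and the default route. I should filter the advertised routes too, but I haven’t gotten to it.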
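The ge/le matching in the prefix-list is easy to misread, so here is a rough model of it in python. This is my own simplified reading of the semantics (no ge/le means exact match, ge sets a minimum length, le sets a maximum, anything unmatched is denied), not frr code, and the function names are made up:

```python
import ipaddress

# (network, ge, le) entries mirroring prefix-list p1; None means "not set".
P1 = [
    ("10.0.0.0/24", 32, None),
    ("192.168.1.0/24", 32, None),
    ("10.0.100.0/24", 32, None),
    ("10.0.200.0/24", 32, None),
    ("172.16.0.0/16", None, 26),
    ("172.30.0.0/16", None, 27),
    ("0.0.0.0/0", None, None),
]

def entry_matches(route, network, ge, le):
    """True if route falls inside network and its length satisfies ge/le."""
    net = ipaddress.ip_network(network)
    if not route.subnet_of(net):
        return False
    if ge is None and le is None:
        return route.prefixlen == net.prefixlen  # exact match only
    low = ge if ge is not None else net.prefixlen
    high = le if le is not None else route.max_prefixlen
    return low <= route.prefixlen <= high

def permitted(prefix, entries=P1):
    """All entries are permits, so a route passes if any entry matches."""
    route = ipaddress.ip_network(prefix)
    return any(entry_matches(route, *entry) for entry in entries)

for candidate in ["10.0.200.7/32", "10.0.200.0/24", "172.16.5.0/24",
                  "172.16.5.0/28", "0.0.0.0/0", "8.8.8.0/24"]:
    print(candidate, permitted(candidate))
```

Note that the last entry only lets the default route itself through. It is not a catch-all, because without ge or le the length has to match exactly.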
bgp bestpath as-path multipath-relax allows ecmp over dissimilar, but equal-length AS paths.
address-family l2vpn evpn enables evpn for vxlan endpoint advertisement and learning.
zebra nexthop-group keep 1 is a workaround for an apparent ecmp / fib synchronization bug.
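To illustrate the multipath-relax point: as I understand the frr docs, without the knob the entire AS path has to match before two paths can share an ecmp group, and with it only the length has to. A toy model, assuming the paths are already equal on everything else in the decision process; the ASNs are hypothetical and the function is purely illustrative:

```python
def can_multipath(path_a, path_b, relax=False):
    """Simplified model: may two otherwise-equal bgp paths share an ecmp group?"""
    if relax:
        return len(path_a) == len(path_b)  # only AS path length must agree
    return path_a == path_b                # the whole AS path must agree

# Hypothetical ASNs: two spines in different ASes, both leading to the same leaf.
via_spine_1 = [65001, 64800]
via_spine_2 = [65002, 64800]

print(can_multipath(via_spine_1, via_spine_2))              # False
print(can_multipath(via_spine_1, via_spine_2, relax=True))  # True
```

If both spines shared a single AS, the paths would be identical and the knob would make no difference for this case.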