These are my notes on getting EVPN Multihoming working on x86 Debian routers.
I have a number of servers running unnumbered BGP-to-the-host, each plugged into two spine routers. The routers themselves are low-power Denverton C3000 boxes; everything runs FRR. This is a poor man's Clos datacenter network, since properly licensed ONIE hardware seems to be unobtainium for personal use. Besides, a pair of surplus datacenter switches would certainly not fit in my power budget.
Besides the differences in interface configuration, most of the information on this page should be transferable to Cumulus Linux or similar.
With the dataplane ‘solved’, there is still a need for a basic (but multi-homed) L2 segment in the rack for things like IPMI and UPS access. It turns out the modern BGP EVPN datacenter has an open solution for this: EVPN Multihoming. In EVPN-MH, Ethernet Segment IDs inform the EVPN control plane of L2 segments with links to multiple VXLAN VTEPs (virtual tunnel endpoints), preventing loops and enabling active-active multipathing. This is typically used as a replacement for MLAG from leaves to servers, but it should also work from spines to an L2 leaf switch. All the switch needs to support is 802.3ad LACP.
What all this means is I can plug a cheap (low power) prosumer switch into both spines, configure the vxlan and ethernet segment on the relevant port of each spine, and get multipathing and failover access to the switch with no additional hardware. The vxlan still needs a north/south gateway somewhere in the network; this page does not cover that. (future work - anycast vxlan gateway on linux?)
Here's what it all looks like:
There is a ‘blue’ spine and a ‘yellow’ spine. A blue/yellow pair plugs into ports 23 and 24 on the switch. The switch is a ProCurve 1800-24G, easily available for under $50 on eBay. Bread ties aren’t my cable management tool of choice, btw; they came on the cables, so may as well reduce and reuse.
Here is a logical representation. Green links represent ipmi and ups access:
Spines are 10.0.0.10 and 10.0.0.11. Here is netplan.yml from 10.0.0.10.
The important bits here are enp2s0f0, vxlan100, br-vxlan100, and vxlan100-access. The switch is plugged into enp2s0f0 on each spine. I did try applying the frr evpn-mh config directly to enp2s0f0, but it did not work; the access ports need to be single-port bonds.
network:
  version: 2
  renderer: networkd
  ethernets:
    lo:
      addresses:
        - 10.0.0.10/32
    enp2s0f0:
      dhcp4: no
    enp2s0f1:
      addresses: ["172.30.201.129/27"]
      optional: true
      mtu: 9000
    enp2s0f2:
      addresses: ["172.30.92.161/27"]
      optional: true
      mtu: 9000
    enp2s0f3:
      addresses: ["172.30.188.33/27"]
      optional: true
      mtu: 9000
    enp6s0f0:
      addresses: ["172.30.242.97/27"]
      optional: true
      mtu: 9000
    enp6s0f1:
      addresses: ["172.30.241.33/27"]
      optional: true
      mtu: 9000
    enp8s0f0:
      addresses: ["172.30.34.65/27"]
      optional: true
      mtu: 9000
    enp8s0f1:
      addresses: ["172.30.160.97/27"]
      optional: true
      mtu: 9000
  vlans:
    bgpenp2s0f1:
      link: enp2s0f1
      id: 10
    bgpenp2s0f2:
      link: enp2s0f2
      id: 10
    bgpenp2s0f3:
      link: enp2s0f3
      id: 10
    bgpenp6s0f0:
      link: enp6s0f0
      id: 10
    bgpenp6s0f1:
      link: enp6s0f1
      id: 10
    bgpenp8s0f0:
      link: enp8s0f0
      id: 10
    bgpenp8s0f1:
      link: enp8s0f1
      id: 10
  tunnels:
    vxlan100:
      mode: vxlan
      id: 100
      local: 10.0.0.10
      mac-learning: false
      mtu: 8950
  bonds:
    vxlan100-access:
      macaddress: 00:00:10:00:00:10
      dhcp4: no
      interfaces:
        - enp2s0f0
      parameters:
        mode: 802.3ad
  bridges:
    br-dhcpd:
      addresses: ["172.30.11.1/27"]
    br-vxlan100:
      interfaces:
        - vxlan100
        - vxlan100-access
In summary, 10.0.0.10 is the router's primary IP and BGP router-id. vxlan100 is the VTEP. It is bridged to br-vxlan100 for further use. enp2s0f0 is brought up with no IP. vxlan100-access is brought up as an LACP bond containing enp2s0f0. vxlan100-access is bridged to br-vxlan100. The ethernet segment sys mac is assigned to vxlan100-access and is duplicated across all instances.
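One detail of the netplan file not spelled out above is the 50-byte gap between the underlay mtu (9000) and the vxlan100 mtu (8950). That gap matches the usual VXLAN-over-IPv4 encapsulation overhead. The sketch below is only an illustration of that arithmetic; the header sizes assume an IPv4 underlay with no IP options and an untagged inner frame, and the function name is hypothetical.

```python
# Assumed header sizes in bytes for VXLAN over an IPv4 underlay.
OUTER_IPV4 = 20      # outer IPv4 header, no options
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header carrying the VNI
INNER_ETHERNET = 14  # Ethernet header of the encapsulated frame, untagged


def vxlan_mtu(underlay_mtu: int) -> int:
    """Return the largest inner IP packet that fits in one underlay packet."""
    overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead


print(vxlan_mtu(9000))  # 8950, the mtu set on vxlan100
```

With an IPv6 underlay the outer IP header is 40 bytes, so the same calculation would give a smaller value.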
Here is /etc/frr/frr.conf on 10.0.0.10. The important part here is interface vxlan100-access, and of course address-family l2vpn evpn et al. The same es-id and es-sys-mac are configured on vxlan100-access of 10.0.0.11.
log syslog
debug bgp
debug zebra vxlan
debug zebra evpn mh es
debug zebra evpn mh mac
debug zebra evpn mh neigh
frr defaults datacenter
zebra nexthop-group keep 1
interface vxlan100-access
 evpn mh es-id 10
 evpn mh es-sys-mac 00:00:10:00:00:10
 evpn mh uplink
router bgp 64512
 bgp router-id 10.0.0.10
 bgp fast-convergence
 bgp bestpath compare-routerid
 bgp bestpath as-path multipath-relax
 neighbor bgpenp2s0f1 interface remote-as external
 neighbor bgpenp2s0f2 interface remote-as external
 neighbor bgpenp2s0f3 interface remote-as external
 neighbor bgpenp6s0f0 interface remote-as external
 neighbor bgpenp6s0f1 interface remote-as external
 neighbor bgpenp8s0f0 interface remote-as external
 neighbor bgpenp8s0f1 interface remote-as external
 address-family ipv4 unicast
  neighbor bgpenp2s0f1 route-map default in
  neighbor bgpenp2s0f2 route-map default in
  neighbor bgpenp2s0f3 route-map default in
  neighbor bgpenp6s0f0 route-map default in
  neighbor bgpenp6s0f1 route-map default in
  neighbor bgpenp8s0f0 route-map default in
  neighbor bgpenp8s0f1 route-map default in
  redistribute connected
 address-family l2vpn evpn
  neighbor bgpenp2s0f1 activate
  neighbor bgpenp2s0f2 activate
  neighbor bgpenp2s0f3 activate
  neighbor bgpenp6s0f0 activate
  neighbor bgpenp6s0f1 activate
  neighbor bgpenp8s0f0 activate
  neighbor bgpenp8s0f1 activate
  advertise-all-vni
# match /32s under 10.0.0.0/24. will not match 10.0.0.1/24-10.0.0.254/24
ip prefix-list p1 permit 10.0.0.0/24 ge 32
ip prefix-list p1 permit 192.168.1.0/24 ge 32
ip prefix-list p1 permit 10.0.100.0/24 ge 32
ip prefix-list p1 permit 10.0.200.0/24 ge 32
ip prefix-list p1 permit 172.16.0.0/16 le 26
ip prefix-list p1 permit 172.30.0.0/16 le 27
ip prefix-list p1 permit 0.0.0.0/0
route-map default permit 10
 match ip address prefix-list p1
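The ge/le qualifiers in the prefix-list above are easy to misread, so here is a rough Python model of how a single entry is matched. It is a sketch of the usual prefix-list semantics, not FRR code: the function name is made up, and it models only one permit entry rather than the whole ordered list.

```python
import ipaddress


def prefix_list_entry_matches(route, entry, ge=None, le=None):
    """Model one 'ip prefix-list ... permit ENTRY [ge N] [le M]' line."""
    route = ipaddress.ip_network(route)
    entry = ipaddress.ip_network(entry)
    # The route must fall inside the entry's address range.
    if not route.subnet_of(entry):
        return False
    # Without ge/le, only the exact prefix length matches.
    if ge is None and le is None:
        return route.prefixlen == entry.prefixlen
    # Otherwise the route's prefix length must sit in [ge, le].
    low = ge if ge is not None else entry.prefixlen
    high = le if le is not None else route.max_prefixlen
    return low <= route.prefixlen <= high


# A loopback /32 under 10.0.0.0/24 is accepted by 'ge 32'.
print(prefix_list_entry_matches("10.0.0.10/32", "10.0.0.0/24", ge=32))
# The /24 itself is not.
print(prefix_list_entry_matches("10.0.0.0/24", "10.0.0.0/24", ge=32))
```

Read this way, the final `permit 0.0.0.0/0` line accepts only the default route itself, not every prefix.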
If it's all working, we will see the Ethernet Segment and associated VTEPs from another leaf in the network:
x10slhf-xeon-9c3ab# show evpn es
Type: B bypass, L local, R remote, N non-DF
ESI                            Type ES-IF                 VTEPs
03:00:00:10:00:00:10:00:00:0a  R    -                     10.0.0.10,10.0.0.11
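The ESI shown here lines up with the es-sys-mac and es-id configured on the spines: it looks like a type 3 (MAC-based) ESI, i.e. a 0x03 type byte, the 6-byte system MAC, then a 3-byte local discriminator. The helper below is a hypothetical sketch that rebuilds the value under that assumption; it is handy for predicting what `show evpn es` should print before touching the routers.

```python
def type3_esi(sys_mac: str, es_id: int) -> str:
    """Build a type 3 ESI string from a system MAC and a local discriminator."""
    mac = bytes.fromhex(sys_mac.replace(":", ""))
    if len(mac) != 6:
        raise ValueError("sys_mac must be a 6-byte MAC address")
    if not 1 <= es_id <= 0xFFFFFF:
        raise ValueError("es_id must fit in 3 bytes and be non-zero")
    # Type byte, then system MAC, then es-id as a 3-byte big-endian value.
    esi = bytes([0x03]) + mac + es_id.to_bytes(3, "big")
    return ":".join(f"{b:02x}" for b in esi)


print(type3_esi("00:00:10:00:00:10", 10))
# 03:00:00:10:00:00:10:00:00:0a
```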
We can also see which MACs are bound to the Ethernet Segment 03:00:00:10:00:00:10:00:00:0a:
x10slhf-xeon-9c3ab# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 7
Flags: N=sync-neighs, I=local-inactive, P=peer-active, X=peer-proxy
MAC               Type   Flags Intf/Remote ES/VTEP            VLAN  Seq #'s
d0:50:99:f2:39:e0 remote       03:00:00:10:00:00:10:00:00:0a        0/0
00:40:9d:43:35:97 remote       03:00:00:10:00:00:10:00:00:0a        0/0
3c:ec:ef:8c:1d:63 remote       03:00:00:10:00:00:10:00:00:0a        0/0
3c:ec:ef:8c:1e:0a remote       03:00:00:10:00:00:10:00:00:0a        0/0
a8:a1:59:08:bb:cb remote       03:00:00:10:00:00:10:00:00:0a        0/0
ee:2a:d0:35:be:dd remote       03:00:00:10:00:00:10:00:00:0a        0/0
00:1b:3f:9d:5d:c0 remote       03:00:00:10:00:00:10:00:00:0a        0/0
From the spines themselves, the ES and macs report as directly attached to the access interface:
nca1515-denverton-1983d# sho evpn es
Type: B bypass, L local, R remote, N non-DF
ESI                            Type ES-IF                 VTEPs
03:00:00:10:00:00:10:00:00:0a  L    vxlan100-access
nca1515-denverton-1983d# sho evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 6
Flags: N=sync-neighs, I=local-inactive, P=peer-active, X=peer-proxy
MAC               Type   Flags Intf/Remote ES/VTEP            VLAN  Seq #'s
d0:50:99:f2:39:e0 local        vxlan100-access                1     0/0
00:40:9d:43:35:97 local        vxlan100-access                1     0/0
3c:ec:ef:8c:1d:63 local        vxlan100-access                1     0/0
a8:a1:59:08:bb:cb local        vxlan100-access                1     0/0
3c:ec:ef:8c:1e:0a local        vxlan100-access                1     0/0
00:1b:3f:9d:5d:c0 local        vxlan100-access                1     0/0
With all that in place, I can reboot one spine at a time without dropping ipmi remote console sessions:
rebooting spine .10:
and spine .11:
Watching tcpdump on each access port, multipathing is clearly working as intended. I also have telegraf shipping metrics to influxdb from each spine (this is the benefit of linux routing; be it ONIE or homebrew, we can run our standard devops stack). Here are tx and rx of each access port after running ipmi consoles for about an hour:
Here's evidence that we are indeed fooling the switch into thinking there is a single LACP peer on the other side:
I am of course not writing all of these configs by hand. Here is an Ansible template snippet for frr.conf:
{% if vxlan_access is defined %}
{% for i in vxlan_access %}
interface vxlan{{i.vni}}-access
 evpn mh es-id {{i.es_id}}
 evpn mh es-sys-mac {{i.es_sys_mac}}
 evpn mh uplink
{% endfor %}
{% endif %}
router bgp {% if asn is defined %}{{asn}}{% else %}{{ (1121 | random(seed=inventory_hostname)) + 64512 }}{% endif %}
 bgp router-id {{ router_ip }}
 bgp fast-convergence
 bgp bestpath compare-routerid
 bgp bestpath as-path multipath-relax
{% for i in interfaces|sort %}
 neighbor bgp{{i}} interface remote-as external
{% endfor %}
 address-family ipv4 unicast
{% for i in interfaces|sort %}
  neighbor bgp{{i}} route-map default in
{% endfor %}
  redistribute connected
{% if router_advertise is defined %}
{% for i in router_advertise|sort %}
  network {{i}}
{% endfor %}
{% endif %}
 address-family l2vpn evpn
{% for i in interfaces|sort %}
  neighbor bgp{{i}} activate
{% endfor %}
  advertise-all-vni
and netplan.yml:
network:
  version: 2
  renderer: networkd
  ethernets:
    lo:
      addresses:
        - {{router_ip}}/32
{% if anycast_addresses is defined %}
{% for i in anycast_addresses %}
        - {{i}}
{% endfor %}
{% endif %}
{% if vxlan_access is defined %}
{% for i in vxlan_access %}
    {{i.port}}:
      dhcp4: no
{% endfor %}
{% endif %}
{% for i in interfaces|sort %}
    {{i}}:
{% if l2_access is defined %}
      addresses: ["172.30.{{ 255 | random(seed=inventory_hostname+i)}}.{{ (8 | random(seed=inventory_hostname+i)*32) + 1}}/27"]
{% endif %}
      optional: true
      mtu: 9000
{% endfor %}
  vlans:
{% for i in interfaces|sort %}
    bgp{{i}}:
      link: {{i}}
      id: 10
{% endfor %}
{% if vxlans is defined %}
  tunnels:
{% for v in vxlans %}
    vxlan{{v}}:
      mode: vxlan
      id: {{v}}
      local: {{router_ip}}
      mac-learning: false
      mtu: 8950
{% endfor %}
{% endif %}
{% if vxlan_access is defined %}
  bonds:
{% for i in vxlan_access %}
    vxlan{{i.vni}}-access:
      macaddress: {{i.es_sys_mac}}
      dhcp4: no
      interfaces:
        - {{i.port}}
      parameters:
        mode: 802.3ad
{% endfor %}
{% endif %}
{% if l2_access is defined or vxlans is defined %}
  bridges:
{% endif %}
{% if l2_access is defined %}
    br-dhcpd:
      addresses: ["172.30.{{ 255 | random(seed=inventory_hostname)}}.{{ (8 | random(seed=inventory_hostname)*32) + 1}}/27"]
{% endif %}
{% if vxlans is defined %}
{% for v in vxlans %}
    br-vxlan{{v}}:
      interfaces:
        - vxlan{{v}}
{% if vxlan_access is defined %}
{% for i in vxlan_access %}
{% if i.vni == v %}
        - vxlan{{i.vni}}-access
{% endif %}
{% endfor %}
{% endif %}
{% endfor %}
{% endif %}
There is a lot of unrelated configuration in here; ultimately, if the list vxlan_access is defined, each entry gets its appropriate bond, ES, etc. provisioned. This structure allows me to provision physical access to multiple vxlans if needed.
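One of the unrelated bits worth a note is the seeded-random address expression in the netplan template. It picks a third octet in 0..254 and a fourth octet of the form k*32 + 1 with k in 0..7, so every result is the first host address of some /27 under 172.30.0.0/16. The Python sketch below only mirrors that arithmetic (it does not reproduce Ansible's seeded random filter, and the function names are hypothetical) and checks it against the addresses rendered for 10.0.0.10.

```python
import ipaddress


def link_address(third_octet: int, block: int) -> str:
    """Mirror the template arithmetic: 172.30.<0..254>.<block*32 + 1>/27."""
    if not 0 <= third_octet <= 254 or not 0 <= block <= 7:
        raise ValueError("octet or block outside the range the template can produce")
    return f"172.30.{third_octet}.{block * 32 + 1}/27"


def fits_template(addr: str) -> bool:
    """True if addr is the first host of a /27 inside 172.30.0.0/16."""
    iface = ipaddress.ip_interface(addr)
    first_host = iface.network.network_address + 1
    return (
        iface.network.prefixlen == 27
        and iface.ip == first_host
        and iface.ip in ipaddress.ip_network("172.30.0.0/16")
    )


# Addresses taken from the rendered netplan.yml of 10.0.0.10.
rendered = ["172.30.201.129/27", "172.30.92.161/27", "172.30.11.1/27"]
print(all(fits_template(a) for a in rendered))  # True
print(link_address(201, 4))  # 172.30.201.129/27
```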
These templates refer to ‘dry’ inventory vars:
spine:
  hosts:
    10.0.0.10:
      router_ip: 10.0.0.10
    10.0.0.11:
      router_ip: 10.0.0.11
  vars:
    anycast_addresses: []
    reserved_ports: [enp2s0f0]
    asn: 64512
    l2_access: True
    vxlans: [100]
    vxlan_access:
      - port: enp2s0f0
        vni: 100
        es_id: 10
        es_sys_mac: 00:00:10:00:00:10