Low cost Multipath VXLAN access with EVPN-MH on Debian

These are my notes on getting EVPN Multihoming working on x86 Debian routers.

I have a number of servers running unnumbered BGP-to-the-host, each plugged into two spine routers. The routers themselves are low-power Denverton C3000 boxes, and everything runs FRR. This is a poor man's Clos datacenter network, since properly licensed ONIE hardware seems to be unobtainium for personal use. Besides, a pair of surplus datacenter switches would certainly not fit in my power budget.

Aside from the differences in interface configuration, most of the information on this page should be transferable to Cumulus Linux or similar.

With the dataplane ‘solved’, there is still a need for a basic (but multi-homed) L2 segment in the rack for things like IPMI and UPS access. It turns out the modern BGP EVPN datacenter has an open solution for this: EVPN Multihoming. In EVPN-MH, Ethernet Segment IDs (ESIs) inform the EVPN control plane of L2 segments with links to multiple VXLAN VTEPs (virtual tunnel endpoints), preventing loops and enabling active-active multipathing. This is typically used as a replacement for MLAG from leafs to servers, but it should also work from spines to an L2 leaf switch. All the switch needs to support is 802.3ad LACP.

What all this means is that I can plug a cheap (low-power) prosumer switch into both spines, configure the VXLAN and Ethernet Segment on the relevant port of each spine, and get multipathing and failover access to the switch with no additional hardware. The VXLAN still needs a north/south gateway somewhere in the network; this page does not cover that. (Future work: anycast VXLAN gateway on Linux?)


Here's what it all looks like:

There is a ‘blue’ spine and a ‘yellow’ spine. A blue/yellow pair plugs into ports 23 and 24 on the switch. The switch is a ProCurve 1800-24G, easily available for under $50 on eBay. Bread ties aren’t my cable management tool of choice, by the way; they came on the cables, so I may as well reduce and reuse.

Here is a logical representation. Green links represent IPMI and UPS access:


Spines are 10.0.0.10 and 10.0.0.11. Here is netplan.yml from 10.0.0.10.

The important bits here are enp2s0f0, vxlan100, br-vxlan100, and vxlan100-access. The switch is plugged into enp2s0f0 on each spine. I did try applying the FRR EVPN-MH config directly to enp2s0f0, but it did not work; the Ethernet Segment interface needs to be a bond, even if that bond has only a single port.

network:
  version: 2
  renderer: networkd
  ethernets:
    lo:
      addresses:
        - 10.0.0.10/32
    enp2s0f0:
      dhcp4: no
    enp2s0f1:
      addresses: ["172.30.201.129/27"]
      optional: true
      mtu: 9000
    enp2s0f2:
      addresses: ["172.30.92.161/27"]
      optional: true
      mtu: 9000
    enp2s0f3:
      addresses: ["172.30.188.33/27"]
      optional: true
      mtu: 9000
    enp6s0f0:
      addresses: ["172.30.242.97/27"]
      optional: true
      mtu: 9000
    enp6s0f1:
      addresses: ["172.30.241.33/27"]
      optional: true
      mtu: 9000
    enp8s0f0:
      addresses: ["172.30.34.65/27"]
      optional: true
      mtu: 9000
    enp8s0f1:
      addresses: ["172.30.160.97/27"]
      optional: true
      mtu: 9000
  vlans:
    bgpenp2s0f1:
      link: enp2s0f1
      id: 10
    bgpenp2s0f2:
      link: enp2s0f2
      id: 10
    bgpenp2s0f3:
      link: enp2s0f3
      id: 10
    bgpenp6s0f0:
      link: enp6s0f0
      id: 10
    bgpenp6s0f1:
      link: enp6s0f1
      id: 10
    bgpenp8s0f0:
      link: enp8s0f0
      id: 10
    bgpenp8s0f1:
      link: enp8s0f1
      id: 10
  tunnels:
    vxlan100:
      mode: vxlan
      id: 100
      local: 10.0.0.10
      mac-learning: false
      mtu: 8950
  bonds:
    vxlan100-access:
      macaddress: 00:00:10:00:00:10
      dhcp4: no
      interfaces:
        - enp2s0f0
      parameters:
        mode: 802.3ad
  bridges:
    br-dhcpd:
      addresses: ["172.30.11.1/27"]
    br-vxlan100:
      interfaces:
        - vxlan100
        - vxlan100-access

In summary, 10.0.0.10 is the router's primary IP and BGP router-id. vxlan100 is the VTEP; it is bridged into br-vxlan100 for further use. enp2s0f0 is brought up with no IP. vxlan100-access is brought up as an LACP bond containing enp2s0f0, and is also bridged into br-vxlan100. The Ethernet Segment system MAC is assigned to vxlan100-access and is identical on every spine.
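
The MTU values in the netplan file are worth a quick sanity check. The physical ports run at 9000 and the VXLAN interface at 8950, which lines up with the usual 50 bytes of VXLAN encapsulation overhead on an IPv4 underlay. The sketch below only reproduces that arithmetic; the function name `vxlan_mtu` is made up for illustration.

```python
# Per-packet overhead added by VXLAN encapsulation on an IPv4 underlay, in bytes.
INNER_ETHERNET = 14  # inner Ethernet header carried inside the tunnel
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20      # an IPv6 underlay would need 40 here instead


def vxlan_mtu(underlay_mtu: int) -> int:
    """Largest MTU the VXLAN interface can use without exceeding the underlay MTU."""
    overhead = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4
    return underlay_mtu - overhead


print(vxlan_mtu(9000))  # 8950, the value used for vxlan100
print(vxlan_mtu(1500))  # 1450, for a non-jumbo underlay
```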

Here is /etc/frr/frr.conf on 10.0.0.10. The important part here is interface vxlan100-access, and of course address-family l2vpn evpn et al. The same es-id and es-sys-mac are configured on vxlan100-access of 10.0.0.11.

log syslog
debug bgp
debug zebra vxlan
debug zebra evpn mh es
debug zebra evpn mh mac
debug zebra evpn mh neigh
frr defaults datacenter
zebra nexthop-group keep 1

interface vxlan100-access
  evpn mh es-id 10
  evpn mh es-sys-mac 00:00:10:00:00:10
  evpn mh uplink

router bgp 64512
  bgp router-id 10.0.0.10
  bgp fast-convergence
  bgp bestpath compare-routerid
  bgp bestpath as-path multipath-relax
  neighbor bgpenp2s0f1 interface remote-as external
  neighbor bgpenp2s0f2 interface remote-as external
  neighbor bgpenp2s0f3 interface remote-as external
  neighbor bgpenp6s0f0 interface remote-as external
  neighbor bgpenp6s0f1 interface remote-as external
  neighbor bgpenp8s0f0 interface remote-as external
  neighbor bgpenp8s0f1 interface remote-as external
  address-family ipv4 unicast
    neighbor bgpenp2s0f1 route-map default in
    neighbor bgpenp2s0f2 route-map default in
    neighbor bgpenp2s0f3 route-map default in
    neighbor bgpenp6s0f0 route-map default in
    neighbor bgpenp6s0f1 route-map default in
    neighbor bgpenp8s0f0 route-map default in
    neighbor bgpenp8s0f1 route-map default in
    redistribute connected
  address-family l2vpn evpn
    neighbor bgpenp2s0f1 activate
    neighbor bgpenp2s0f2 activate
    neighbor bgpenp2s0f3 activate
    neighbor bgpenp6s0f0 activate
    neighbor bgpenp6s0f1 activate
    neighbor bgpenp8s0f0 activate
    neighbor bgpenp8s0f1 activate
    advertise-all-vni

# match only /32 host routes inside 10.0.0.0/24; the /24 itself and other shorter prefixes will not match
ip prefix-list p1 permit 10.0.0.0/24 ge 32
ip prefix-list p1 permit 192.168.1.0/24 ge 32
ip prefix-list p1 permit 10.0.100.0/24 ge 32
ip prefix-list p1 permit 10.0.200.0/24 ge 32
ip prefix-list p1 permit 172.16.0.0/16 le 26
ip prefix-list p1 permit 172.30.0.0/16 le 27
ip prefix-list p1 permit 0.0.0.0/0

route-map default permit 10
  match ip address prefix-list p1
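
The `ge` and `le` qualifiers in the prefix-list are easy to misread, so here is a minimal Python sketch of how I understand the matching rules: a route matches an entry if it falls inside the entry's network and its prefix length is within the `ge`/`le` window, and an entry with neither qualifier matches only that exact prefix. The `Entry` type and the `permitted` helper are hypothetical names for illustration; this is a model of the semantics, not FRR code.

```python
import ipaddress
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    network: str
    ge: Optional[int] = None
    le: Optional[int] = None


# The permit entries of prefix-list p1 from frr.conf above.
P1 = [
    Entry("10.0.0.0/24", ge=32),
    Entry("192.168.1.0/24", ge=32),
    Entry("10.0.100.0/24", ge=32),
    Entry("10.0.200.0/24", ge=32),
    Entry("172.16.0.0/16", le=26),
    Entry("172.30.0.0/16", le=27),
    Entry("0.0.0.0/0"),
]


def entry_matches(entry: Entry, route: str) -> bool:
    net = ipaddress.ip_network(entry.network)
    rt = ipaddress.ip_network(route)
    if entry.ge is None and entry.le is None:
        return rt == net  # no qualifiers: exact prefix match only
    low = entry.ge if entry.ge is not None else net.prefixlen
    high = entry.le if entry.le is not None else rt.max_prefixlen
    return rt.subnet_of(net) and low <= rt.prefixlen <= high


def permitted(route: str) -> bool:
    # Anything that matches no entry is implicitly denied.
    return any(entry_matches(e, route) for e in P1)


for route in ["10.0.0.10/32", "10.0.0.0/24", "172.30.201.128/27", "0.0.0.0/0"]:
    print(route, permitted(route))
```

So the spine loopbacks (/32s) and the /27 transit subnets are accepted, along with the default route, while the covering /24 is not.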

If it's all working, we will see the Ethernet Segment and associated VTEPs from another leaf in the network:

x10slhf-xeon-9c3ab# show evpn es
Type: B bypass, L local, R remote, N non-DF
ESI                            Type ES-IF                 VTEPs
03:00:00:10:00:00:10:00:00:0a  R    -                     10.0.0.10,10.0.0.11
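
The ESI in this output is not arbitrary. It has the shape of a type 3 ESI from RFC 7432: one type octet (03), the 6-octet system MAC, and a 3-octet local discriminator. That matches the es-sys-mac and es-id configured in frr.conf. A minimal sketch of the composition, with `type3_esi` being a name invented for this example:

```python
def type3_esi(es_sys_mac: str, es_id: int) -> str:
    """Compose a type 3 ESI: 0x03, then the system MAC, then a 24-bit discriminator."""
    mac = bytes.fromhex(es_sys_mac.replace(":", ""))
    if len(mac) != 6:
        raise ValueError("es_sys_mac must be a 6-octet MAC address")
    if not 0 <= es_id < 2**24:
        raise ValueError("es_id must fit in 3 octets")
    raw = bytes([0x03]) + mac + es_id.to_bytes(3, "big")
    return ":".join(f"{octet:02x}" for octet in raw)


print(type3_esi("00:00:10:00:00:10", 10))  # 03:00:00:10:00:00:10:00:00:0a
```

This is also why both spines must carry the same es-id and es-sys-mac: different values would produce two different segments.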

We can also see which macs are bound to the Ethernet Segment 03:00:00:10:00:00:10:00:00:0a:

x10slhf-xeon-9c3ab# show evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 7
Flags: N=sync-neighs, I=local-inactive, P=peer-active, X=peer-proxy
MAC               Type   Flags Intf/Remote ES/VTEP            VLAN  Seq #'s
d0:50:99:f2:39:e0 remote       03:00:00:10:00:00:10:00:00:0a        0/0
00:40:9d:43:35:97 remote       03:00:00:10:00:00:10:00:00:0a        0/0
3c:ec:ef:8c:1d:63 remote       03:00:00:10:00:00:10:00:00:0a        0/0
3c:ec:ef:8c:1e:0a remote       03:00:00:10:00:00:10:00:00:0a        0/0
a8:a1:59:08:bb:cb remote       03:00:00:10:00:00:10:00:00:0a        0/0
ee:2a:d0:35:be:dd remote       03:00:00:10:00:00:10:00:00:0a        0/0
00:1b:3f:9d:5d:c0 remote       03:00:00:10:00:00:10:00:00:0a        0/0

From the spines themselves, the ES and macs report as directly attached to the access interface:

nca1515-denverton-1983d# sho evpn es
Type: B bypass, L local, R remote, N non-DF
ESI                            Type ES-IF                 VTEPs
03:00:00:10:00:00:10:00:00:0a  L    vxlan100-access
nca1515-denverton-1983d# sho evpn mac vni 100
Number of MACs (local and remote) known for this VNI: 6
Flags: N=sync-neighs, I=local-inactive, P=peer-active, X=peer-proxy
MAC               Type   Flags Intf/Remote ES/VTEP            VLAN  Seq #'s
d0:50:99:f2:39:e0 local        vxlan100-access                1     0/0
00:40:9d:43:35:97 local        vxlan100-access                1     0/0
3c:ec:ef:8c:1d:63 local        vxlan100-access                1     0/0
a8:a1:59:08:bb:cb local        vxlan100-access                1     0/0
3c:ec:ef:8c:1e:0a local        vxlan100-access                1     0/0
00:1b:3f:9d:5d:c0 local        vxlan100-access                1     0/0
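
If you want to check this from a script rather than by eye, the `show evpn es` table is simple enough to parse. The sketch below assumes the whitespace-separated column layout shown in the output above and nothing more; `parse_evpn_es` is a hypothetical helper, not part of FRR.

```python
def parse_evpn_es(output):
    """Parse the text of `show evpn es` into a list of dicts, one per segment."""
    segments = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a 10-octet ESI (nine colons); skip everything else.
        if len(fields) < 3 or fields[0].count(":") != 9:
            continue
        segments.append({
            "esi": fields[0],
            "type": fields[1],
            "es_if": None if fields[2] == "-" else fields[2],
            "vteps": fields[3].split(",") if len(fields) > 3 else [],
        })
    return segments


LEAF_VIEW = """\
Type: B bypass, L local, R remote, N non-DF
ESI                            Type ES-IF                 VTEPs
03:00:00:10:00:00:10:00:00:0a  R    -                     10.0.0.10,10.0.0.11
"""

for segment in parse_evpn_es(LEAF_VIEW):
    # A healthy multihomed segment, seen from a leaf, lists both spines.
    print(segment["esi"], segment["vteps"], len(segment["vteps"]) == 2)
```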

With all that in place, I can reboot one spine at a time without dropping IPMI remote console sessions:

rebooting spine .10:

and spine .11:

Watching tcpdump on each access port, multipathing is clearly working as intended. I also have Telegraf shipping metrics to InfluxDB from each spine (this is the benefit of Linux routing: be it ONIE or homebrew, we can run our standard devops stack). Here are tx and rx of each access port after running IPMI consoles for about an hour:

Here's evidence that we are indeed fooling the switch into thinking there is a single LACP peer on the other side:


I am of course not writing all of these configs by hand. Here is the Ansible (Jinja2) template for frr.conf:

{% if vxlan_access is defined %}
{% for i in vxlan_access %}
interface vxlan{{i.vni}}-access
  evpn mh es-id {{i.es_id}}
  evpn mh es-sys-mac {{i.es_sys_mac}}
  evpn mh uplink

{% endfor %}
{% endif %}
router bgp {% if asn is defined %}{{asn}}{% else %}{{ (1023 | random(seed=inventory_hostname)) + 64512 }}{% endif %}

  bgp router-id {{ router_ip }}
  bgp fast-convergence
  bgp bestpath compare-routerid
  bgp bestpath as-path multipath-relax
{% for i in interfaces|sort %}
  neighbor bgp{{i}} interface remote-as external
{% endfor %}
  address-family ipv4 unicast
{% for i in interfaces|sort %}
    neighbor bgp{{i}} route-map default in
{% endfor %}
    redistribute connected
{% if router_advertise is defined %}
{% for i in router_advertise|sort %}
    network {{i}}
{% endfor %}
{% endif %}
  address-family l2vpn evpn
{% for i in interfaces|sort %}
    neighbor bgp{{i}} activate
{% endfor %}
    advertise-all-vni

and netplan.yml:

network:
  version: 2
  renderer: networkd
  ethernets:
    lo:
      addresses:
        - {{router_ip}}/32
{% if anycast_addresses is defined %}
{% for i in anycast_addresses %}
        - {{i}}
{% endfor %}
{% endif %}
{% if vxlan_access is defined %}
{% for i in vxlan_access %}
    {{i.port}}:
      dhcp4: no
{% endfor %}
{% endif %}
{% for i in interfaces|sort %}
    {{i}}:
{% if l2_access is defined %}
      addresses: ["172.30.{{ 255 | random(seed=inventory_hostname+i)}}.{{ (8 | random(seed=inventory_hostname+i)*32) + 1}}/27"]
{% endif %}
      optional: true
      mtu: 9000
{% endfor %}
  vlans:
{% for i in interfaces|sort %}
    bgp{{i}}:
      link: {{i}}
      id: 10
{% endfor %}
{% if vxlans is defined %}
  tunnels:
{% for v in vxlans %}
    vxlan{{v}}:
      mode: vxlan
      id: {{v}}
      local: {{router_ip}}
      mac-learning: false
      mtu: 8950
{% endfor %}
{% endif %}
{% if vxlan_access is defined %}
  bonds:
{% for i in vxlan_access %}
    vxlan{{i.vni}}-access:
      macaddress: {{i.es_sys_mac}}
      dhcp4: no
      interfaces:
        - {{i.port}}
      parameters:
        mode: 802.3ad
{% endfor %}
{% endif %}
{% if l2_access is defined or vxlans is defined%}
  bridges:
{% endif %}
{% if l2_access is defined %}
    br-dhcpd:
      addresses: ["172.30.{{ 255 | random(seed=inventory_hostname)}}.{{ (8 | random(seed=inventory_hostname)*32) + 1}}/27"]
{% endif %}
{% if vxlans is defined %}
{% for v in vxlans %}
    br-vxlan{{v}}:
      interfaces:
        - vxlan{{v}}
{% if vxlan_access is defined %}
{% for i in vxlan_access %}
{% if i.vni == v %}
        - vxlan{{i.vni}}-access
{% endif %}
{% endfor %}
{% endif %}
{% endfor %}
{% endif %}

There is a lot of unrelated configuration in here. Ultimately, if the vxlan_access list is defined, each entry gets its bond, Ethernet Segment, etc. provisioned. This structure lets me provision physical access to multiple VXLANs if needed.
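
One of the less obvious bits in the netplan template is the address expression on each interface and on br-dhcpd. It picks a third octet and one of eight /27 blocks, then uses the first host address of that block. The sketch below takes the two draws as plain arguments instead of reproducing Ansible's seeded `random` filter, which I assume draws from 0 up to (not including) its input. `transit_address` and `fits_pattern` are names invented for this example.

```python
import ipaddress


def transit_address(third_octet: int, block: int) -> ipaddress.IPv4Interface:
    """Build 172.30.<third_octet>.<block * 32 + 1>/27, as the template does."""
    if not 0 <= third_octet < 255:
        raise ValueError("third_octet is assumed to be in 0..254")
    if not 0 <= block < 8:
        raise ValueError("block is assumed to be in 0..7")
    return ipaddress.ip_interface(f"172.30.{third_octet}.{block * 32 + 1}/27")


def fits_pattern(address: str) -> bool:
    """True if the address is the first host of a /27 inside 172.30.0.0/16."""
    iface = ipaddress.ip_interface(address)
    return (
        iface.network.prefixlen == 27
        and iface.ip in ipaddress.ip_network("172.30.0.0/16")
        and iface.ip == iface.network.network_address + 1
    )


print(transit_address(201, 4))           # 172.30.201.129/27, as on enp2s0f1
print(fits_pattern("172.30.11.1/27"))    # True, the br-dhcpd address
```

Every generated address in the rendered netplan.yml earlier on this page fits this pattern.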

These templates are driven by ‘DRY’ inventory vars:

spine:
  hosts:
    10.0.0.10:
      router_ip: 10.0.0.10
    10.0.0.11:
      router_ip: 10.0.0.11
  vars:
    anycast_addresses: []
    reserved_ports: [enp2s0f0]
    asn: 64512
    l2_access: True
    vxlans: [100]
    vxlan_access:
      - port: enp2s0f0
        vni: 100
        es_id: 10
        es_sys_mac: 00:00:10:00:00:10
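
To make the mapping from inventory to config concrete, here is a plain-Python stand-in for the first loop of the frr.conf template. It is not the Jinja2 rendering Ansible actually performs, just the same loop written out by hand; `render_es_stanzas` is a hypothetical name.

```python
# Same data as the vxlan_access list in the inventory above.
vxlan_access = [
    {"port": "enp2s0f0", "vni": 100, "es_id": 10, "es_sys_mac": "00:00:10:00:00:10"},
]


def render_es_stanzas(entries):
    """Emit one FRR interface stanza per vxlan_access entry."""
    lines = []
    for entry in entries:
        lines += [
            f"interface vxlan{entry['vni']}-access",
            f"  evpn mh es-id {entry['es_id']}",
            f"  evpn mh es-sys-mac {entry['es_sys_mac']}",
            "  evpn mh uplink",
            "",  # blank line between stanzas, as in the template
        ]
    return "\n".join(lines)


print(render_es_stanzas(vxlan_access))
```

Adding a second entry to the list (another port, VNI, and es_id) would yield a second stanza, plus the matching bond and bridge member from the netplan template.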

Nathan Hensel

on caving, mountaineering, networking, computing, electronics


2023-08-07