While BGP-to-the-host solves a lot of classic datacenter networking problems, it comes with a catch: if we’re using some sort of config management to manage our network (as we should), we’ve also created a kill switch for the entire network should a bad change make its way through. So, in addition to a carefully thought-out deployment strategy, some sort of isolated regression/integration testing is a basic requirement.
To achieve this I’ve written bgp-unnumbered/testnet/ to deploy and test the bgp-unnumbered role on LXD containers. We’re not simply ‘testing the playbook’: the containers are wired into a spine-leaf network using linux bridges as point-to-point l2 segments, and once provisioned we run ‘real’ networking tests against them.
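testnet drives the LXD API to do all of this, but the plumbing behind a single leaf-spine link is small. As a rough sketch of the equivalent lxc CLI calls from python (instance and device names here are just examples, not what the tool actually generates):

import subprocess

def lxc(*args):
    # thin wrapper around the lxc CLI; the real tool talks to the LXD API instead
    subprocess.run(["lxc", *args], check=True)

# one point-to-point l2 segment between a leaf and a spine
bridge = "109d5-414f3"
lxc("network", "create", bridge, "ipv4.address=none", "ipv6.address=none", "bridge.mtu=9000")
lxc("config", "device", "add", "bgp-leaf-109d5", "uplink0", "nic", "nictype=bridged", "parent=" + bridge)
lxc("config", "device", "add", "bgp-spine-414f3", "downlink0", "nic", "nictype=bridged", "parent=" + bridge)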
usage
First we’ll create the network. -s2 -l3 means two spines and three leafs. The tool will generate an ssh key, create the instances, connect each leaf directly to each spine, generate an inventory, and run ansible-playbook:
[nix-shell:~/git/bgp-unnumbered]$ python3 testnet --create -s2 -l3
waiting for lxd agent to become ready on bgp-spine-414f3
waiting for lxd agent to become ready on bgp-leaf-109d5
waiting for lxd agent to become ready on bgp-leaf-65255
creating network 109d5-414f3
creating network 109d5-7a71f
creating network 65255-414f3
creating network 65255-7a71f
creating network 878ad-414f3
creating network 878ad-7a71f
waiting for lxd agent to become ready on bgp-spine-414f3
waiting for lxd agent to become ready on bgp-spine-7a71f
waiting for lxd agent to become ready on bgp-leaf-109d5
waiting for lxd agent to become ready on bgp-leaf-65255
waiting for valid ipv4 address on bgp-spine-414f3
waiting for valid ipv4 address on bgp-spine-7a71f
waiting for valid ipv4 address on bgp-leaf-109d5
waiting for valid ipv4 address on bgp-leaf-65255
waiting for valid ipv4 address on bgp-leaf-878ad
PLAY [all] *********************************************************************
TASK [Gathering Facts] *********************************************************
ok: [10.64.212.236]
ok: [10.64.212.123]
ok: [10.64.212.102]
ok: [10.64.212.25]
ok: [10.64.212.196]
TASK [frr : Install debian tools] **********************************************
changed: [10.64.212.236]
changed: [10.64.212.102]
changed: [10.64.212.25]
changed: [10.64.212.196]
changed: [10.64.212.123]
TASK [frr : install isc-dhcp-server] *******************************************
skipping: [10.64.212.102]
skipping: [10.64.212.123]
skipping: [10.64.212.25]
skipping: [10.64.212.236]
skipping: [10.64.212.196]
TASK [frr : Reconcile interfaces to configure] *********************************
ok: [10.64.212.102] => (item=eth2935)
skipping: [10.64.212.102] => (item=lo)
ok: [10.64.212.123] => (item=eth0)
ok: [10.64.212.102] => (item=eth1739)
ok: [10.64.212.123] => (item=eth9077)
ok: [10.64.212.102] => (item=eth9077)
ok: [10.64.212.25] => (item=eth7369)
ok: [10.64.212.102] => (item=eth0)
TASK [frr : Enable frr] ********************************************************
ok: [10.64.212.102]
ok: [10.64.212.123]
ok: [10.64.212.25]
ok: [10.64.212.236]
ok: [10.64.212.196]
-- truncated --
TASK [frr : Reboot] ************************************************************
changed: [10.64.212.196]
changed: [10.64.212.25]
changed: [10.64.212.236]
changed: [10.64.212.102]
changed: [10.64.212.123]
PLAY RECAP *********************************************************************
10.64.212.102 : ok=15 changed=11 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
10.64.212.123 : ok=15 changed=11 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
10.64.212.196 : ok=15 changed=11 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
10.64.212.236 : ok=15 changed=11 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
10.64.212.25 : ok=15 changed=11 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
environment created. follow-up configuration can be performed with:
ansible-playbook testnet.yml -i virtual.inventory
Notice it gave us a command hint to run ansible again without reprovisioning anything. The generated ssh key is linked within virtual.inventory, so we never have to think about authentication.
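One way an inventory can carry that linkage is ansible’s per-host ansible_ssh_private_key_file variable; purely for illustration (group name, addresses, and key path are made up):

[routers]
10.64.212.102 ansible_ssh_private_key_file=./testnet_ed25519
10.64.212.123 ansible_ssh_private_key_file=./testnet_ed25519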
It’s also created a bunch of bridges on our host (through the LXD api):
615: 109d5-414f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 00:16:3e:dc:4f:a9 brd ff:ff:ff:ff:ff:ff
616: 109d5-414f3-mtu: <BROADCAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc noqueue master 109d5-414f3 state UNKNOWN group default qlen 1000
link/ether 06:d1:f8:ea:79:34 brd ff:ff:ff:ff:ff:ff
inet6 fe80::1c24:6bff:fe9a:d724/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
617: 109d5-7a71f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 00:16:3e:4e:54:1b brd ff:ff:ff:ff:ff:ff
618: 109d5-7a71f-mtu: <BROADCAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc noqueue master 109d5-7a71f state UNKNOWN group default qlen 1000
link/ether 32:ca:56:01:d1:cb brd ff:ff:ff:ff:ff:ff
inet6 fe80::30ca:56ff:fe01:d1cb/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
619: 65255-414f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 00:16:3e:bb:bd:b5 brd ff:ff:ff:ff:ff:ff
620: 65255-414f3-mtu: <BROADCAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc noqueue master 65255-414f3 state UNKNOWN group default qlen 1000
link/ether 1a:fa:bf:f8:cf:ce brd ff:ff:ff:ff:ff:ff
inet6 fe80::18fa:bfff:fef8:cfce/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
109d5-414f3 connects bgp-leaf-109d5 to bgp-spine-414f3, and so on.
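In other words, each bridge name is just the two instance-id suffixes glued together, something like:

leaf, spine = "bgp-leaf-109d5", "bgp-spine-414f3"
bridge = "{}-{}".format(leaf.split("-")[-1], spine.split("-")[-1])  # "109d5-414f3"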
testing
With the network provisioned we can run our tests. This is done using pytest wrapped with tox.
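A tox.ini producing the lint and py311 environments seen below looks roughly like this (a sketch - the real file may pin dependencies and details differently):

[tox]
envlist = lint, py311
skipsdist = true

[testenv:lint]
deps =
    black
    flake8
commands =
    black testnet
    flake8 testnet --ignore E501

[testenv:py311]
deps =
    pytest
commands =
    pytest tests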
[nix-shell:~/git/bgp-unnumbered]$ tox
lint: commands[0]> black testnet
All done! ✨ 🍰 ✨
4 files left unchanged.
lint: commands[1]> flake8 testnet --ignore E501
lint: OK ✔ in 0.25 seconds
py311: commands[0]> pytest tests
============================================================================================================================= test session starts =============================================================================================================================
platform linux -- Python 3.11.6, pytest-7.4.4, pluggy-1.3.0
cachedir: .tox/py311/.pytest_cache
rootdir: /home/nhensel/git/bgp-unnumbered
collected 3 items
tests/test_bandwidth.py . [ 33%]
tests/test_ecmp.py . [ 66%]
tests/test_icmp.py . [100%]
======================================================================================================================== 3 passed, 1 warning in 13.24s ========================================================================================================================
lint: OK (0.25=setup[0.03]+cmd[0.10,0.12] seconds)
py311: OK (13.39=setup[0.00]+cmd[13.39] seconds)
congratulations :) (13.70 seconds)
Currently I have three tests: bandwidth, ecmp, and icmp.
tests/test_ecmp.py ensures that equal-cost multipathing is working. Each leaf should have n routes to each other leaf, where n is the number of spines:
# each usable spine path appears as one ipv6 link-local nexthop in the route output
routes = err.stdout.count("via inet6 fe80")
print("{} has {} routes to {}".format(i.name, routes, j.name))
assert routes == n_spines
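For context, the whole check fits in a few lines. Here’s a self-contained sketch of the same idea (hypothetical instance names and loopback addresses; it shells out to lxc, where the real test may drive LXD differently):

import itertools
import subprocess

def ecmp_routes(leaf, target):
    # count the ipv6 link-local nexthops in the leaf's route to the target loopback;
    # with bgp unnumbered each usable spine path shows up as one "via inet6 fe80..." entry
    out = subprocess.run(
        ["lxc", "exec", leaf, "--", "ip", "route", "show", target],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.count("via inet6 fe80")

n_spines = 2
loopbacks = {"bgp-leaf-109d5": "10.0.200.136", "bgp-leaf-65255": "10.0.200.123"}  # hypothetical
for i, j in itertools.permutations(loopbacks, 2):
    routes = ecmp_routes(i, loopbacks[j])
    print("{} has {} routes to {}".format(i, routes, j))
    assert routes == n_spines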
If I shut down frr on one of the spines, tox will fail:
> assert routes == n_spines
E assert 1 == 2
tests/test_ecmp.py:36: AssertionError
---------------------------------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------------------------------
bgp-leaf-109d5 has 1 routes to bgp-leaf-65255
There are numerous ways to break ecmp within the frr or kernel configuration, and this test ensures we notice if we do.
I originally wrote this to use KVM VMs for the test nodes, but LXD containers turn out to be a very good fit - they have ‘real’ network stacks and they’re blazing fast.
tests/test_bandwidth.py ensures each leaf gets 50 Gbit/s of throughput to each other leaf. This is done by starting an iperf server on each leaf and running iperf -c from each leaf to each other leaf. The 50 is admittedly pretty arbitrary, as it will of course be more or less on different hardware.
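Stripped to its essentials the test looks something like this sketch (hypothetical names and addresses, illustrative iperf flags; the real test may structure things differently):

import itertools
import re
import subprocess

loopbacks = {"bgp-leaf-2854a": "10.0.200.136", "bgp-leaf-32ea5": "10.0.200.123"}  # hypothetical

# start an iperf server on every leaf, then measure every leaf pair
for leaf in loopbacks:
    subprocess.run(["lxc", "exec", leaf, "--", "iperf", "-s", "-D"], check=True)

for client, server in itertools.permutations(loopbacks, 2):
    out = subprocess.run(
        ["lxc", "exec", client, "--", "iperf", "-c", loopbacks[server], "-t", "2", "-P", "2"],
        capture_output=True, text=True, check=True,
    )
    # with -P the aggregate [SUM] line is printed last; insist on >= 50 Gbit/s
    gigabits = re.findall(r"([\d.]+) Gbits/sec", out.stdout)
    assert gigabits and float(gigabits[-1]) >= 50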
This test is indeed effective at exposing something like an mtu mismatch. If we force routers with 9000-mtu interfaces to communicate over a 6000-mtu point-to-point link, we get significantly less throughput:
ip a show 2854a-6a3ba
717: 2854a-6a3ba: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 6000 qdisc noqueue state UP group default qlen 1000
link/ether 00:16:3e:26:9f:c3 brd ff:ff:ff:ff:ff:ff
root@bgp-leaf-2854a:~# ip a show bgpeth0
2: bgpeth0@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 00:16:3e:f4:79:45 brd ff:ff:ff:ff:ff:ff
inet6 fe80::216:3eff:fef4:7945/64 scope link
valid_lft forever preferred_lft forever
[nix-shell:~/git/bgp-unnumbered]$ tox
...
tests/test_bandwidth.py F [ 33%]
tests/test_ecmp.py . [ 66%]
tests/test_icmp.py . [100%]
...
if len(gigabits) == 0:
> raise ValueError(
"error fetching iperf output. is the bandwidth < 1 Gbit?"
)
E ValueError: error fetching iperf output. is the bandwidth < 1 Gbit?
tests/test_bandwidth.py:52: ValueError
------------------------------------------------- Captured stdout call --------------------------------------------------
iperf: bgp-leaf-2854a -> bgp-leaf-32ea5
------------------------------------------------------------
Client connecting to 10.0.200.123, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 1] local 10.0.200.136 port 38410 connected with 10.0.200.123 port 5001 (icwnd/mss/irtt=87/8948/24)
[ 2] local 10.0.200.136 port 38400 connected with 10.0.200.123 port 5001 (icwnd/mss/irtt=87/8948/36)
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-2.1363 sec 699 KBytes 2.68 Mbits/sec
[ 2] 0.0000-2.1362 sec 682 KBytes 2.61 Mbits/sec
[SUM] 0.0000-0.1014 sec 1.35 MBytes 112 Mbits/sec
In fact, our regex failed because it didn’t find any ‘Gbits’ whatsoever.
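Concretely, a match along the lines of what the test uses comes back empty when iperf reports in Mbits, which is exactly what trips the ValueError above:

import re
sample = "[SUM] 0.0000-0.1014 sec  1.35 MBytes   112 Mbits/sec"
re.findall(r"([\d.]+) Gbits/sec", sample)  # -> []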