This was a quick experiment I did to validate the possibility of using cheap, low-power-consumption ‘unmanaged’ switches as spines for multipath connectivity between linux hosts.
This is possible with OSPF thanks to its ‘broadcast’ mode. In short, we just have to allocate, say, a /27 network for the ‘blue’ spine and a /27 network for the ‘yellow’ spine, and assign all the participating hosts addresses in those networks on the appropriate interfaces. They should then be able to peer with each other. That is to say, we do not have to define specific peer relationships in our frr configs on our servers.
This is neat, because every line of configuration has a cost. With this approach we do not have to configure the switches at all, the server configuration stays small, and we get equal-cost multipath (ECMP) routing over cheap hardware.
I am templating and deploying these configs with my configuration management tool nixa. Here’s the config relevant for the servers:
networking = {
  hostName = "{{attrs.hostname}}";
  defaultGateway = "172.30.190.1";
  nameservers = [ "1.1.1.1" ];
  enableIPv6 = false;
  firewall.enable = false;
  dhcpcd.enable = false;
  interfaces.enp3s0.ipv4.addresses = [{
    address = "{{hostvars["static_ip"]}}";
    prefixLength = 24;
  }];
  interfaces.lo.ipv4.addresses = [{
    address = "{{hostvars["loopback"]}}";
    prefixLength = 32;
  }];
  interfaces.enp0s20f2.ipv4.addresses = [{
    address = "{{hostvars["blue"]}}";
    prefixLength = 27;
  }];
  interfaces.enp0s20f3.ipv4.addresses = [{
    address = "{{hostvars["yellow"]}}";
    prefixLength = 27;
  }];
};
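boot.kernel.sysctl = {
  # Forward packets between interfaces.
  "net.ipv4.conf.all.forwarding" = 1;
  # Hash on layer 4 (addresses and ports) so separate TCP streams can take different paths.
  "net.ipv4.fib_multipath_hash_policy" = 1;
  # Take neighbor state into account when choosing among multipath nexthops.
  "net.ipv4.fib_multipath_use_neigh" = 1;
};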
services.frr = {
  ospfd.enable = true;
  config = ''
    log syslog
    debug ospf event
    frr defaults datacenter
    int enp0s20f2
     ip ospf area 0
     ip ospf network broadcast
    int enp0s20f3
     ip ospf area 0
     ip ospf network broadcast
    router ospf
     ospf router-id {{hostvars["loopback"]}}
     redistribute connected
     redistribute kernel
     network 10.0.1.0/27 area 0
     network 10.0.1.32/27 area 0
  '';
};
Here’s the inventory being used with nixa:
ospf:
  hosts:
    172.30.190.201:
      static_ip: 172.30.190.201
      loopback: 10.0.0.1
      blue: 10.0.1.1
      yellow: 10.0.1.33
    172.30.190.202:
      static_ip: 172.30.190.202
      loopback: 10.0.0.2
      blue: 10.0.1.2
      yellow: 10.0.1.34
  templates:
    - ospf.nix
  nix-channel: nixos-24.11
Each node has a management interface, enp3s0, that is not participating in OSPF. Each node also has a /32 loopback address of either 10.0.0.1/32 or 10.0.0.2/32.
enp0s20f2 on each node is plugged into the ‘blue’ switch and given an IP under 10.0.1.0/27.
enp0s20f3 on each node is plugged into the ‘yellow’ switch and given an IP under 10.0.1.32/27.
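The addressing scheme is regular enough to generate. Here is a minimal Python sketch, assuming the pattern seen in the two-host inventory continues (host n gets loopback 10.0.0.n, blue 10.0.1.n and yellow 10.0.1.(32+n)). The function name `host_addresses` is made up for illustration and is not part of nixa.

```python
import ipaddress

# Assumed layout, extrapolated from the two hosts in the inventory:
# host n gets loopback 10.0.0.n, blue 10.0.1.n and yellow 10.0.1.(32+n).
LOOPBACKS = ipaddress.ip_network("10.0.0.0/24")
BLUE = ipaddress.ip_network("10.0.1.0/27")
YELLOW = ipaddress.ip_network("10.0.1.32/27")


def host_addresses(n: int) -> dict[str, str]:
    """Return the loopback, blue and yellow addresses for host number n (1-based)."""
    usable = BLUE.num_addresses - 2  # exclude the network and broadcast addresses
    if not 1 <= n <= usable:
        raise ValueError(f"host number must be between 1 and {usable}")
    return {
        "loopback": str(LOOPBACKS.network_address + n),
        "blue": str(BLUE.network_address + n),
        "yellow": str(YELLOW.network_address + n),
    }


for n in (1, 2):
    print(n, host_addresses(n))
```

For hosts 1 and 2 this reproduces the values in the inventory, so the output could be pasted into it when adding more nodes.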
We are using small /27 networks so as not to waste IPs. A /27 has 30 usable addresses, which is enough for spine switches with up to 24 ports. A 48-port switch will need a /26.
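The sizing arithmetic can be checked with a few lines of Python. This is a sketch that assumes every switch port is occupied by one host and that each host needs one address in the spine's subnet. `smallest_prefix_for` is a made-up helper name.

```python
import ipaddress


def smallest_prefix_for(ports: int) -> int:
    """Return the longest IPv4 prefix length with at least `ports` usable host addresses."""
    for prefix in range(30, 0, -1):
        usable = 2 ** (32 - prefix) - 2  # minus the network and broadcast addresses
        if usable >= ports:
            return prefix
    raise ValueError("too many ports for an IPv4 subnet")


for ports in (24, 48):
    prefix = smallest_prefix_for(ports)
    net = ipaddress.ip_network(f"10.0.1.0/{prefix}")
    print(f"{ports}-port switch: /{prefix} ({net.num_addresses - 2} usable addresses)")
```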
With that all in place, we should be able to see neighbors form:
[root@50ae2383-89c0-547c-9f56-5df657915a83:~]# vtysh <<<'show ip ospf neigh'
Hello, this is FRRouting (version 10.1).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
50ae2383-89c0-547c-9f56-5df657915a83# show ip ospf neigh
Neighbor ID Pri State Up Time Dead Time Address Interface RXmtL RqstL DBsmL
10.0.0.2 1 Full/Backup 35m12s 36.223s 10.0.1.2 enp0s20f2:10.0.1.1 0 0 0
10.0.0.2 1 Full/DR 36m19s 35.676s 10.0.1.34 enp0s20f3:10.0.1.33 0 0 0
And routes to the /32 loopback addresses we assigned under 10.0.0.0/24:
[root@50ae2383-89c0-547c-9f56-5df657915a83:~]# ip r
default via 172.30.190.1 dev enp3s0 proto static
10.0.0.2 nhid 73 proto ospf metric 20
	nexthop via 10.0.1.34 dev enp0s20f3 weight 1
	nexthop via 10.0.1.2 dev enp0s20f2 weight 1
10.0.1.0/27 dev enp0s20f2 proto kernel scope link src 10.0.1.1
10.0.1.32/27 dev enp0s20f3 proto kernel scope link src 10.0.1.33
172.30.190.0/24 dev enp3s0 proto kernel scope link src 172.30.190.201
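The interesting part of that output is the pair of nexthop lines under 10.0.0.2. If you wanted to check for this automatically, a rough Python sketch could parse the plain-text output of `ip r`. The parser below only understands the layout shown above, and `ecmp_nexthops` is a hypothetical helper, not something from iproute2.

```python
def ecmp_nexthops(ip_route_output: str, dest: str) -> list[tuple[str, str]]:
    """Return (gateway, device) pairs listed as multipath nexthops for `dest`."""
    nexthops = []
    in_dest = False
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "nexthop":
            # e.g. "nexthop via 10.0.1.34 dev enp0s20f3 weight 1"
            if in_dest and "via" in fields and "dev" in fields:
                gateway = fields[fields.index("via") + 1]
                device = fields[fields.index("dev") + 1]
                nexthops.append((gateway, device))
        else:
            # Any line that is not a nexthop starts a new route entry.
            in_dest = fields[0] == dest
    return nexthops


# Copied from the `ip r` output in this post.
SAMPLE = """\
default via 172.30.190.1 dev enp3s0 proto static
10.0.0.2 nhid 73 proto ospf metric 20
nexthop via 10.0.1.34 dev enp0s20f3 weight 1
nexthop via 10.0.1.2 dev enp0s20f2 weight 1
10.0.1.0/27 dev enp0s20f2 proto kernel scope link src 10.0.1.1
"""

print(ecmp_nexthops(SAMPLE, "10.0.0.2"))
```

Fewer than two entries in the result would mean the route to that loopback is not currently multipath, for example while one spine is down.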
And finally, if everything is working right, we will see double the bandwidth, given enough TCP connections:
[root@50ae2383-89c0-547c-9f56-5df657915a83:~]# iperf -c 10.0.0.2 -P4 -t1
Connecting to host 10.0.0.2, port 5201
[ 5] local 10.0.1.1 port 46562 connected to 10.0.0.2 port 5201
[ 7] local 10.0.1.1 port 46578 connected to 10.0.0.2 port 5201
[ 9] local 10.0.1.1 port 46580 connected to 10.0.0.2 port 5201
[ 11] local 10.0.1.1 port 46584 connected to 10.0.0.2 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 114 MBytes 953 Mbits/sec 0 410 KBytes
[ 7] 0.00-1.00 sec 57.2 MBytes 479 Mbits/sec 0 284 KBytes
[ 9] 0.00-1.00 sec 29.0 MBytes 243 Mbits/sec 0 228 KBytes
[ 11] 0.00-1.00 sec 29.0 MBytes 243 Mbits/sec 0 226 KBytes
[SUM] 0.00-1.00 sec 229 MBytes 1.92 Gbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-1.00 sec 114 MBytes 953 Mbits/sec 0 sender
[ 5] 0.00-1.00 sec 112 MBytes 940 Mbits/sec receiver
[ 7] 0.00-1.00 sec 57.2 MBytes 479 Mbits/sec 0 sender
[ 7] 0.00-1.00 sec 56.1 MBytes 469 Mbits/sec receiver
[ 9] 0.00-1.00 sec 29.0 MBytes 243 Mbits/sec 0 sender
[ 9] 0.00-1.00 sec 28.0 MBytes 234 Mbits/sec receiver
[ 11] 0.00-1.00 sec 29.0 MBytes 243 Mbits/sec 0 sender
[ 11] 0.00-1.00 sec 28.1 MBytes 235 Mbits/sec receiver
[SUM] 0.00-1.00 sec 229 MBytes 1.92 Gbits/sec 0 sender
[SUM] 0.00-1.00 sec 225 MBytes 1.88 Gbits/sec receiver
iperf Done.
Like with LACP, this doesn’t magically double the speed of a single connection, but multiple streams do get load balanced over the multiple paths.
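The reason is that the path is chosen per flow, not per packet. With `fib_multipath_hash_policy` set to 1, as in the config above, the kernel hashes the addresses and ports of each flow to pick a nexthop. The Python sketch below illustrates the idea only: it uses SHA-256 as a stand-in and is not the kernel's actual hash function, so it will not necessarily reproduce the split seen in the iperf run. The iperf numbers are at least consistent with uneven hashing: one stream got about 953 Mbits/sec to itself, while the other three together add up to about 965 Mbits/sec, as if they shared the other link.

```python
import hashlib
from collections import Counter

NEXTHOPS = ["enp0s20f2 (blue)", "enp0s20f3 (yellow)"]


def pick_nexthop(src_ip: str, src_port: int, dst_ip: str, dst_port: int, proto: str = "tcp") -> str:
    """Map a flow's 5-tuple to one nexthop. Illustrative hash, NOT the kernel's."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return NEXTHOPS[int.from_bytes(digest[:4], "big") % len(NEXTHOPS)]


# The four source ports from the iperf run in this post.
flows = [("10.0.1.1", port, "10.0.0.2", 5201) for port in (46562, 46578, 46580, 46584)]
print(Counter(pick_nexthop(*flow) for flow in flows))

# The same flow always maps to the same link, so one stream never exceeds one link's speed.
assert pick_nexthop(*flows[0]) == pick_nexthop(*flows[0])
```

With only a handful of flows the split can easily be lopsided. The more concurrent connections there are, the closer the load gets to an even spread.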
I’m only using cheap 1000BASE-T switches here, and I’m only demonstrating this with two nodes. However, one might imagine putting some dual-port NICs in a bunch of servers, finding the cheapest possible unmanaged 10GBASE-T switches, and getting effectively 20 Gbit/s between many hosts on the cheap.
limitations
Unfortunately, ospf on frr seems to lack some sort of equivalent to frr bgp’s bgp fast-convergence option. If I power cycle a switch, it takes about a minute for the route to get re-established. This is fine as long as you keep this delay in mind when moving cables around or rebooting things.
The bigger caveat perhaps has to do with what address ‘client’ programs on either node choose to bind to when speaking to other nodes.
If I simply run ssh 10.0.0.2 from the node whose loopback is 10.0.0.1, the connection arrives from 10.0.1.1:
sshd-session[3027]: Accepted publickey for root from 10.0.1.1 port 36368
This is a problem, because 10.0.1.1 is the client node’s link to the spine! That is to say, if that spine gets power cycled, 10.0.1.1 (the interface) will go down too, and the TCP connection and the ssh session will be dropped.
We want things to bind to the loopback addresses, not the interfaces. We can do this with ssh:
ssh 10.0.0.2 -b 10.0.0.1
This does indeed survive power cycling of each switch, no TCP dropped.
The problem here is that we will have to figure out how to achieve the same functionality with all the software using the link. Catching every instance of this is going to be quite the chore.
For example, NFS has srcaddr.
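For software you write yourself, the equivalent of `ssh -b` is to bind the socket to the loopback address before connecting. Here is a minimal Python sketch using the standard library's `socket.create_connection` and its `source_address` parameter. The helper name `connect_from` is made up, and the demo runs against 127.0.0.1 so that it works on any machine. On the nodes in this post the call would be `connect_from("10.0.0.1", "10.0.0.2", 22)`.

```python
import socket


def connect_from(source_ip: str, dest_ip: str, dest_port: int, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection to (dest_ip, dest_port) using source_ip as the local address."""
    # Port 0 lets the kernel pick an ephemeral source port.
    return socket.create_connection(
        (dest_ip, dest_port), timeout=timeout, source_address=(source_ip, 0)
    )


# Self-contained demo on the local machine: a listener on 127.0.0.1 and a
# client that explicitly binds its source address before connecting.
with socket.create_server(("127.0.0.1", 0)) as server:
    port = server.getsockname()[1]
    with connect_from("127.0.0.1", "127.0.0.1", port) as client:
        peer, peer_addr = server.accept()
        with peer:
            print("server sees client at", peer_addr[0])
```

Whether a given program exposes such an option is the real problem, which is why a system-wide default would be so much nicer.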
One of the larger items on my overall wishlist for linux is a way to set a ‘default source address for everything’. In the routing world, it is normal for a router’s primary ‘identity’ to be synonymous with its /32 loopback address, not whatever happens to be assigned to its various interfaces. It is a somewhat unfortunate turn of history that this is not also the norm for servers.
As a final note - this caveat exposes yet another convenience of bgp-unnumbered on linux - if we literally only have one address on systems with arbitrarily many interfaces - and that is the /32 loopback address - programs have no choice but to bind to that, and we don’t run into this issue.