Dumb Switches as Multipath Spines with OSPF on NixOS

This was a quick experiment to check whether cheap, low-power ‘unmanaged’ switches can be used as spines for multipath connectivity between Linux hosts.

This is possible with OSPF thanks to its ‘broadcast’ network type. In short, we allocate, say, a /27 network for the ‘blue’ spine and another /27 for the ‘yellow’ spine, assign every participating host an address in each network on the appropriate interface, and the hosts automatically peer with each other. In other words, we do not have to define specific peer relationships in the FRR configs on our servers.

This is neat, because every line of configuration has a cost. With this approach we do not have to configure the switches, we need very little configuration on the servers, and we get equal-cost multipath (ECMP) over cheap hardware. While I’m only demonstrating this with 1000BASE-T, it could be quite useful with unmanaged 2.5G or 10G switches.

The blue boxes and the trendnet switches are the devices under test.

I am templating and deploying these configs with my configuration management tool, nixa. Here’s the config relevant to the servers:

  networking = {
    hostName = "{{attrs.hostname}}";
    defaultGateway = "172.30.190.1";
    nameservers = [ "1.1.1.1" ];
    enableIPv6 = false;
    firewall.enable = false;
    dhcpcd.enable = false;
    interfaces.enp3s0.ipv4.addresses = [{
      address = "{{hostvars["static_ip"]}}";
      prefixLength = 24;
    }];
    interfaces.lo.ipv4.addresses = [{
      address = "{{hostvars["loopback"]}}";
      prefixLength = 32;
    }];
    interfaces.enp0s20f2.ipv4.addresses = [{
      address = "{{hostvars["blue"]}}";
      prefixLength = 27;
    }];
    interfaces.enp0s20f3.ipv4.addresses = [{
      address = "{{hostvars["yellow"]}}";
      prefixLength = 27;
    }];
  };

  boot.kernel.sysctl = {
    "net.ipv4.conf.all.forwarding" = 1;
    "net.ipv4.fib_multipath_hash_policy" = 1;
    "net.ipv4.fib_multipath_use_neigh" = 1;
  };

  services.frr = {
    ospfd.enable = true;
    config = ''
      log syslog
      debug ospf event
      frr defaults datacenter

      int enp0s20f2
        ip ospf area 0
        ip ospf network broadcast

      int enp0s20f3
        ip ospf area 0
        ip ospf network broadcast

      router ospf
        ospf router-id {{hostvars["loopback"]}}
        redistribute connected
        redistribute kernel
        network 10.0.1.0/27 area 0
        network 10.0.1.32/27 area 0
    '';
  };

Here’s the inventory being used with nixa:

ospf:
  hosts:
    172.30.190.201:
      static_ip: 172.30.190.201
      loopback: 10.0.0.1
      blue: 10.0.1.1
      yellow: 10.0.1.33
    172.30.190.202:
      static_ip: 172.30.190.202
      loopback: 10.0.0.2
      blue: 10.0.1.2
      yellow: 10.0.1.34
  templates:
    - ospf.nix
  nix-channel: nixos-24.11

Each node has a management interface, enp3s0, that does not participate in OSPF. Each node also has a /32 loopback address, either 10.0.0.1/32 or 10.0.0.2/32.

enp0s20f2 on each node is plugged into the ‘blue’ switch and given an IP address in 10.0.1.0/27.

enp0s20f3 on each node is plugged into the ‘yellow’ switch and given an IP address in 10.0.1.32/27.

We are using small /27 networks so as not to waste IP addresses. A /27 has 30 usable host addresses, which is enough for spine switches of up to 24 ports. A 48-port switch will need a /26.
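
The subnet arithmetic above can be checked with a short Python sketch using the standard ipaddress module. The helper names (usable_hosts, smallest_prefix, host_addresses) are made up for illustration, and the numbering scheme in host_addresses is only an assumption inferred from the inventory, where host n takes the n-th address of each network.

```python
import ipaddress

# The two spine networks from this post are consecutive /27s inside 10.0.1.0/24.
SPINES = list(ipaddress.ip_network("10.0.1.0/24").subnets(new_prefix=27))[:2]


def usable_hosts(prefix_len: int) -> int:
    """Usable host addresses in an IPv4 subnet (network and broadcast excluded)."""
    return 2 ** (32 - prefix_len) - 2


def smallest_prefix(ports: int) -> int:
    """Longest prefix whose subnet still has one usable address per switch port."""
    for prefix_len in range(30, 0, -1):
        if usable_hosts(prefix_len) >= ports:
            return prefix_len
    raise ValueError("port count does not fit in an IPv4 subnet")


def host_addresses(n: int) -> dict:
    """Assumed scheme: host n takes the n-th address of each network."""
    blue, yellow = SPINES
    return {
        "loopback": str(ipaddress.ip_address("10.0.0.0") + n),
        "blue": str(blue.network_address + n),
        "yellow": str(yellow.network_address + n),
    }
```

Under that assumption, host_addresses(1) and host_addresses(2) reproduce the two inventory entries, and smallest_prefix gives /27 for a 24-port switch and /26 for a 48-port one.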

With that all in place, we should be able to see neighbors form:

[root@50ae2383-89c0-547c-9f56-5df657915a83:~]# vtysh <<<'show ip ospf neigh'

Hello, this is FRRouting (version 10.1).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

50ae2383-89c0-547c-9f56-5df657915a83# show ip ospf neigh

Neighbor ID     Pri State           Up Time         Dead Time Address         Interface                        RXmtL RqstL DBsmL
10.0.0.2          1 Full/Backup     35m12s            36.223s 10.0.1.2        enp0s20f2:10.0.1.1                   0     0     0
10.0.0.2          1 Full/DR         36m19s            35.676s 10.0.1.34       enp0s20f3:10.0.1.33                  0     0     0

And routes to the /32 loopback addresses we assigned under 10.0.0.0/24:

[root@50ae2383-89c0-547c-9f56-5df657915a83:~]# ip r
default via 172.30.190.1 dev enp3s0 proto static
10.0.0.2 nhid 73 proto ospf metric 20
	nexthop via 10.0.1.34 dev enp0s20f3 weight 1
	nexthop via 10.0.1.2 dev enp0s20f2 weight 1
10.0.1.0/27 dev enp0s20f2 proto kernel scope link src 10.0.1.1
10.0.1.32/27 dev enp0s20f3 proto kernel scope link src 10.0.1.33
172.30.190.0/24 dev enp3s0 proto kernel scope link src 172.30.190.201
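
If you want to script this check rather than eyeball it, something like the following Python sketch could work. It parses the plain-text output of ip r in the shape shown above and counts the nexthop lines under each route. The function names are hypothetical, and the parser only assumes the layout seen in this transcript; it is not a general parser for every ip route format.

```python
def nexthops_by_route(ip_route_output: str) -> dict:
    """Map each route's destination to the (gateway, device) pairs of its nexthop lines."""
    routes = {}
    current = None
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "nexthop":
            # Continuation line: "nexthop via <gateway> dev <interface> weight <n>"
            if current is not None and "via" in fields and "dev" in fields:
                gateway = fields[fields.index("via") + 1]
                device = fields[fields.index("dev") + 1]
                routes[current].append((gateway, device))
        else:
            # A new route starts; its first field is the destination.
            current = fields[0]
            routes[current] = []
    return routes


def is_multipath(ip_route_output: str, destination: str, paths: int = 2) -> bool:
    """True if the route to destination lists at least `paths` next hops."""
    return len(nexthops_by_route(ip_route_output).get(destination, [])) >= paths
```

Fed the output above, is_multipath(output, "10.0.0.2") should hold only while both spines are up.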

And finally, if everything is working right, we will see double the bandwidth, given enough TCP connections:

[root@50ae2383-89c0-547c-9f56-5df657915a83:~]# iperf -c 10.0.0.2 -P4 -t1
Connecting to host 10.0.0.2, port 5201
[  5] local 10.0.1.1 port 46562 connected to 10.0.0.2 port 5201
[  7] local 10.0.1.1 port 46578 connected to 10.0.0.2 port 5201
[  9] local 10.0.1.1 port 46580 connected to 10.0.0.2 port 5201
[ 11] local 10.0.1.1 port 46584 connected to 10.0.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   114 MBytes   953 Mbits/sec    0    410 KBytes
[  7]   0.00-1.00   sec  57.2 MBytes   479 Mbits/sec    0    284 KBytes
[  9]   0.00-1.00   sec  29.0 MBytes   243 Mbits/sec    0    228 KBytes
[ 11]   0.00-1.00   sec  29.0 MBytes   243 Mbits/sec    0    226 KBytes
[SUM]   0.00-1.00   sec   229 MBytes  1.92 Gbits/sec    0
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-1.00   sec   114 MBytes   953 Mbits/sec    0             sender
[  5]   0.00-1.00   sec   112 MBytes   940 Mbits/sec                  receiver
[  7]   0.00-1.00   sec  57.2 MBytes   479 Mbits/sec    0             sender
[  7]   0.00-1.00   sec  56.1 MBytes   469 Mbits/sec                  receiver
[  9]   0.00-1.00   sec  29.0 MBytes   243 Mbits/sec    0             sender
[  9]   0.00-1.00   sec  28.0 MBytes   234 Mbits/sec                  receiver
[ 11]   0.00-1.00   sec  29.0 MBytes   243 Mbits/sec    0             sender
[ 11]   0.00-1.00   sec  28.1 MBytes   235 Mbits/sec                  receiver
[SUM]   0.00-1.00   sec   229 MBytes  1.92 Gbits/sec    0             sender
[SUM]   0.00-1.00   sec   225 MBytes  1.88 Gbits/sec                  receiver

iperf Done.

As with LACP, this doesn’t magically double the speed of a single connection, but multiple streams do get load balanced over the multiple paths.
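
To see why a single connection stays on one link, here is a toy Python model of per-flow path selection. With net.ipv4.fib_multipath_hash_policy set to 1, the kernel hashes on layer 4 fields (addresses and ports), so every packet of one TCP connection lands on the same next hop. The CRC32 hash and the pick_nexthop name below are stand-ins chosen for illustration; this is not the kernel's actual hash function.

```python
import zlib

NEXTHOPS = ["enp0s20f2", "enp0s20f3"]  # the blue and yellow links from this post


def pick_nexthop(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Toy per-flow selection: hash the 5-tuple, then reduce modulo the path count.

    NOT the kernel's hash; it only illustrates that the choice depends on the
    flow's addresses and ports, so one connection never spreads over both links.
    """
    key = f"{proto}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    return NEXTHOPS[zlib.crc32(key) % len(NEXTHOPS)]
```

Different source ports give different hash inputs, which is why the four iperf streams above can spread over both links, though not necessarily evenly.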

I’m only using cheap 1000BASE-T switches here, and I’m only demonstrating this with two nodes. However, one might imagine putting dual-port NICs in a bunch of servers, finding the cheapest possible unmanaged 10GBASE-T switches, and getting effectively 20 Gbit/s between many hosts on the cheap.

limitations

Unfortunately, OSPF in FRR seems to lack an equivalent of FRR BGP’s bgp fast-convergence option. If I power cycle a switch, it takes about a minute for the route to be re-established. This is fine as long as you keep the delay in mind when moving cables around or rebooting things.

The bigger caveat perhaps has to do with what address ‘client’ programs on either node choose to bind to when speaking to other nodes.

If I simply ssh 10.0.0.2 from 10.0.0.1, we get a connection from 10.0.1.1.

sshd-session[3027]: Accepted publickey for root from 10.0.1.1 port 36368

This is a problem, because 10.0.1.1 is the client node’s link to the spine! That is to say, if that spine gets power cycled, 10.0.1.1 (the interface) will go down too, and the TCP connection and the ssh session will be dropped.

We want things to bind to the loopback addresses, not the interfaces. We can do this with ssh:

ssh 10.0.0.2 -b 10.0.0.1

This does indeed survive power cycling of each switch in turn, with no TCP connections dropped.

The problem here is that we will have to figure out how to achieve the same functionality with all the software using the link. Catching every instance of this is going to be quite the chore.

For example, NFS has srcaddr.
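
For software you write yourself, the equivalent of ssh -b is to bind the socket to the loopback address before connecting. A minimal Python sketch, assuming the addresses from this post; connect_from is a hypothetical helper name, while socket.create_connection and its source_address parameter are part of the standard library.

```python
import socket


def connect_from(source_ip: str, dest_ip: str, dest_port: int,
                 timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection whose local address is pinned to source_ip.

    Source port 0 lets the kernel pick an ephemeral port.
    """
    return socket.create_connection(
        (dest_ip, dest_port),
        timeout=timeout,
        source_address=(source_ip, 0),
    )
```

On the first node, connect_from("10.0.0.1", "10.0.0.2", 22) would open a connection that the far side sees as coming from 10.0.0.1 rather than from 10.0.1.1.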

One of the larger items on my overall wishlist for Linux is a way to set a ‘default source address for everything’. In the routing world, it is normal for a router’s primary ‘identity’ to be its /32 loopback address, not whatever happens to be assigned to its various interfaces. It is somewhat unfortunate that history did not make this the norm for servers as well.

As a final note, this caveat exposes yet another convenience of BGP unnumbered on Linux: if the only address on a system with arbitrarily many interfaces is the /32 loopback address, programs have no choice but to bind to it, and we don’t run into this issue.

Nathan Hensel