Verifying Linux multipath route scaling with iperf

I have a number of servers in a Clos topology, each connected over 1000BASE-T Ethernet to two spine routers. The BGP daemons on these servers install equal-cost routes into the kernel routing table:

On 10.0.200.1:

10.0.200.2 nhid 18 proto bgp metric 20
	nexthop via inet6 fe80::2e0:edff:fe0a:bdb2 dev bgpenp35s0 weight 1
	nexthop via inet6 fe80::2e0:edff:fe57:55be dev bgpenp36s0 weight 1

On 10.0.200.2:

10.0.200.1 nhid 16 proto bgp metric 20
	nexthop via inet6 fe80::2e0:edff:fe0a:bdb0 dev bgpenp35s0 weight 1
	nexthop via inet6 fe80::2e0:edff:fe57:55bc dev bgpenp36s0 weight 1
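
These entries are in the standard iproute2 format; on each host the multipath route and the nexthop group behind it can be re-displayed with something like the following (the nhid values, 18 and 16 above, differ per host):

ip route show 10.0.200.2                # or 10.0.200.1, depending on the host
ip nexthop show id 18                   # the nexthop group referenced by nhid 18
ip -6 neighbour show dev bgpenp35s0     # link-local addresses of the spines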

I recently learned that the Linux kernel’s default ECMP hash policy is layer 3 only (source IP + destination IP); however, the sysctl net.ipv4.fib_multipath_hash_policy can be set to enable L4 hashing. L4 hashing also incorporates the source and destination ports, which means we should be able to see its effect with iperf’s -P option: each new TCP connection gets a random ephemeral client port.
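
The knob can be read back with sysctl before changing anything: 0 selects the default L3 hash and 1 the L4 five-tuple hash (newer kernels also accept additional values for hashing on the inner headers of tunneled traffic and on custom field sets). If you also carry IPv6 traffic, there is a separate net.ipv6.fib_multipath_hash_policy with the same meaning. To make the change survive a reboot, a drop-in under /etc/sysctl.d works; the file name below is arbitrary:

sysctl net.ipv4.fib_multipath_hash_policy
sysctl net.ipv6.fib_multipath_hash_policy

echo 'net.ipv4.fib_multipath_hash_policy = 1' > /etc/sysctl.d/90-ecmp-l4.conf
sysctl --system

For the tests below, though, I start from the default L3 policy: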

sysctl net.ipv4.fib_multipath_hash_policy=0
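
Only the client side is shown in the captures below; on 10.0.200.1 an iperf server is listening on the default TCP port 5001, presumably started with nothing more than:

iperf -s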

iperf -c 10.0.200.1 -P3
------------------------------------------------------------
Client connecting to 10.0.200.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 10.0.200.2 port 35336 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/402)
[  1] local 10.0.200.2 port 35312 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/434)
[  2] local 10.0.200.2 port 35326 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/412)
[ ID] Interval       Transfer     Bandwidth
[  2] 0.0000-10.0546 sec   394 MBytes   329 Mbits/sec
[  3] 0.0000-10.0543 sec   395 MBytes   330 Mbits/sec
[  1] 0.0000-10.0543 sec   394 MBytes   329 Mbits/sec
[SUM] 0.0000-10.0202 sec  1.16 GBytes   991 Mbits/sec

As expected, with source-IP/destination-IP hashing all three streams share the same hash and therefore the same nexthop, so we only utilize one physical link even with three parallel streams.
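
One way to confirm this at the link level, rather than inferring it from the aggregate bandwidth, is to watch the per-interface counters on the two fabric ports while a run is in progress; with the L3 policy only one of the two TX counters should be moving appreciably:

watch -d -n 1 'ip -s link show dev bgpenp35s0; ip -s link show dev bgpenp36s0'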

sysctl net.ipv4.fib_multipath_hash_policy=1

iperf -c 10.0.200.1 -P3
------------------------------------------------------------
Client connecting to 10.0.200.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.0.200.2 port 37846 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/325)
[  2] local 10.0.200.2 port 37838 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/305)
[  3] local 10.0.200.2 port 37852 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/228)
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-10.0240 sec   586 MBytes   490 Mbits/sec
[  3] 0.0000-10.0236 sec   589 MBytes   493 Mbits/sec
[  2] 0.0000-10.0236 sec  1.15 GBytes   986 Mbits/sec
[SUM] 0.0000-10.0053 sec  2.30 GBytes  1.97 Gbits/sec

With layer 4 hashing, we get full utilization of both links(!)

The catch here is that any given attempt can still, by chance, hash every connection onto the same link; with two links and three connections that happens roughly a quarter of the time (assuming the random source ports hash uniformly), and the likelihood keeps dropping as the number of connections grows, which is why I’m demonstrating this with -P3. Notice how the individual streams come in at 490 Mbit/s, 493 Mbit/s, and 986 Mbit/s, clearly showing which two connections shared a physical link.
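
Recent kernels and iproute2 can also be asked directly which nexthop a given five-tuple will hash to, which makes it easy to check up front whether two flows will collide on a link. Run on the client (10.0.200.2), with the source ports taken from the run above:

ip route get 10.0.200.1 ipproto tcp sport 37838 dport 5001
ip route get 10.0.200.1 ipproto tcp sport 37846 dport 5001

With the L3 policy both commands should report the same outgoing interface regardless of port; with the L4 policy the selected interface should vary with the source port.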

A single TCP connection still hashes onto a single link, so this isn’t going to magically make file transfers or database dumps over the network faster, but my hope is that site-wide operations issuing many connections (like a Ceph rebalance) should see greater network utilization.

If your application does have some way of using multiple TCP connections (NFS nconnect?), I would expect to see an improvement.
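
As a concrete example, NFS clients on recent kernels support the nconnect mount option, which opens several TCP connections to the same server and should therefore spread across both links under L4 hashing. The server name and paths here are placeholders:

mount -t nfs -o nconnect=4,vers=4.2 nfsserver:/export /mnt/export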


Nathan Hensel



2023-07-23