I have a number of servers, each connected over 1000BASE-T Ethernet to two spine routers in a Clos topology. The BGP daemons on these servers install equal-cost routes into the kernel routing table:
On 10.0.200.1:
10.0.200.2 nhid 18 proto bgp metric 20
nexthop via inet6 fe80::2e0:edff:fe0a:bdb2 dev bgpenp35s0 weight 1
nexthop via inet6 fe80::2e0:edff:fe57:55be dev bgpenp36s0 weight 1
On 10.0.200.2:
10.0.200.1 nhid 16 proto bgp metric 20
nexthop via inet6 fe80::2e0:edff:fe0a:bdb0 dev bgpenp35s0 weight 1
nexthop via inet6 fe80::2e0:edff:fe57:55bc dev bgpenp36s0 weight 1
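For reference, these are straight out of the kernel FIB via iproute2; something like the following should reproduce them (ip nexthop assumes a reasonably recent kernel and iproute2 with nexthop object support):
ip route show proto bgp
ip nexthop show id 16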
I recently learned that the Linux kernel’s default ECMP hash policy only looks at layer 3 (source IP + destination IP); however, the sysctl net.ipv4.fib_multipath_hash_policy can enable L4 hashing. L4 hashing also incorporates the source and destination ports, so we should see its effect with iperf’s -P option: each new TCP connection gets a random ephemeral client port.
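The kernel default is 0 (L3 hashing), which is easy to confirm before changing anything; there is an equivalent knob for IPv6 routes, net.ipv6.fib_multipath_hash_policy, though the flows below are IPv4 so only the v4 setting matters here:
sysctl net.ipv4.fib_multipath_hash_policy
sysctl net.ipv6.fib_multipath_hash_policy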
sysctl net.ipv4.fib_multipath_hash_policy=0
iperf -c 10.0.200.1 -P3
------------------------------------------------------------
Client connecting to 10.0.200.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.200.2 port 35336 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/402)
[ 1] local 10.0.200.2 port 35312 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/434)
[ 2] local 10.0.200.2 port 35326 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/412)
[ ID] Interval Transfer Bandwidth
[ 2] 0.0000-10.0546 sec 394 MBytes 329 Mbits/sec
[ 3] 0.0000-10.0543 sec 395 MBytes 330 Mbits/sec
[ 1] 0.0000-10.0543 sec 394 MBytes 329 Mbits/sec
[SUM] 0.0000-10.0202 sec 1.16 GBytes 991 Mbits/sec
As expected, source-IP + destination-IP hashing means we only utilize one physical link, even with three streams: all three connections share the same address pair and therefore the same hash.
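If you want to see this on the wire rather than inferring it from throughput, watch the per-interface counters on either host while the test runs, using the interface names from the routing tables above:
ip -s link show dev bgpenp35s0
ip -s link show dev bgpenp36s0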
sysctl net.ipv4.fib_multipath_hash_policy=1
iperf -c 10.0.200.1 -P3
------------------------------------------------------------
Client connecting to 10.0.200.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 1] local 10.0.200.2 port 37846 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/325)
[ 2] local 10.0.200.2 port 37838 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/305)
[ 3] local 10.0.200.2 port 37852 connected with 10.0.200.1 port 5001 (icwnd/mss/irtt=87/8948/228)
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-10.0240 sec 586 MBytes 490 Mbits/sec
[ 3] 0.0000-10.0236 sec 589 MBytes 493 Mbits/sec
[ 2] 0.0000-10.0236 sec 1.15 GBytes 986 Mbits/sec
[SUM] 0.0000-10.0053 sec 2.30 GBytes 1.97 Gbits/sec
With layer 4 hashing, we get full utilization of both links(!)
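Depending on kernel and iproute2 version, you can also ask the FIB directly which next hop a specific flow would use: ip route get accepts ipproto, sport and dport selectors on reasonably recent systems. Using the ports from the second run above, on 10.0.200.2:
ip route get 10.0.200.1 ipproto tcp sport 37838 dport 5001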
The catch is that some attempts will still, by chance, hash every connection onto the same link; the likelihood of that decreases as the number of connections grows, which is why I’m demonstrating this with -P3. Notice how the individual streams came in at 490 Mbits/sec, 493 Mbits/sec, and 986 Mbits/sec, clearly showing which two connections shared a physical link.
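As a back-of-the-envelope check: assuming the hash spreads ephemeral ports uniformly and independently, the probability that all three connections land on the same one of two links is 2 × (1/2)³ = 25%, so an occasional single-link run with -P3 is entirely expected, and the odds shrink quickly as you add connections.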
This isn’t going to magically make a single-connection file transfer or database dump over the network any faster, but my hope is that site-wide operations that open many connections (like a Ceph rebalance) will see greater overall network utilization.
If your application does have a way of using multiple TCP connections (NFS’s nconnect mount option, perhaps?), I would expect to see an improvement.
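One practical note: the sysctl doesn’t survive a reboot on its own. If you decide to keep L4 hashing, the usual way to persist it is a drop-in under /etc/sysctl.d (the filename below is just my own choice):
echo 'net.ipv4.fib_multipath_hash_policy = 1' > /etc/sysctl.d/90-ecmp-l4.conf
sysctl --system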
Further reading: