Improving Ceph IOPS with TuneD on interconnecting routers

The two biggest factors in real-world performance of any network storage platform are network latency and disk latency. By “real-world performance” I of course mean synchronous 4k random writes.

In replicated ceph, latency is even more critical to cluster performance than it is for nfs or cifs. To write data, a client writes to the placement group’s primary OSD, which then independently handles replication to the remaining replica OSDs. The client has to wait until all replication has finished, meaning a write involves at least 4 logical network hops, assuming the client and OSDs are equally connected to a switch or router:

  • client -> primary OSD
  • primary OSD -> replica OSDs
  • replica OSDs -> primary write ack
  • primary -> client write ack

I do not have particularly deep knowledge of the inner workings of ceph - so the reality is likely more complicated than this - but this basic model at least illustrates the problem.
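To put rough numbers on it (illustrative assumptions, not measurements): if each logical hop costs something like 0.15 ms one-way, the network alone puts a floor under synchronous queue-depth-1 writes before the disks even enter the picture:

# back-of-the-envelope network floor for a qd=1 sync write
# 0.15 ms one-way per hop is an assumption, not a measurement
awk 'BEGIN {
  hop_ms = 0.15
  hops   = 4        # the four hops listed above
  net_ms = hop_ms * hops
  printf "network floor: %.2f ms/write, max %.0f IOPS at qd=1\n", net_ms, 1000 / net_ms
}'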


Enter tuned, and specifically its network-latency profile. What follows is my testing of its effect on linux routers.
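tuned ships each profile as a plain config file, so you can see exactly what network-latency will change before applying it. These paths and subcommands are from recent tuned packages and may differ slightly by distro:

tuned-adm list                                  # available profiles
tuned-adm profile_info network-latency          # summary of the profile
cat /usr/lib/tuned/network-latency/tuned.conf   # the settings it applies
tuned-adm profile network-latency               # apply it
tuned-adm active                                # confirm what is active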

I have 4 ceph nodes interconnected by two spines, plus a client. Everything runs debian, with frr speaking bgp. OSDs are consumer sata ssds, and all network links are 1000BASE-T.

Here are pings across a single spine. Two runs of 100 each. In all tests, I am applying tuned changes to routers only, not ceph nodes.

tuned-adm profile balanced:

ping 10.0.200.1 -c 100
rtt min/avg/max/mdev = 0.219/0.299/0.622/0.053 ms
rtt min/avg/max/mdev = 0.220/0.310/0.908/0.091 ms

tuned-adm profile network-latency:

ping 10.0.200.1 -c 100
rtt min/avg/max/mdev = 0.192/0.281/1.112/0.099 ms
rtt min/avg/max/mdev = 0.185/0.263/0.448/0.034 ms

~11% improvement in mean rtt.
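That figure is just the change in mean rtt across the two runs:

awk 'BEGIN { b = (0.299 + 0.310) / 2; n = (0.281 + 0.263) / 2;
  printf "%.3f ms -> %.3f ms, %.1f%% lower\n", b, n, 100 * (b - n) / b }'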


fio randwrite on cephfs, mounted on an equally-connected non-ceph leaf:

tuned-adm profile balanced on both spines:

rm f; fio --filename=f --size=1GB --direct=1 --rw=randwrite --bs=4k --runtime=120 --numjobs=8 --time_based --group_reporting --name=fio
  write: IOPS=259, BW=1040KiB/s (1065kB/s)(122MiB/120059msec); 0 zone resets

tuned-adm profile network-latency on both spines:

rm f; fio --filename=f --size=1GB --direct=1 --rw=randwrite --bs=4k --runtime=120 --numjobs=8 --time_based --group_reporting --name=fio
  write: IOPS=315, BW=1264KiB/s (1294kB/s)(148MiB/120030msec); 0 zone resets

~22% improvement in ceph IOPS (259 -> 315)!
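A way to sanity-check this: with numjobs=8 and fio's default iodepth of 1, each job is a serial stream of synchronous writes, so mean per-op latency is roughly numjobs divided by total IOPS:

# implied mean completion latency per write, from the group-reported IOPS
awk 'BEGIN {
  printf "balanced:        %.1f ms/op\n", 8 / 259 * 1000
  printf "network-latency: %.1f ms/op\n", 8 / 315 * 1000
}'

That is roughly 5.5 ms saved per op - more than the raw ping deltas alone would predict, presumably because each ceph write crosses the routers several times and queueing effects compound.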


Larger test:

tuned-adm profile balanced:

rm f; fio --filename=f --size=10GB --direct=1 --rw=randwrite --bs=4k --runtime=240 --numjobs=8 --time_based --group_reporting --name=fio
  write: IOPS=195, BW=783KiB/s (801kB/s)(183MiB/240062msec); 0 zone resets

tuned-adm profile network-latency:

rm f; fio --filename=f --size=10GB --direct=1 --rw=randwrite --bs=4k --runtime=240 --numjobs=8 --time_based --group_reporting --name=fio
  write: IOPS=215, BW=862KiB/s (883kB/s)(202MiB/240044msec); 0 zone resets

~10% improvement (195 -> 215 IOPS) over the longer-duration test.

This comes at a cost in power consumption - the profile keeps cpus out of their deeper sleep states - which is why I don’t apply it to all cluster nodes. These routers, however, are low-power Denverton boxes, so that 10% comes at the cost of just a few watts.



Nathan Hensel

on caving, mountaineering, networking, computing, electronics


2023-08-09