The two biggest factors in real-world performance of any network storage platform are network latency and disk latency. By “real-world performance” I of course mean synchronous 4k random writes.
In replicated Ceph, latency is even more critical to cluster performance than it is for NFS or CIFS. To write data, a client locates (via CRUSH) the primary OSD for the object's placement group and writes to it; the primary then independently handles replication to the remaining replicas. The client has to wait until all replication is finished, meaning a write involves at least 4 logical network hops, assuming the client and OSDs are equally connected to a switch or router:
- client -> primary replica
- primary replica -> secondary/tertiary replicas
- secondary/tertiary replicas -> primary write ack
- primary -> client write ack
I do not have particularly deep knowledge of the inner workings of Ceph, so it is likely more complicated than this, but this basic understanding at least illustrates the problem.
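To put a rough number on that: the replication fan-out happens in parallel, so the critical path is about two round trips of pure network time per acknowledged write, before any disk latency at all. Plugging in the ~0.3 ms RTT measured below (purely a back-of-the-envelope sketch, not a Ceph-specific figure):
awk 'BEGIN { rtt_ms=0.3; rtts_per_write=2; printf "~%.0f IOPS ceiling per sync writer from network latency alone\n", 1000/(rtt_ms*rtts_per_write) }'
~1667 IOPS ceiling per sync writer from network latency alone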
Enter tuned, and specifically its network-latency profile. What follows is my testing of its effect on Linux routers.
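Before switching profiles it is worth seeing what network-latency actually does. Roughly speaking, it pulls in the latency-performance profile (performance governor, shallow C-states only) and adds a handful of network-oriented sysctls; the exact contents vary by tuned version, so check the shipped profile definition (on a typical Debian install it lives under /usr/lib/tuned/):
tuned-adm list
tuned-adm profile_info network-latency
cat /usr/lib/tuned/network-latency/tuned.conf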
I have 4 Ceph nodes interconnected with two spines, as well as a client. All run BGP (FRR) on Debian. OSDs are consumer SATA SSDs. Network links are all 1000BASE-T.
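As an aside: checking that a leaf really has ECMP routes via both spines, and that the BGP sessions to them are up, is quick with iproute2 and FRR (the prefix below is just an example):
ip route show 10.0.200.0/24
vtysh -c 'show ip bgp summary'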
Here are pings across a single spine, two runs of 100 each. In all tests, I am applying the tuned changes to the routers only, not the Ceph nodes.
tuned-adm profile balanced:
ping 10.0.200.1 -c 100
rtt min/avg/max/mdev = 0.219/0.299/0.622/0.053 ms
rtt min/avg/max/mdev = 0.220/0.310/0.908/0.091 ms
tuned-adm profile network-latency:
ping 10.0.200.1 -c 100
rtt min/avg/max/mdev = 0.192/0.281/1.112/0.099 ms
rtt min/avg/max/mdev = 0.185/0.263/0.448/0.034 ms
An 11% improvement in average RTT.
fio randwrite on CephFS, mounted on an equally-connected non-Ceph leaf:
tuned-adm profile balanced on both spines:
rm f; fio --filename=f --size=1GB --direct=1 --rw=randwrite --bs=4k --runtime=120 --numjobs=8 --time_based --group_reporting --name=fio
write: IOPS=259, BW=1040KiB/s (1065kB/s)(122MiB/120059msec); 0 zone resets
tuned-adm profile network-latency:
rm f; fio --filename=f --size=1GB --direct=1 --rw=randwrite --bs=4k --runtime=120 --numjobs=8 --time_based --group_reporting --name=fio
write: IOPS=315, BW=1264KiB/s (1294kB/s)(148MiB/120030msec); 0 zone resets
A ~22% improvement in Ceph IOPS (259 -> 315)!
Larger test:
tuned-adm profile balanced:
rm f; fio --filename=f --size=10GB --direct=1 --rw=randwrite --bs=4k --runtime=240 --numjobs=8 --time_based --group_reporting --name=fio
write: IOPS=195, BW=783KiB/s (801kB/s)(183MiB/240062msec); 0 zone resets
tuned-adm profile network-latency:
rm f; fio --filename=f --size=10GB --direct=1 --rw=randwrite --bs=4k --runtime=240 --numjobs=8 --time_based --group_reporting --name=fio
write: IOPS=215, BW=862KiB/s (883kB/s)(202MiB/240044msec); 0 zone resets
A ~10% improvement (195 -> 215 IOPS) over the longer test.
This comes at the cost of increased power consumption, which is why I don't apply it to all cluster nodes. These routers, however, are low-power Denverton boxes, so that improvement costs just a few watts.
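The extra draw is mostly the CPUs being held in the performance governor and out of deep C-states (the profile pins /dev/cpu_dma_latency to a low value while it is active). A rough way to compare before and after on a router, assuming the cpupower tools are installed:
cpupower frequency-info
cpupower idle-info
cpupower monitor sleep 30
cpupower monitor reports per-state idle residency over the sleep interval, which is where the difference between the two profiles shows up.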
Further reading: