
CohortFS OSD Reaches 1.3M Network-Bypass IOPs

There was jubilation around the office back in October 2014 when the internal bandwidth of the CohortFS OSD data path reached and then surpassed the random-write bandwidth of FDR InfiniBand (just under 7000MB/s at the time).
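As a back-of-the-envelope check on that figure, the usable data rate of a 4x FDR InfiniBand link can be derived from its per-lane signaling rate and line encoding (a sketch; the constants are standard FDR parameters, not figures from this post):

```python
# Where "just under 7000MB/s" comes from: 4x FDR InfiniBand link math.
lanes = 4                    # 4x link width
signaling_gbps = 14.0625     # FDR per-lane signaling rate, Gb/s
encoding = 64 / 66           # 64b/66b line-encoding efficiency

data_rate_gbps = lanes * signaling_gbps * encoding
mb_per_s = data_rate_gbps * 1e9 / 8 / 1e6   # bits/s -> MB/s
print(f"{mb_per_s:.0f} MB/s")               # ~6818 MB/s
```

Protocol headers and other overheads eat into this further, which is why achievable random-write bandwidth lands just under 7000MB/s.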

But while bandwidth was getting good, the corresponding minimal operation latency and IOPs still lagged far behind. The CohortFS RDMA messaging for Ceph protocols--built with support and collaboration from Mellanox--was by then capable of just over 750K IOPs on Mellanox FDR switches and ConnectX-3 HCAs (yes, there was still plenty of room for optimization--the equivalent Accelio benchmark achieved 1.1M IOPs). The data path wasn’t keeping up!

Five months of large-scale reorganization later, things are starting to gel. While bandwidth continues to improve -- the most recent libosd benchmarks show peak bandwidth of over 11000MB/s on the hardware that formerly peaked at 7000MB/s -- the OSD data path has begun to deliver IOPs worth reporting on.

The October OSD baseline had difficulty exceeding 50K minimum-latency operations in any configuration, with high variability during runs. That long 50K interregnum gave way to March baselines that peaked at ~200K IOPs, with reduced variability. In the end, we think we overcame the remaining latency barriers simply by removing and replacing code until there was nowhere left to hide them. The April baseline has tamed much of the variability in performance and latency that used to characterize our workload runs. At the same time, we’ve reached deterministic small-operation latency at >1.3M IOPs on a 3.3GHz Core i7, using 3 effective cores. There’s more speedup available, and scaling efficiently to more cores is still being worked on, but it feels like the IOPs drop ceiling has finally been ripped out of the CohortFS OSD.
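To put the April numbers in perspective, the per-core service time they imply can be derived directly (a back-of-the-envelope sketch; the 1.3M IOPs and 3-core figures are from the baseline described above):

```python
# Per-op service time implied by the April baseline.
iops = 1.3e6    # aggregate small-op throughput
cores = 3       # effective cores used in the run

per_core_iops = iops / cores                # ops/s handled by each core
per_op_latency_us = 1e6 / per_core_iops     # microseconds per op per core

print(f"{per_core_iops:.0f} ops/s per core, "
      f"{per_op_latency_us:.2f} us per op")  # ~433333 ops/s, ~2.31 us
```

Roughly 2.3 microseconds of OSD data-path work per operation per core, which is the budget any remaining latency barrier would have had to hide inside.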

Glossary (Just in case)