Project: CohortFS OSD

There was jubilation around the office back in October 2014 when the internal random-write bandwidth of the CohortFS OSD data path reached and then surpassed the available bandwidth of FDR InfiniBand (just under 7000 MB/s, at the time).

But while bandwidth performance was getting good, the corresponding minimal operation latency and IOPs still lagged far behind. The CohortFS RDMA messaging for Ceph protocols--built with support and collaboration from Mellanox--was by then capable of just over 750K IOPs on Mellanox FDR switches and ConnectX-3 HCAs (yes, there was still plenty of room for optimization--the equivalent Accelio benchmark achieved 1.1M IOPs). The data path wasn’t keeping up!

Five months of large-scale reorganization later, things are starting to gel. While bandwidth performance continues to improve -- the most recent libosd benchmarks show peak bandwidth of over 11000 MB/s on the hardware that formerly peaked at 7000 MB/s -- the OSD data path has begun to deliver IOPs worth reporting on. The October OSD baseline had difficulty exceeding 50K minimum-latency operations in any configuration, with high variability during runs. The long 50K interregnum was succeeded by March baselines which peaked at ~200K IOPs, with reduced variability. In the end we think we overcame the remaining latency barriers just by removing and replacing code until there was nowhere to hide them. The April baseline has tamed much of the variability in performance and latency that used to characterize our workload runs. At the same time, we’ve reached a deterministic small-operation rate of >1.3M IOPs on a 3.3 GHz Core i7, using 3 effective cores. There’s more speedup available, and scaling efficiently to more cores is still being worked on, but it feels like the IOPs drop ceiling has finally been ripped out of the CohortFS OSD.

Glossary (Just in case)

As the CohortFS team sprints toward its first public software release (Rel0), the Cohort OSD developers are focusing intense effort on performance of the OSD I/O pipeline.

A CohortFS OSD shares a lot of code with a Ceph OSD, including network transports and run-time selectable backing stores (Memory, OS Filesystem, Key/Value). But there’s a lot of very different structure and behavior, driven by Cohort’s volume architecture and semantics. In CohortFS, a volume is a plug-out data set abstraction, and Cohort data placement, I/O semantics, and I/O pipeline all vary by volume type. So in a Cohort cluster, deterministic, pseudo-random Ceph-style RUSH and CRUSH data placement volumes could sit side-by-side with CohortFS-style client erasure-coded volumes doing high-performance RAID-on-the-net.
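
As a rough illustration of what "vary by volume type" means (a hypothetical sketch, not the actual CohortFS class hierarchy; the names Volume, place, CrushVolume and ErasureVolume are invented here), a volume type can be pictured as an interface that owns both data placement and the I/O pipeline for its objects:

    // Hypothetical sketch: each volume type decides both where object data
    // is placed and how I/O against it is processed.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct ObjectId { std::string name; };

    class Volume {
    public:
      virtual ~Volume() = default;
      // Map an object to the OSDs that hold it; the policy (CRUSH-style
      // pseudo-random placement, client-side erasure coding, ...) belongs
      // to the volume type, not to the cluster as a whole.
      virtual std::vector<int> place(const ObjectId& oid) const = 0;
      // Each volume type likewise supplies its own write/read pipeline.
      virtual int write(const ObjectId& oid, uint64_t off,
                        const std::vector<uint8_t>& data) = 0;
    };

    // A CRUSH-placed volume and a client erasure-coded volume can then sit
    // side-by-side in one cluster behind the same interface.
    class CrushVolume : public Volume { /* pseudo-random placement ... */ };
    class ErasureVolume : public Volume { /* client RAID-on-the-net ... */ };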

For Rel0, the Cohort team is focused on optimizing performance for workloads that traditional Ceph doesn’t do well, using a bottom-up approach of streamlining the I/O pipeline, and re-working abstractions that limit current Ceph’s sequential or small random I/O operation performance and bandwidth.

XioMessenger

The work Cohort is best known for in the Ceph community is an RDMA transport prototype we built in cooperation with Mellanox that gets 750K IOPs / 6000 MB/s bandwidth in standalone testing. The goal for Rel0 is to deliver as much of XioMessenger’s available performance all the way to the Cohort OSD backing store as possible. We’re still a ways away from that goal, but we’re making some great progress.

MemStore

In real systems we're normally concerned with reading and writing to persistent storage, which has relatively high latency, but in optimization work we frequently want to understand the available performance unrestricted by the latencies of disks or even SSDs. The best available performance in a Cohort (or Ceph) backing store comes when the OSD is reading and writing straight to memory, with no file system or database overhead. The Cohort version of Ceph’s memory backing store started off with a bandwidth ceiling of around 500 MB/s on random write workloads to a few objects. By summer 2014, we had that number up to ~8000 MB/s, close to the maximum available performance on our test systems.
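
For intuition only (a toy sketch, not MemStore's actual implementation; TinyMemStore and its methods are invented names), an in-memory backing store amounts to keeping object data in ordinary process memory:

    // Toy in-memory object store: data lives in a map in process memory,
    // so writes are bounded by memory bandwidth, copies, and locking rather
    // than by any disk, filesystem, or database.
    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <string>
    #include <vector>

    class TinyMemStore {
      std::map<std::string, std::vector<uint8_t>> objects_;
      std::mutex lock_;  // one global lock: simple, but a serialization point
    public:
      void write(const std::string& oid, uint64_t off,
                 const std::vector<uint8_t>& data) {
        std::lock_guard<std::mutex> g(lock_);
        auto& obj = objects_[oid];
        if (obj.size() < off + data.size())
          obj.resize(off + data.size());
        std::copy(data.begin(), data.end(), obj.begin() + off);
      }
      std::vector<uint8_t> read(const std::string& oid,
                                uint64_t off, uint64_t len) {
        std::lock_guard<std::mutex> g(lock_);
        auto it = objects_.find(oid);
        if (it == objects_.end() || off >= it->second.size()) return {};
        uint64_t n = std::min<uint64_t>(len, it->second.size() - off);
        return std::vector<uint8_t>(it->second.begin() + off,
                                    it->second.begin() + off + n);
      }
    };

Serialization points of roughly this shape (a single big lock, per-write allocation and copying) are the sort of limit that presumably had to be engineered out to move from ~500 MB/s toward memory speed.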

libosd, DirectMessenger

The Cohort developers removed a number of performance limitations in their version of MemStore. To do so, we constructed new, shortened I/O pipelines from workload-generation clients. The most important mechanisms we developed for workload simulation are DirectMessenger and a new re-entrant version of the Cohort OSD (libosd), which let an instance of the fio workload-generation client run I/O workloads directly against the Cohort OSD MemStore (or other ObjectStore), bypassing the RBD and RADOS client subsystems and the ordinary network transport.
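
Conceptually (the sketch below is hypothetical and not the actual libosd or DirectMessenger API; Message, DirectEndpoint, connect and send are invented names), DirectMessenger replaces the network "wire" with a direct call into the peer's dispatch path:

    // Conceptual sketch: the "transport" between the fio client and the
    // embedded OSD is just a direct call into the peer's dispatcher, so a
    // request reaches the ObjectStore with no RADOS client stack and no
    // network round trip in between.
    #include <cstdint>
    #include <functional>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    struct Message {              // stand-in for an OSD op
      std::string object;
      uint64_t offset = 0;
      std::vector<uint8_t> data;
    };

    class DirectEndpoint {
      std::function<void(std::unique_ptr<Message>)> peer_dispatch_;
    public:
      // "Connect" by remembering the peer's dispatch function.
      void connect(std::function<void(std::unique_ptr<Message>)> dispatch) {
        peer_dispatch_ = std::move(dispatch);
      }
      // "Send" by invoking it directly -- no serialization, no socket.
      void send(std::unique_ptr<Message> m) { peer_dispatch_(std::move(m)); }
    };

An fio engine wired up this way generates ops just as a real client would, but every op lands in the OSD's request-processing path in the same process, so the libosd numbers measure the OSD pipeline itself rather than the transport.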

The big result from the last few weeks of Cohort OSD development is some great performance numbers from libosd/DirectMessenger on an fio random write workload to memory backing, which reliably reaches almost 7000 MB/s with 256K blocks. The upshot is that the bandwidth of the Cohort OSD’s core request-processing abstraction now meets or exceeds the available bandwidth of XioMessenger on 56-gigabit InfiniBand, and so imposes no rate limit on end-to-end bandwidth in this configuration.

RADOS Bench

The next frontier is I/O performance on a wider range of workloads, including small-operation dominated workloads, and end-to-end I/O performance from actual clients.

Over the last few weeks, we’ve begun to get results from the Ceph RADOS benchmarking tool rados bench, with the latest XioMessenger RDMA transport, and the outlook is promising.

There’s a lot more to do, more links to optimize in the Cohort I/O pipeline, and also new work on Cohort backing stores suitable for real storage applications. We’ll be posting updates on our progress!