CohortFS OSD hits 6,688 MB/s randwrite to memory with network-bypass transport
As the CohortFS team sprints toward its first public software release (Rel0), the Cohort OSD developers are focusing intense effort on performance of the OSD I/O pipeline.
A CohortFS OSD shares a lot of code with a Ceph OSD, including network transports and run-time selectable backing stores (Memory, OS Filesystem, Key/Value). But there’s a lot of very different structure and behavior, driven by Cohort’s volume architecture and semantics. In CohortFS, a volume is a pluggable data set abstraction, and Cohort data placement, I/O semantics and I/O pipeline vary by volume type. So in a Cohort cluster, volumes using deterministic, pseudo-random Ceph-style RUSH and CRUSH data placement could sit side-by-side with CohortFS-style client erasure-coded volumes doing high-performance RAID-on-the-net.
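To make "deterministic, pseudo-random placement" concrete, here is a minimal sketch using rendezvous (highest-random-weight) hashing. This is a stand-in technique, not RUSH or CRUSH and not Cohort's placement code; the function name `place`, the OSD names and the object name are all hypothetical. It only illustrates the shared idea that any client can compute where an object lives from its name alone, with no lookup table.

```python
import hashlib

def place(object_name: str, osds: list[str], replicas: int = 2) -> list[str]:
    """Pick `replicas` OSDs for an object by highest-random-weight hashing.

    Illustrative stand-in only: real CRUSH also models failure domains
    and device weights, which this sketch ignores.
    """
    def score(osd: str) -> int:
        # Hash the (object, OSD) pair; the same inputs always give the same score.
        digest = hashlib.sha256(f"{object_name}/{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The OSDs with the highest scores hold the object.
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3"]  # hypothetical cluster
first = place("volume-a/object-17", osds)
second = place("volume-a/object-17", osds)   # same inputs, same answer
```

Because each OSD is scored independently, removing an OSD that does not hold the object leaves that object's placement unchanged, which is the property that keeps data movement small when the cluster changes.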
For Rel0, the Cohort team is focused on optimizing performance for workloads that traditional Ceph doesn’t do well, using a bottom-up approach of streamlining the I/O pipeline, and re-working abstractions that limit current Ceph’s sequential or small random I/O operation performance and bandwidth.
The work Cohort is best known for in the Ceph community is XioMessenger, an RDMA transport prototype we built in cooperation with Mellanox that gets 750K IOPS / 6000 MB/s bandwidth in standalone testing. The goal for Rel0 is to deliver as much of XioMessenger’s available performance as possible all the way to the Cohort OSD backing store. We’re still some way from that goal, but we’re making great progress.
In real systems we're normally concerned with reading and writing to persistent storage, which has relatively high latency, but when optimizing we frequently want to understand the performance available when unrestricted by the latencies of disks or even SSDs. A Cohort (or Ceph) backing store performs best when the OSD is reading and writing straight to memory, with no file system or database overhead. The Cohort version of Ceph’s memory backing store started off with a bandwidth limit of around 500 MB/s on random write workloads to a few objects. By summer 2014, we had that number up to ~8000 MB/s, close to the maximum available performance on our test systems.
The Cohort developers removed a number of performance limitations in their version of MemStore. To do it, we constructed new, shortened I/O pipelines from workload generation clients. The most important mechanisms we developed for workload simulation are DirectMessenger and a new re-entrant version of the Cohort OSD, which together let an instance of the fio workload-generation client run I/O workloads directly against the Cohort OSD MemStore (or other ObjectStore), bypassing the RBD and RADOS client subsystems and the ordinary network transport.
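The shape of that shortened pipeline can be sketched as a toy: a workload generator calling an in-memory object store directly, in the same process, with no client stack or network in between. Everything here is hypothetical (`ToyMemStore`, `run_randwrite`, the object names and sizes) and is unrelated to the real MemStore, libosd or fio code; Python throughput from this toy says nothing about Cohort's numbers.

```python
import random
import time

class ToyMemStore:
    """Hypothetical in-memory object store: object name -> bytearray."""

    def __init__(self):
        self.objects = {}
        self.bytes_written = 0

    def write(self, name, offset, data):
        obj = self.objects.setdefault(name, bytearray())
        end = offset + len(data)
        if len(obj) < end:
            obj.extend(bytes(end - len(obj)))  # zero-fill up to the end of this write
        obj[offset:end] = data                 # overwrite in place
        self.bytes_written += len(data)

def run_randwrite(store, n_objects=4, object_size=4 << 20,
                  block_size=256 << 10, n_ops=64, seed=0):
    """Issue block-aligned random writes to a few objects, calling the store directly."""
    rng = random.Random(seed)
    block = bytes(block_size)
    blocks_per_object = object_size // block_size
    start = time.perf_counter()
    for _ in range(n_ops):
        name = f"obj-{rng.randrange(n_objects)}"
        offset = rng.randrange(blocks_per_object) * block_size
        store.write(name, offset, block)       # direct call: no transport in the path
    elapsed = time.perf_counter() - start
    return store.bytes_written, elapsed

store = ToyMemStore()
total_bytes, elapsed = run_randwrite(store)
```

The point of the design is isolation: with the client subsystems and transport removed, whatever limit the measurement hits belongs to the store and the request-processing path alone.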
The big result from the last few weeks of Cohort OSD development is some great performance numbers from libosd/DirectMessenger on an fio random write workload to memory backing, which reliably reaches almost 7000 MB/s (6,688 MB/s) with 256K blocks. The upshot is that the bandwidth of Cohort OSD’s core request processing abstraction exceeds the available bandwidth of XioMessenger on 56-gigabit InfiniBand, and so imposes no rate limit on end-to-end bandwidth in this configuration.
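A quick back-of-the-envelope check puts these figures side by side. The sketch below assumes decimal megabytes (1 MB = 10^6 bytes) and 256K = 256 KiB, neither of which the numbers above specify, and it uses the raw 56 Gb/s signaling rate, before any encoding or protocol overhead, as the link ceiling. The helper names are hypothetical.

```python
MB = 1_000_000   # assumption: decimal megabytes
KIB = 1024       # assumption: "256K" means 256 KiB

def link_bandwidth_mb_s(gigabits_per_s):
    """Raw link rate in MB/s, ignoring encoding and protocol overhead."""
    return gigabits_per_s * 1e9 / 8 / MB

def ops_per_second(mb_per_s, block_bytes):
    """Operations per second implied by a bandwidth at a fixed block size."""
    return mb_per_s * MB / block_bytes

link_raw = link_bandwidth_mb_s(56)         # raw ceiling of a 56-gigabit link
xio_standalone = 6000                      # MB/s, XioMessenger standalone figure
memstore_randwrite = 6688                  # MB/s, libosd/DirectMessenger randwrite
ops = ops_per_second(memstore_randwrite, 256 * KIB)
headroom = memstore_randwrite - xio_standalone
```

Under these assumptions the raw link ceiling is 7000 MB/s, the memory-backed result sits 688 MB/s above the standalone transport figure, and 6,688 MB/s at 256K blocks works out to roughly 25,500 writes per second. That operation rate is modest, which is why small-operation workloads are a separate problem from bandwidth.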
The next frontier is I/O performance on a wider range of workloads, including small-operation dominated workloads, and end-to-end I/O performance from actual clients.
Over the last few weeks, we’ve begun to get results from the Ceph RADOS benchmarking tool, rados bench, with the latest XioMessenger RDMA transport, and the outlook is promising.
There’s a lot more to do, more links to optimize in the Cohort I/O pipeline, and also new work on Cohort backing stores suitable for real storage applications. We’ll be posting updates on our progress!