I found it more comfortable to hack on Ceph by analysing the protocol
implementation and the monitor and OSD client code in the Linux kernel
sources, instead of the legacy OSD C++ code or (god forbid!) the Crimson
project. Here, in the kernel sources, the level of craziness is above all
permissible norms: I took (almost) all the Ceph sources from the net/ceph/
kernel path and built them in userspace as the simplest OSD possible, which
has a perfect name: `Pech` (German speakers will understand).

I do not use threads: I use cooperative scheduling and jump between task
contexts using setjmp()/longjmp() calls (a tiny illustrative sketch of this
jump-based switching is given further below). This model perfectly fits a
UP kernel with preemption disabled, so the adapted workqueue.c and timer.c
code runs the event loop. Again: no atomic operations, no locks, everything
runs in one thread. In the future the number of event loops can be made
equal to the number of physical CPUs, where each event loop is executed
from a dedicated pthread context and pinned to a particular CPU.

What does Pech OSD do?

o Connects to monitors and "boots" the OSD, i.e. marks it as UP.

o On Ctrl+C marks the OSD as DOWN on the monitors and gracefully exits.

o Requests supported by Pech OSD:

  Replication requests between OSDs:

   - OSD_REPOP
   - OSD_REPOPREPLY

  OSD operations supported in memory:

   - OP_WRITE
   - OP_WRITEFULL
   - OP_READ
   - OP_SYNC_READ
   - OP_STAT
   - OP_CALL
   - OP_OMAPGETVALS
   - OP_OMAPGETVALSBYKEYS
   - OP_OMAPSETVALS
   - OP_OMAPGETKEYS
   - OP_GETXATTR
   - OP_SETXATTR
   - OP_CREATE

So a simple fio/examples/rados.fio load can be run.

For RBD images (e.g. fio/examples/rbd.fio) the OSD class directory should
be specified, i.e. where your upstream Ceph build is located:
`--class_dir $CEPH/build/lib`. The thing is that RBD images require loading
OSD object classes, which are shared objects loaded by the OSD on an
OP_CALL request. The original Ceph OSD object classes can't simply be
loaded from Pech, because Pech is written in C and does not provide any C++
interfaces or Ceph common libraries, such as bufferlist. In order to
provide the missing C++ classes and the C++ interface we need a proxy
library, which acts as a bridge between the Pech OSD and the OSD object
classes. The Pech proxy library can be found here [2]; `make
ceph_pech_proxy` should be called in order to build it. When everything is
built and `class_dir` is specified, the rbd.fio load should work just fine.

Currently two replication models are supported: "primary-copy" and "chain"
(see the "replication" argument description further below). Pech OSD does
not support any failover (at least not yet), so consider the current state
of replication as "childish". But Pech OSD already follows transaction
semantics, and all object mutation operations are replicated to other OSDs
in one transaction request. This can already be useful and can be tested
on a real cluster.

There is also one more replication mode, called "client-based" replication,
where the client is responsible for data replication. Pech OSD checks every
CEPH_MSG_OSD_OP request for the CEPH_OSD_FLAG_DONT_REPLICATE flag; if it is
set, Pech OSD mutates objects locally without attempting to replicate and
replies with the result. In this mode the client is responsible for marking
the request with the DONT_REPLICATE flag and doing the replication itself.

What is not yet ported from the kernel sources?

o The crypto part is a noop, thus monitors should be run with auth=none.
  To make cephx work, a direct copy-paste of the kernel crypto sources has
  to be done, or a wrapper over the openssl library should be written; see
  the src/ceph/crypto.c interface stubs for details.
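As promised above, here is a tiny self-contained sketch of the jump-based
cooperative model. It is only an illustration, not the actual Pech code,
and all names in it are made up: an event loop marks a resume point with
setjmp() and a task gives the CPU back with longjmp(). The real OSD keeps
per-task contexts and resumes tasks in the middle of their work, which this
toy deliberately does not do.

  /*
   * Toy illustration only, not real Pech code: the event loop marks a
   * resume point with setjmp(), a task yields the CPU with longjmp().
   * Progress is kept in the task struct, so the task is simply
   * re-entered from the top on the next loop iteration.
   */
  #include <setjmp.h>
  #include <stdbool.h>
  #include <stdio.h>

  static jmp_buf sched_ctx;            /* where yields land             */

  struct task {
          int  step;                   /* progress lives outside stack  */
          bool done;
  };

  static void task_yield(void)
  {
          longjmp(sched_ctx, 1);       /* give the CPU back to the loop */
  }

  static void task_run(struct task *t)
  {
          while (t->step < 3) {
                  printf("task step %d\n", t->step++);
                  task_yield();        /* cooperative scheduling point  */
          }
          t->done = true;
  }

  int main(void)
  {
          static struct task t;        /* static: survives longjmp()    */

          while (!t.done) {            /* single-threaded event loop    */
                  if (setjmp(sched_ctx) == 0)
                          task_run(&t);  /* run until yield or finish   */
          }
          return 0;
  }

No atomics and no locks are needed here for exactly the same reason as in
Pech itself: there is only one thread, and control is handed over only at
explicit yield points.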
What is the Great Idea behind?

o I need an easily hackable OSD in C with the IO path only, without
  failover, log-based replication, the PG layer and all the other things.
  I want to test different replication strategies (client-based,
  primary-copy, chain) having the simplest and fastest file storage (yes,
  a step back to filestore) which reads and writes directly to files.
  Eventually this Pech OSD can be a starting point for something
  different, something which is not RADOS, which is fast, with minimum IO
  ordering requirements, and which acts as a RAID 1 cluster, e.g.
  something like what is described here [1].

Make:

$ make -j8

Start a new Ceph cluster with 1 OSD and then stop everything. We start the
monitors on the specified port, with the v1 protocol and with the -X
option, i.e. auth=none:

$ CEPH_PORT=50000 MON=1 MDS=0 OSD=1 MGR=0 ../src/vstart.sh --memstore \
      -n -X --msgr1
$ ../src/stop.sh

Restart only the Ceph monitor(s):

$ MON=1 MDS=0 OSD=0 MGR=0 ../src/vstart.sh

Start pech-osd accessing the monitor over the v1 protocol:

$ OSD=0; ./pech-osd --mon_addrs ip.ip.ip.ip:50001 --server_ip=0.0.0.0 \
      --name $OSD --fsid `cat ./osd$OSD/fsid` --log_level 5

For the case when RBD images are required, the Pech proxy should be built
and `class_dir` should be specified:

$ OSD=0; ./pech-osd --mon_addrs ip.ip.ip.ip:50001 --server_ip 0.0.0.0 \
      --name $OSD --fsid `cat ./osd$OSD/fsid` \
      --class_dir $CEPH/build/lib --log_level 5

For DEBUG purposes the maximum output log level can be specified:
--log_level 7

The replication model can be specified with the "replication" argument,
for example:

$ OSD=0; ./pech-osd --mon_addrs ip.ip.ip.ip:50001 --server_ip 0.0.0.0 \
      --name $OSD --fsid `cat ./osd$OSD/fsid` \
      --class_dir $CEPH/build/lib --log_level 5 --replication primary-copy

Only "primary-copy" and "chain" are supported for now.

Beware: the "client-based" replication model should be enabled on the
client side; in other words, the client is responsible for marking the
OP_WRITE or OP_WRITEFULL request with a special flag and then forwarding
it to all replicas. In order to test network throughput with client-based
replication, a modified rbd kernel module can be taken from [3] and the
device can then be mapped with a special option, "-o client_based_replication":

$ rbd device map rbd/fio_test -o nocrc -o client_based_replication

The rbd tool requires a small patch, which is here [2].

Have fun!

--
Roman

[1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/N46NR7NBHWBQL4B2ASU7Y2LMKZZPK3IX/
[2] https://github.com/rouming/ceph/tree/pech-osd
[3] https://github.com/rouming/linux/tree/akpm--ceph-client-based-replication