pech: A C repository from rouming

I found more comfortable to hack Ceph, analysing protocol
implementation, monitor and OSD client code reading linux kernel
sources, instead of legacy OSD C++ code or (god forbid!) Crimson
project.

Here, in the sources, the level of craziness is above all permissible
norms: I took (almost) all Ceph sources from net/ceph/ kernel path and
build them in userspace as a simplest OSD possible which has a perfect
name `Pech` (Germans speakers will understand).

I do not use threads, I use cooperative scheduling and jump from task
contexts using setjmp()/longjmp() calls. This model perfectly fits UP
kernel with disabled preemption, thus adapted workqueue.c and timer.c
code runs the event loop.

So again, no atomic operations, no locks, everything is one thread.
In future number of event loops can be equal to a number of physical
CPUs, where each event loop is executed from a dedicated pthread
context and pinned to a particular CPU.

What Pech OSD does?

o Connects to monitors and "boots" OSD, i.e. marks it as UP.
o On Ctrl+C marks OSD on monitors as DOWN and gracefully exits.
o Requests supported by Pech OSD:

Replication requets between OSDs:

- OSD_REPOP
- OSD_REPOPREPLY

OSD operations supported in memory:

- OP_WRITE
- OP_WRITEFULL
- OP_READ
- OP_SYNC_READ
- OP_STAT
- OP_CALL
- OP_OMAPGETVALS
- OP_OMAPGETVALSBYKEYS
- OP_OMAPSETVALS
- OP_OMAPGETKEYS
- OP_GETXATTR
- OP_SETXATTR
- OP_CREATE

So simple fio/examples/rados.fio load can be run.

For RBD images (e.g. fio/examples/rbd.fio) OSD class directory should
be specified, i.e. where is your upstream Ceph built is located:
`--class_dir $CEPH/build/lib`.

The thing is that RBD images require loading of OSD object classes,
which are shared objects and loaded by the OSD on OP_CALL request.
The original Ceph OSD object classes can't be simply loaded from
Pech, because it is written on C and does not provide any C++
interfaces or Ceph common libraries, like for example bufferlist.
In order to provide missing C++ classes and C++ interface we need a
proxy library, which acts as a bridge between Pech OSD and OSD
object classes. The Pech proxy library can be found here [2], so
`make ceph_pech_proxy` should be called in order to build it. When
everything is built and `class_dir` is specified rbd.fio load should
work just fine.

Currently two replication models are supported: "primary-copy",
"chain" (see further "replication" argument description). Pech OSD
does not support any failover (at least yet), so consider current
state of replication as "childish". But Pech OSD already follows
transaction semantics and all object mutation operations are
replicated on other OSDs in one transaction request. This can be
already useful and can be tested on a real cluster.

There is also one replication mode which is called "client-based"
replication, namely client is responsible for data replication.
Pech OSD checks any of CEPH_MSG_OSD_OP request on existence of
CEPH_OSD_FLAG_DONT_REPLICATE flag, and if any - Pech OSD mutates
objects locally without attempt to replicate and replies the result
back. In this way, client should be responsible for marking request
with DONT_REPLICATE flag and doing the replication.

What is not yet ported from kernel sources?

o Crypto part is noop, thus monitors should be run with auth=none.
To make cephx work direct copy-paste of kernel crypto sources
has to be done, or a wrapper over openssl lib should be written,
see src/ceph/crypto.c interface stubs for details.

What is the Great Idea behind?

o I need easy hackable OSD in C with IO path only, without failover,
log-based replication, PG layer and all other things. I want to
test different replication strategies (client-based, primary-copy,
chain) having simplest and fastest file storage (yes, step back to
filestore) which reads and writes directly to files.

Eventually this Pech OSD can be a starting point to something
different, something which is not RADOS, which is fast, with
minimum IO ordering requirements and acts as a RAID 1 cluster,
e.g. something which is described here [1].

Make:

$ make -j8

Start new Ceph cluster with 1 OSD and then stop everything. We start
monitors on specified port, proto v1 and with -X option, i.e. auth=none.

$ CEPH_PORT=50000 MON=1 MDS=0 OSD=1 MGR=0 ../src/vstart.sh --memstore \
-n -X --msgr1
$ ../src/stop.sh

Restart only Ceph monitor(s):

$ MON=1 MDS=0 OSD=0 MGR=0 ../src/vstart.sh

Start pech-osd accessing monitor over v1 protocol:

$ OSD=0; ./pech-osd --mon_addrs ip.ip.ip.ip:50001 --server_ip=0.0.0.0 \
--name $OSD --fsid `cat ./osd$OSD/fsid` --log_level 5

For the case when RBD images are required Pech proxy should be built
and `class_dir` should be specified:

$ OSD=0; ./pech-osd --mon_addrs ip.ip.ip.ip:50001 --server_ip 0.0.0.0 \
--name $OSD --fsid `cat ./osd$OSD/fsid` \
--class_dir $CEPH/build/lib --log_level 5

For DEBUG purposes maximum output log level can be specified: --log_level 7

Replication model can be specified with "replication" argument, for example:

$ OSD=0; ./pech-osd --mon_addrs ip.ip.ip.ip:50001 --server_ip 0.0.0.0 \
--name $OSD --fsid `cat ./osd$OSD/fsid` \
--class_dir $CEPH/build/lib --log_level 5 --replication primary-copy

so "primary-copy" or "chain" are supported for now.

Beware: "client-based" replication model should be enabled on client
side, in other words a client is responsible for marking the
OP_WRITE or OP_WRITEFULL request with a special flag and then
forward it to all replicas. In order to test network
throughput on client-based replication a modified rbd kernel
module can be taken [3] and then device can be mapped with a
special option: "-o client_based_replication":

rbd device map rbd/fio_test -o nocrc -o client_based_replication

rbd tool requires a small patch which is here [2].

Have fun!

--
Roman

[1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/N46NR7NBHWBQL4B2ASU7Y2LMKZZPK3IX/
[2] https://github.com/rouming/ceph/tree/pech-osd
[3] https://github.com/rouming/linux/tree/akpm--ceph-client-based-replication

rouming/pech