/pech

Just Pech or Ceph OSD in C based on linux kernel sources compiled in userspace

Primary LanguageC

I found more comfortable to hack Ceph, analysing protocol
implementation, monitor and OSD client code reading linux kernel
sources, instead of legacy OSD C++ code or (god forbid!)  Crimson
project.

Here, in the sources, the level of craziness is above all permissible
norms: I took (almost) all Ceph sources from net/ceph/ kernel path and
build them in userspace as a simplest OSD possible which has a perfect
name `Pech` (Germans speakers will understand).

I do not use threads, I use cooperative scheduling and jump from task
contexts using setjmp()/longjmp() calls. This model perfectly fits UP
kernel with disabled preemption, thus adapted workqueue.c and timer.c
code runs the event loop.

So again, no atomic operations, no locks, everything is one thread.
In future number of event loops can be equal to a number of physical
CPUs, where each event loop is executed from a dedicated pthread
context and pinned to a particular CPU.

What Pech OSD does?

  o Connects to monitors and "boots" OSD, i.e. marks it as UP.
  o On Ctrl+C marks OSD on monitors as DOWN and gracefully exits.
  o Requests supported by Pech OSD:

     Replication requets between OSDs:

     - OSD_REPOP
     - OSD_REPOPREPLY

     OSD operations supported in memory:

     - OP_WRITE
     - OP_WRITEFULL
     - OP_READ
     - OP_SYNC_READ
     - OP_STAT
     - OP_CALL
     - OP_OMAPGETVALS
     - OP_OMAPGETVALSBYKEYS
     - OP_OMAPSETVALS
     - OP_OMAPGETKEYS
     - OP_GETXATTR
     - OP_SETXATTR
     - OP_CREATE

  So simple fio/examples/rados.fio load can be run.

  For RBD images (e.g. fio/examples/rbd.fio) OSD class directory should
  be specified, i.e. where is your upstream Ceph built is located:
  `--class_dir $CEPH/build/lib`.

  The thing is that RBD images require loading of OSD object classes,
  which are shared objects and loaded by the OSD on OP_CALL request.
  The original Ceph OSD object classes can't be simply loaded from
  Pech, because it is written on C and does not provide any C++
  interfaces or Ceph common libraries, like for example bufferlist.
  In order to provide missing C++ classes and C++ interface we need a
  proxy library, which acts as a bridge between Pech OSD and OSD
  object classes.  The Pech proxy library can be found here [2], so
  `make ceph_pech_proxy` should be called in order to build it. When
  everything is built and `class_dir` is specified rbd.fio load should
  work just fine.

  Currently two replication models are supported: "primary-copy",
  "chain" (see further "replication" argument description). Pech OSD
  does not support any failover (at least yet), so consider current
  state of replication as "childish". But Pech OSD already follows
  transaction semantics and all object mutation operations are
  replicated on other OSDs in one transaction request. This can be
  already useful and can be tested on a real cluster.

  There is also one replication mode which is called "client-based"
  replication, namely client is responsible for data replication.
  Pech OSD checks any of CEPH_MSG_OSD_OP request on existence of
  CEPH_OSD_FLAG_DONT_REPLICATE flag, and if any - Pech OSD mutates
  objects locally without attempt to replicate and replies the result
  back. In this way, client should be responsible for marking request
  with DONT_REPLICATE flag and doing the replication.

What is not yet ported from kernel sources?

  o Crypto part is noop, thus monitors should be run with auth=none.
    To make cephx work direct copy-paste of kernel crypto sources
    has to be done, or a wrapper over openssl lib should be written,
    see src/ceph/crypto.c interface stubs for details.

What is the Great Idea behind?

  o I need easy hackable OSD in C with IO path only, without failover,
    log-based replication, PG layer and all other things. I want to
    test different replication strategies (client-based, primary-copy,
    chain) having simplest and fastest file storage (yes, step back to
    filestore) which reads and writes directly to files.

    Eventually this Pech OSD can be a starting point to something
    different, something which is not RADOS, which is fast, with
    minimum IO ordering requirements and acts as a RAID 1 cluster,
    e.g. something which is described here [1].

Make:

  $ make -j8

Start new Ceph cluster with 1 OSD and then stop everything.  We start
monitors on specified port, proto v1 and with -X option, i.e. auth=none.

  $ CEPH_PORT=50000 MON=1 MDS=0 OSD=1 MGR=0 ../src/vstart.sh --memstore \
    -n -X --msgr1
  $ ../src/stop.sh

Restart only Ceph monitor(s):

  $ MON=1 MDS=0 OSD=0 MGR=0 ../src/vstart.sh

Start pech-osd accessing monitor over v1 protocol:

  $ OSD=0; ./pech-osd --mon_addrs ip.ip.ip.ip:50001 --server_ip=0.0.0.0 \
                --name $OSD --fsid `cat ./osd$OSD/fsid` --log_level 5

For the case when RBD images are required Pech proxy should be built
and `class_dir` should be specified:

  $ OSD=0; ./pech-osd --mon_addrs ip.ip.ip.ip:50001 --server_ip 0.0.0.0 \
                --name $OSD --fsid `cat ./osd$OSD/fsid` \
                --class_dir $CEPH/build/lib --log_level 5

For DEBUG purposes maximum output log level can be specified: --log_level 7

Replication model can be specified with "replication" argument, for example:

  $ OSD=0; ./pech-osd --mon_addrs ip.ip.ip.ip:50001 --server_ip 0.0.0.0 \
                --name $OSD --fsid `cat ./osd$OSD/fsid` \
                --class_dir $CEPH/build/lib --log_level 5 --replication primary-copy

so "primary-copy" or "chain" are supported for now.

Beware: "client-based" replication model should be enabled on client
        side, in other words a client is responsible for marking the
        OP_WRITE or OP_WRITEFULL request with a special flag and then
        forward it to all replicas. In order to test network
        throughput on client-based replication a modified rbd kernel
        module can be taken [3] and then device can be mapped with a
        special option: "-o client_based_replication":

           rbd device map rbd/fio_test -o nocrc -o client_based_replication

        rbd tool requires a small patch which is here [2].


Have fun!

--
Roman

[1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/N46NR7NBHWBQL4B2ASU7Y2LMKZZPK3IX/
[2] https://github.com/rouming/ceph/tree/pech-osd
[3] https://github.com/rouming/linux/tree/akpm--ceph-client-based-replication