/urdma

Verbs on DPDK

This repository contains the DPDK verbs source along with some demo
applications.

Requirements
------------

 - Linux kernel >= 3.17.8
 - gcc >= 4.9 or a compiler which supports the C11 standard
 - libibverbs >= 1.2.0 (versions less than 1.1.8 do not support verbs extensions
   and version 1.1.8 has a broken extensions ABI)
 - librdmacm >= 1.0.21
 - DPDK >= 16.07 built as shared libraries (CONFIG_RTE_BUILD_SHARED_LIB=y),
   and the rte_kni kernel module built for the currently running kernel
 - libnl-3 and libnl-route-3
 - json-c
 - uthash (included in Fedora, Ubuntu, openSUSE, and EPEL for RHEL
   repositories)

Setup instructions
------------------

I am assuming a default installation of Ubuntu 16.10.  Other modern
distributions should also work, but you will need to install DPDK
yourself in order to build the required rte_kni kernel modules which
most other distributions do not ship in their repositories.

To build this package, RTE_SDK and RTE_TARGET need to be exported into
the environment.  Using the packages in the Ubuntu 16.10 repositories,
you can run the following:

    $ source /usr/share/dpdk/dpdk-sdk-env.sh

Or for a manual DPDK installation:

    $ export RTE_SDK=${prefix}/share/dpdk
    $ export RTE_TARGET=x86_64-native-linuxapp-gcc

If you are pulling this from a fresh git clone, first run:

    $ autoreconf -i

The rest is a normal autotools-style build:

    $ ./configure --sysconfdir /etc
    $ make
    $ sudo make install

Note that sysconfdir must match that of your libibverbs installation, in
order for the verbs library to find the urdma driver.

The configure script will look for your kernel source directory in the
typical location by default (/lib/modules/`uname -r`/source).  You can
set the KERNELDIR environment variable to specify a different location,
for example if you are building against a different kernel version than
the one that is installed locally.

To run an application with this driver, the KNI and urdma modules must be loaded:

    $ sudo modprobe rte_kni
    $ sudo modprobe urdma

You will need to create a file ${sysconfdir}/rdma/urdma.json that looks
like this, with appropriate values substituted in:

    { "ports": [ { "ipv4_address": "10.2.0.100" } ],
      "socket": "/run/urdma.sock",
      "eal_args": { "log-level": 7 }
    }

You can validate your config file against doc/urdma-schema.json using
any JSON schema validator; python-jsonschema, which comes with Ubuntu, is
one possibility.  Note that the schema file is deliberately stricter than
what is allowed at runtime: at runtime, additional properties are simply
ignored, but the schema file rejects them so that typos are more
obvious.

Finally, the urdmad service must be running:

    $ systemctl --user start urdmad

This will cause devices to appear in your "ip link" output, and cause
uverbs devices to appear in /sys/class/infiniband_verbs.

Alternatively, you can run urdmad manually as root.  In that case every
verbs application must also be run as root in order to pick up the DPDK
configuration, which cannot be relocated.
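
As a quick check that the uverbs devices mentioned above actually showed
up, a small C helper can count the matching entries under
/sys/class/infiniband_verbs.  This is only a sketch: `is_uverbs_name` and
`count_uverbs_devices` are hypothetical names that are not part of urdma,
and the `uverbs<N>` entry naming is an assumption based on how that
directory is usually populated.

```c
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if name looks like "uverbs<digits>",
 * 0 otherwise. */
int is_uverbs_name(const char *name)
{
    static const char prefix[] = "uverbs";
    size_t plen = sizeof(prefix) - 1;
    const char *p;

    /* Must start with the prefix and have at least one character after it. */
    if (strncmp(name, prefix, plen) != 0 || name[plen] == '\0')
        return 0;
    for (p = name + plen; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return 0;
    }
    return 1;
}

/* Hypothetical helper: counts uverbs entries in dir.
 * Returns -1 if the directory cannot be opened. */
int count_uverbs_devices(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *ent;
    int count = 0;

    if (d == NULL)
        return -1;
    while ((ent = readdir(d)) != NULL) {
        if (is_uverbs_name(ent->d_name))
            count++;
    }
    closedir(d);
    return count;
}
```

Calling `count_uverbs_devices("/sys/class/infiniband_verbs")` after
starting urdmad should return a positive count if the devices appeared;
a return value of -1 means the directory could not be opened at all.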

Known Issues
------------

 - Shared receive queues (SRQs) are currently not supported.  In order
   to run openmpi over urdma, you will need to specify the following
   command line options to prevent openmpi from using shared receive
   queues, and to disable a warning that it doesn't know about the
   device vendor ID:

       $ mpirun --mca btl_openib_warn_no_device_params_found 0 \
                --mca btl_openib_receive_queues P,65536,256,192,128 \
                ${mpi_app} ${mpi_app_args}...

 - If running DPDK sample applications succeeds, but running urdmad
   fails with "Configuration expects N devices but found only 0", the
   DPDK PMD drivers are probably not being loaded at runtime by default.
   If this is the case then you will probably need to load them by hand,
   by adding an argument to urdmad like:

     urdmad -d ${RTE_SDK}/${RTE_TARGET}/lib/libpmd_net_i40e.so.1

   and adding the equivalent to urdma.json:

     "eal_args": { /* ... */, "d": "..." }

   To avoid this, you can set CONFIG_RTE_EAL_PMD_PATH to a directory
   like /usr/local/lib/dpdk-pmds when building DPDK, and then place the
   PMD .so files into that directory after DPDK is installed.  DPDK will
   then auto-load all .so files in that directory as PMD libraries.

 - DPDK 17.05 introduced the concept of mempool drivers. If you see a
   message like this using DPDK >= 17.05:

     MBUF: error setting mempool handler
     EAL: Error - exiting with code: 1
     Cause: Cannot create rx mempool for port 0 with 8064 mbufs: Invalid argument

   then no mempool driver is being loaded by default.  This occurs
   because the linker has been configured with --as-needed by default on
   some distributions, and since the mempool and PMD drivers do not
   declare any symbols, the linker has no way of knowing that we depend
   on the presence of a mempool driver.

   Like the PMD issue above, you can pass -d to load the default
   librte_mempool_ring driver on the command line, or set
   CONFIG_RTE_EAL_PMD_PATH and place mempool driver(s) into that
   directory.

 - There is a potential race condition with completion channels, where a
   completion event can get lost, and thus a thread waiting on
   ibv_get_cq_event() will never wake up, leading to a deadlock.  The cause
   has not been identified, but the issue has not been reproduced with the
   "extra" lock in place around the rte_ring operations in do_poll_cq() and
   finish_post_cqe().

 - The kernel module can hang if the user process is killed between the
   read() and write() calls on event_fd in poll_conn_state().  This is
   because rdma_destroy_id() in the kernel blocks until the connection
   attempt completes, but while it is blocked it also prevents our event_fd
   from being closed, and closing event_fd is what would unblock it.

 - The progress thread will use 100% CPU since it must busy-poll on the KNI
   interfaces (there is no way to sleep until the process gets an event).

 - urdma follows the RFC 5040 ordering rules strictly, meaning that it
   can place data segments out of order if it receives them out of
   order. This in turn means that if two RDMA WRITE requests are made
   on overlapping buffers, urdma may place a data segment from the first
   *after* the corresponding data segment from the second, thus leading
   to a torn write from the perspective of the application. Thus
   applications must not post multiple transfer requests on overlapping
   buffers simultaneously if they depend on the data ordering.
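
One way for an application to respect the last restriction above is to
check whether the target of a new RDMA WRITE overlaps the target of any
request that is still outstanding before posting it.  The following is a
minimal sketch of such a check.  `ranges_overlap` is a hypothetical
helper, not part of urdma or libibverbs, and it assumes both ranges refer
to addresses in the same remote memory region.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if [a, a + alen) and [b, b + blen)
 * share at least one byte, 0 otherwise.  Zero-length ranges never
 * overlap anything. */
int ranges_overlap(uint64_t a, size_t alen, uint64_t b, size_t blen)
{
    if (alen == 0 || blen == 0)
        return 0;
    /* Compare distances instead of end addresses so that a + alen
     * cannot wrap around. */
    if (a <= b)
        return b - a < alen;
    return a - b < blen;
}
```

An application could run this against the remote address and length of
each outstanding work request (the `wr.rdma.remote_addr` field of
`struct ibv_send_wr` plus the total scatter/gather length), and delay
the new post until the conflicting request has completed.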