This repository contains the DPDK verbs source along with some demo applications. Requirements ------------ - Linux kernel >= 3.17.8 - gcc >= 4.9 or a compiler which supports the C11 standard - libibverbs >= 1.2.0 (versions less than 1.1.8 do not support verbs extensions and version 1.1.8 has a broken extensions ABI) - librdmacm >= 1.0.21 - DPDK >= 16.07 built as shared libraries (CONFIG_RTE_BUILD_SHARED_LIB=y), and the rte_kni kernel module built for the currently running kernel - libnl-3 and libnl-route-3 - json-c - uthash (included in Fedora, Ubuntu, openSUSE, and EPEL for RHEL repositories) Setup instructions ------------------ I am assuming a default installation of Ubuntu 16.10. Other modern distributions should also work, but you will need to install DPDK yourself in order to build the required rte_kni kernel modules which most other distributions do not ship in their repositories. To build this package, RTE_SDK and RTE_TARGET need to be exported into the environment. Using the packages in the Ubuntu 16.10 repositories, you can run the following: $ source /usr/share/dpdk/dpdk-sdk-env.sh Or for a manual DPDK installation: $ export RTE_SDK=${prefix}/share/dpdk $ export RTE_TARGET=x86_64-native-linuxapp-gcc If you are pulling this from a fresh git clone, first run: $ autoreconf -i Then this follows the normal autotools-style build: $ ./configure --sysconfdir /etc $ make $ sudo make install Note that sysconfdir must match that of your libibverbs installation, in order for the verbs library to find the urdma driver. The configure script will look for your kernel source directory in the typical location by default (/lib/modules/`uname -r`/source). You can set the KERNELDIR environment variable to specify a different location; for example, if you are building against a different kernel version than what it installed locally. To run an application with this driver, the KNI and urdma modules must be loaded: $ sudo modprobe rte_kni $ sudo modprobe urdma You will need to create a file ${sysconfdir}/rdma/urdma.json that looks like this, with appropriate values substituted in: { "ports": [ { "ipv4_address": "10.2.0.100" } ], "socket": "/run/urdma.sock", "eal_args": { "log-level": 7 } } You can validate your config file against doc/urdma-schema.json using any JSON schema validator; python-jsonschema which comes with Ubuntu is one possibility. Note that this schema file is stricter than what is actually allowed at runtime; at runtime, additional properties are simply ignored but the schema file does not allow them; this is to make typos more obvious. Finally, the urdmad service must be running: $ systemctl --user start urdmad This will cause devices to appear in your "ip link" output, and cause uverbs devices to appear in /sys/class/infiniband_verbs. Or you can manually run urdmad as root; in this case you will need to run every verbs application as root to pick up the DPDK configuration which cannot be relocated. Known Issues ------------ - Shared receive queues (SRQs) are currently not supported. In order to run openmpi over urdma, you will need to specify the following command line options to prevent openmpi from using shared receive queues, and to disable a warning that it doesn't know about the device vendor ID: $ mpirun --mca btl_openib_warn_no_device_params_found 0 \ --mca btl_openib_receive_queues P,65536,256,192,128 \ ${mpi_app} ${mpi_app_args}... - If running DPDK sample applications succeeds, but running urdmad fails with "Configuration expects N devices but found only 0", the DPDK PMD drivers are probably not being loaded at runtime by default. If this is the case then you will probably need to load them by hand, by adding an argument to urdmad like: urdmad -d ${RTE_SDK}/${RTE_TARGET}/lib/libpmd_net_i40e.so.1 and adding the equivalent to urdma.json: "eal_args": { /* ... */, "d": "..." } To avoid this, you can set CONFIG_RTE_EAL_PMD_PATH to a directory like /usr/local/lib/dpdk-pmds when building DPDK, and the place the PMD .so files into that directory after DPDK is installed. DPDK will then auto-load all .so files in that directory as PMD libraries. - DPDK 17.05 introduced the concept of mempool drivers. If you see a message like this using DPDK >= 17.05: MBUF: error setting mempool handler EAL: Error - exiting with code: 1 Cause: Cannot create rx mempool for port 0 with 8064 mbufs: Invalid argument then you do not have a mempool driver by default. This occurs because the linker has been configured with --as-needed by default on some distributions, and since the mempool and PMD drivers do not declare any symbols, the linker has no way of knowing that we depend on the presence of a mempool driver. Like the PMD issue above, you can pass -d to load the default librte_mempool_ring driver on the command line, or set CONFIG_RTE_EAL_PMD_PATH and place mempool driver(s) into that directory. - There is a potential race condition with completion channels, where a completion event can get lost, and thus a thread waiting on ibv_get_cq_event() will never wake up, leading to a deadlock. A cause has not been identified, but the issue has not been reproduced with the "extra" lock added around the rte_ring operations in do_poll_cq() and finish_post_cqe(). - There is the possibility of a hang in the kernel module if the user process is killed while between the read() and write() calls on event_fd in poll_conn_state(). This is because rdma_destroy_id() in the kernel will block until the connection attempt completes, but itself prevents our event_fd from being closed which would unblock it. - The progress thread will use 100% CPU since it must busy-poll on the KNI interfaces (there is no way to sleep until the process gets an event). - urdma follows the RFC 5040 ordering rules strictly, meaning that it can place data segments out of order if it receives them out of order. This in turn means that if two RDMA WRITE requests are made on overlapping buffers, urdma may place a data segment from the first *after* the corresponding data segment from the second, thus leading to a torn write from the perspective of the application. Thus applications must not post multiple transfer requests on overlapping buffers simultaneously if they depend on the data ordering.