Kernel Mode RDMA Ping Module
Steve Wise - 8/2009
--- Updated 8/2016 ---

============
Introduction
============

The krping module is a loadable kernel module that uses the OpenFabrics
verbs to implement a client/server ping/pong program. The module was
implemented as a test vehicle for working with the iWARP branch of the
OFA project.

The goals of this program include:

- Simple harness to test kernel-mode verbs: connection setup, send,
  recv, rdma read, rdma write, and completion notifications.

- Client/server model.

- IP addressing used to identify remote peer.

- Transport independent, utilizing the RDMA CMA service.

- No user-space application needed.

- Just a test utility...nothing more.

This module allows establishing connections and running ping/pong tests
via a /proc entry called /proc/krping. This simple mechanism allows
starting many kernel threads concurrently and avoids the need for a user
space application.

The krping module is designed to utilize all the major DTO operations:
send, recv, rdma read, and rdma write. Its goal is to test the API, and
as such it is not necessarily an efficient test. Once the connection is
established, the client and server begin a ping/pong loop:

Client                                  Server
---------------------------------------------------------------------
SEND(ping source buffer rkey/addr/len)

                                        RECV Completion with ping source info
                                        RDMA READ from client source MR
                                        RDMA Read completion
                                        SEND "go ahead" to client

RECV Completion of "go ahead"
SEND (ping sink buffer rkey/addr/len)

                                        RECV Completion with ping sink info
                                        RDMA Write to client sink MR
                                        RDMA Write completion
                                        SEND "go ahead" to client

RECV Completion of "go ahead"
Validate data in source and sink buffers

<repeat the above loop>

============
To build/install the krping module
============

# git clone git://git.openfabrics.org/~swise/krping
# cd krping
<edit Makefile and set KSRC accordingly>
# make && make install
# modprobe rdma_krping

============
Using Krping
============

Communication from user space is done via the /proc filesystem.
Krping exports the file /proc/krping. Writing commands in ASCII format to
/proc/krping will start krping threads in the kernel. The thread issuing
the write to /proc/krping is used to run the krping test, so it will
block until the test completes, or until the user interrupts the write.

Here is a simple example to start an rping test using the rdma_krping
module. The server's address is 192.168.69.127. The client will
connect to this address at port 9999 and issue 100 ping/pong messages.
This example assumes you have two systems connected via IB and the
IPoverIB devices are configured on the 192.168.69.0/24 subnet accordingly.

Server:

# modprobe rdma_krping
# echo "server,addr=192.168.69.127,port=9999" >/proc/krping

The echo command above will block until the krping test completes,
or the user hits ctrl-c.

On the client:

# modprobe rdma_krping
# echo "client,addr=192.168.69.127,port=9999,count=100" >/proc/krping

Just like on the server, the echo command above will block until the
krping test completes, or the user hits ctrl-c.

The syntax for krping commands is a string of options separated by
commas. Options can be single keywords, or in the form: option=operand.

Operands can be integers or strings.

Note you must specify the _same_ options on both sides. For instance,
if you want to use the server_inv option, then you must specify it on
both the server and client command lines.

Opcode          Operand Type    Description
------------------------------------------------------------------------
client          none            Initiate a client side krping thread.
server          none            Initiate a server side krping thread.
addr            string          The server's IP address in dotted
                                decimal format. Note the server can
                                use 0.0.0.0 to bind to all devices.
port            integer         The server's port number in host byte
                                order.
count           integer         The number of rping iterations to
                                perform before shutting down the test.
                                If unspecified, the count is infinite.
size            integer         The size of the rping data. Default for
                                rping is 65 bytes.
verbose         none            Enables printk()s that dump the rping
                                data. Use with caution!
validate        none            Enables validating the rping data on
                                each iteration to detect data
                                corruption.
mem_mode        string          Determines how memory will be
                                registered. Modes include dma and reg.
                                Default is dma.
server_inv      none            Valid only in reg mr mode, this option
                                enables invalidating the client's reg mr
                                via SEND_WITH_INVALIDATE messages from
                                the server.
local_dma_lkey  none            Use the local dma lkey for the source
                                of writes and sends, and in recvs.
read_inv        none            Server will use READ_WITH_INV. Only
                                valid in reg mem_mode.

============
Memory Usage:
============

The krping client uses 4 memory areas:

start_buf - the source of the ping data. This buffer is advertised to
the server at the start of each iteration, and the server rdma reads
the ping data from this buffer over the wire.

rdma_buf - the sink of the ping data. This buffer is advertised to the
server each iteration, and the server rdma writes the ping data that it
read from the start buffer into this buffer. The start_buf and rdma_buf
contents are then compared if the krping validate option is specified.

recv_buf - used to recv "go ahead" SEND from the server.

send_buf - used to advertise the rdma buffers to the server via SEND
messages.

The krping server uses 3 memory areas:

rdma_buf - used as the sink of the RDMA READ to pull the ping data
from the client, and then used as the source of an RDMA WRITE to
push the ping data back to the client.

recv_buf - used to receive rdma rkey/addr/length advertisements from
the client.

send_buf - used to send "go ahead" SEND messages to the client.
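
Because the start_buf/rdma_buf comparison above only runs when validate
is given, and options such as validate have to be specified on both
sides, it can help to build both command strings from one shared option
list. The following is a minimal sketch, not part of krping: the
function name krping_cmd and the variable SHARED_OPTS are hypothetical,
and the helper only prints the comma-separated command string described
under "Using Krping". The address, port, and count are the ones from the
example in that section.

```shell
# Hypothetical helper, not part of krping: build a krping command string.
# Usage: krping_cmd ROLE ADDR PORT [OPTION ...]
krping_cmd() {
    if [ "$#" -lt 3 ]; then
        echo "usage: krping_cmd client|server ADDR PORT [OPTION ...]" >&2
        return 1
    fi
    role=$1
    addr=$2
    port=$3
    shift 3
    case "$role" in
        client|server) ;;
        *) echo "krping_cmd: role must be client or server" >&2
           return 1 ;;
    esac
    cmd="$role,addr=$addr,port=$port"
    # Append any remaining options, comma-separated, in the order given.
    for opt in "$@"; do
        cmd="$cmd,$opt"
    done
    printf '%s\n' "$cmd"
}

# Options that must match on both sides are kept in one place.
# The variable is left unquoted below so it splits into separate words.
SHARED_OPTS="validate"

# Print the two command strings; nothing is written to /proc/krping here.
krping_cmd server 192.168.69.127 9999 $SHARED_OPTS
krping_cmd client 192.168.69.127 9999 $SHARED_OPTS count=100
```

To actually start a test, the output would be redirected to /proc/krping
on each host, for example
"krping_cmd server 192.168.69.127 9999 validate >/proc/krping". As with
the echo commands shown earlier, that write blocks until the test
completes or the user interrupts it.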

============
Memory Registration Modes:
============

Each of these memory areas is registered with the RDMA device using
whatever memory mode was specified in the command line. The mem_mode
values include: dma and reg (aka fastreg). The default mode, if not
specified, is dma.

The dma mem_mode uses a single dma_mr for all memory buffers.

The reg mem_mode uses a reg mr on the client side for the
start_buf and rdma_buf buffers. Each time the client advertises
one of these buffers, it invalidates the previous registration and fast
registers the new buffer with a new key. If the server_inv
option is on, then the server will do the invalidation via the "go ahead"
messages using the IB_WR_SEND_WITH_INV opcode. Otherwise the client
invalidates the mr using the IB_WR_LOCAL_INV work request.

On the server side, reg mem_mode causes the server to use the
reg_mr rkey for its rdma_buf buffer IO. Before each rdma read and
rdma write, the server will post an IB_WR_LOCAL_INV + IB_WR_REG_MR
WR chain to register the buffer with a new key. If the krping read_inv
option is set, then the server will use IB_WR_RDMA_READ_WITH_INV to do
the rdma read and skip the IB_WR_LOCAL_INV wr before re-registering the
buffer for the subsequent rdma write operation.

============
Stats
============

While krping threads are executing, you can obtain statistics on the
thread by reading from the /proc/krping file. If you cat /proc/krping,
you will dump IO statistics for each running krping thread. The format
is one thread per line, and each line contains the following stats
separated by whitespace:

Statistic               Description
---------------------------------------------------------------------
Name                    krping thread number and device being used.
Send Bytes              Number of bytes transferred in SEND WRs.
Send Messages           Number of SEND WRs posted.
Recv Bytes              Number of bytes received via RECV completions.
Recv Messages           Number of RECV WRs completed.
RDMA WRITE Bytes        Number of bytes transferred in RDMA WRITE WRs.
RDMA WRITE Messages     Number of RDMA WRITE WRs posted.
RDMA READ Bytes         Number of bytes transferred via RDMA READ WRs.
RDMA READ Messages      Number of RDMA READ WRs posted.

Here is an example of the server side output for 5 krping threads:

# cat /proc/krping
1-amso0 0 0 16 1 12583960576 192016 0 0
2-mthca0 0 0 16 1 60108570624 917184 0 0
3-mthca0 0 0 16 1 59106131968 901888 0 0
4-mthca1 0 0 16 1 101658394624 1551184 0 0
5-mthca1 0 0 16 1 100201922560 1528960 0 0
#

============
EXPERIMENTAL
============

There are other options that enable micro benchmarks to measure
the kernel rdma performance. These include:

Opcode          Operand Type    Description
------------------------------------------------------------------------
wlat            none            Write latency test
rlat            none            Read latency test
poll            none            Enable polling vs blocking for rlat
bw              none            Write throughput test
duplex          none            Valid only with bw, this enables
                                bidirectional mode
tx-depth        none            Set the sq depth for bw tests

See the awkit* files to take the data logged in the kernel log
and compute RTT/2 or Gbps results.

Use these at your own risk.

END-OF-FILE