zrlio/disni

Problem with RDMAvsTcpBenchmark of DiSNI over SoftRoCE

faramir opened this issue · 11 comments

Hello,

After a long, hard time spent trying to compile Soft-iWARP (without success), I found the RXE module (kernel: 4.9.0-8-amd64) and Soft-RoCE, which work without any problems (sudo apt-get -t stretch-backports install rdma-core ibverbs-providers ibverbs-utils libibverbs-dev librdmacm-dev).

I've successfully compiled and installed DiSNI on the virtual machines (Debian on VirtualBox).

The basic, simple example (com.ibm.disni.examples.SendRecv*) works like a charm.
However, I have a problem with com.ibm.disni.benchmarks.RDMAvsTcpBenchmark*

Sometimes it finishes as it should, sometimes it hangs. I've added some diagnostic output to the classes; sometimes it hangs after 2000 iterations of the RDMA loop, sometimes after 1942 iterations, and so on.
Then I also changed the value being sent: the client sends the current iteration number, the server sends the negated current iteration number. The received value is not always the next expected value (e.g. 0, 1, 1, 3, 3, 4, ... or -1, -2, -4, -5, -5, -6, -7, -8, -9, ...)

I don't have a clue where to look for the cause of the problem, so I'm unable to find a solution.

Best regards,
Marek

---cut---

server-101$ java -cp disni-2.0-jar-with-dependencies.jar:disni-tests.jar com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkServer -a 192.168.56.101 -k 3000
client-102$ java -cp disni-2.0-jar-with-dependencies.jar:disni-tests.jar com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient -a 192.168.56.101 -k 3000

...

jstack prints (GC and other JVM threads omitted) on the server:
"main" #1 prio=5 os_prio=0 tid=0x00007f577800a800 nid=0x773 waiting on condition [0x00007f577fa75000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f5c061f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkServer.runRDMA(RDMAvsTcpBenchmarkServer.java:116)  # clientEndpoint.getWcEvents().take();
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkServer.launch(RDMAvsTcpBenchmarkServer.java:159)
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkServer.main(RDMAvsTcpBenchmarkServer.java:48)
"Thread-1" #10 prio=5 os_prio=0 tid=0x00007f5778189000 nid=0x784 runnable [0x00007f57575bc000]
   java.lang.Thread.State: RUNNABLE
        at com.ibm.disni.verbs.impl.NativeDispatcher._getCqEvent(Native Method)
        at com.ibm.disni.verbs.impl.RdmaVerbsNat.getCqEvent(RdmaVerbsNat.java:165)
        at com.ibm.disni.verbs.IbvCompChannel.getCqEvent(IbvCompChannel.java:77)
        at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:120)
        at java.lang.Thread.run(Thread.java:748)
"Thread-0" #9 prio=5 os_prio=0 tid=0x00007f57781f2800 nid=0x783 runnable [0x00007f575c1c8000]
   java.lang.Thread.State: RUNNABLE
        at com.ibm.disni.verbs.impl.NativeDispatcher._getCmEvent(Native Method)
        at com.ibm.disni.verbs.impl.RdmaCmNat.getCmEvent(RdmaCmNat.java:193)
        at com.ibm.disni.verbs.RdmaEventChannel.getCmEvent(RdmaEventChannel.java:75)
        at com.ibm.disni.RdmaCmProcessor.run(RdmaCmProcessor.java:68)
        at java.lang.Thread.run(Thread.java:748)

on the client:
"main" #1 prio=5 os_prio=0 tid=0x00007f274000a800 nid=0x748 waiting on condition [0x00007f2749ece000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f5c24ab0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient.runRDMA(RDMAvsTcpBenchmarkClient.java:115)  # endpoint.getWcEvents().take();
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient.launch(RDMAvsTcpBenchmarkClient.java:156)
        at com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient.main(RDMAvsTcpBenchmarkClient.java:46)
"Thread-1" #10 prio=5 os_prio=0 tid=0x00007f2740209800 nid=0x759 runnable [0x00007f26f9bf4000]
   java.lang.Thread.State: RUNNABLE
        at com.ibm.disni.verbs.impl.NativeDispatcher._getCqEvent(Native Method)
        at com.ibm.disni.verbs.impl.RdmaVerbsNat.getCqEvent(RdmaVerbsNat.java:165)
        at com.ibm.disni.verbs.IbvCompChannel.getCqEvent(IbvCompChannel.java:77)
        at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:120)
        at java.lang.Thread.run(Thread.java:748)
"Thread-0" #9 prio=5 os_prio=0 tid=0x00007f2740202000 nid=0x758 runnable [0x00007f26f9cf5000]
   java.lang.Thread.State: RUNNABLE
        at com.ibm.disni.verbs.impl.NativeDispatcher._getCmEvent(Native Method)
        at com.ibm.disni.verbs.impl.RdmaCmNat.getCmEvent(RdmaCmNat.java:193)
        at com.ibm.disni.verbs.RdmaEventChannel.getCmEvent(RdmaEventChannel.java:75)
        at com.ibm.disni.RdmaCmProcessor.run(RdmaCmProcessor.java:68)
        at java.lang.Thread.run(Thread.java:748)

@petro-rudenko Can you take a look at this?

Interesting - it reproduces in my environment. So overall the barrier is working - it sends after receiving. The issue is stale data on some iterations:

RDMA: 0 < 301
recv    0
send    0
RDMA: 1 < 301
recv    0
send    -1
RDMA: 2 < 301
recv    2
send    -2

Checking.

@faramir thanks, can you please call postRecv.execute() after endpoint.getWcEvents().take();, like in PR #38? With that I got consistent results.

Thank you @petro-rudenko . Results are consistent now.

However, I'm still facing a problem with the occasional deadlock. E.g. the server sends -1195 and waits for the client to receive the message, while the client sends 1196 and waits for the server to receive it... Are you able to reproduce this issue as well?

        /* Server */
        for (int i = 0; i < loopCount + 1; i++) {
            // Recv PING
            clientEndpoint.getWcEvents().take();
            postRecv.execute();
            System.out.println("recv\t" + recvBuf.asIntBuffer().get(0));
            recvBuf.clear();

            //Send PONG
            sendBuf.asIntBuffer().put(0, -i);
            System.out.println("send\t" + sendBuf.asIntBuffer().get(0));
            postSend.execute();
            clientEndpoint.getWcEvents().take();
            sendBuf.clear();
        }
        /* Client */
        for (int i = 0; i < loopCount + 1; i++) {
            // Send PING
            sendBuf.asIntBuffer().put(0, i);
            sendBuf.clear();
            System.out.println("send\t" + sendBuf.asIntBuffer().get(0));
            postSend.execute();
            endpoint.getWcEvents().take();

            // Recv PONG
            endpoint.getWcEvents().take();
            postRecv.execute();
            System.out.println("recv\t" + recvBuf.asIntBuffer().get(0));
            recvBuf.clear();
        }
server$ java -cp disni-2.0-jar-with-dependencies.jar:disni-tests.jar com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkServer -a 192.168.56.101 -k 3000
client$ java -cp disni-2.0-jar-with-dependencies.jar:disni-tests.jar com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient -a 192.168.56.101 -k 3000

Example output files attached.
client-102.txt
server-101.txt

I haven't looked at the RdmavsTcpBenchmark in too much detail, but a quick glance made me wonder if there is not a bug in the receive posting. I think the receive at the client should always be posted prior to sending a request. Moving the receive posting to just before reaping the event is not enough; the response might already have come back.
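That race can be illustrated with a toy model (plain Java, not DiSNI - the ArrayBlockingQueue merely stands in for a queue pair's posted receive buffers, and all names here are mine): if the receiver posts its buffer only after consuming a completion, the first send finds no receive posted, whereas pre-posting avoids the loss.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Toy model: an incoming send consumes a pre-posted receive buffer.
// If none is posted, the message is lost (on real hardware this is a
// receiver-not-ready error; over rxe it can surface as a silent stall).
public class RecvPostingDemo {
    static int runTransfer(int messages, boolean prePost) {
        ArrayBlockingQueue<Integer> postedRecvs = new ArrayBlockingQueue<>(messages);
        if (prePost) {
            for (int i = 0; i < messages; i++) {
                postedRecvs.offer(i); // all receives posted before any send arrives
            }
        }
        int lost = 0;
        for (int i = 0; i < messages; i++) {
            if (postedRecvs.poll() == null) {
                lost++; // a send arrived with no receive posted
            }
            if (!prePost) {
                postedRecvs.offer(i); // posting only after the completion is too late
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        System.out.println("late posting, lost: " + runTransfer(5, false)); // 1
        System.out.println("pre-posting, lost:  " + runTransfer(5, true));  // 0
    }
}
```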

What I have just seen (in the output files client-102.txt and server-101.txt): on the server, just before the freeze, it "received" the same value twice:

RDMA: 1193 < 3001
recv    1193
send    -1193
RDMA: 1194 < 3001
recv    1194
send    -1194
RDMA: 1195 < 3001
recv    1194
send    -1195

while on the client it sends the right values:

RDMA: 1193 < 3001
send    1193
recv    -1193
RDMA: 1194 < 3001
send    1194
recv    -1194
RDMA: 1195 < 3001
send    1195
recv    -1194
RDMA: 1196 < 3001
send    1196

One interesting recording of this problem is available here: https://youtu.be/QSt9wZqkloo (after a while, just before freezing, the client receives the next notification and goes one step forward).

For me it doesn't reproduce, but yes, try as @patrickstuedi suggested. Client:

for (int i = 0; i < loopCount + 1; i++) {
    System.err.println("RDMA: " + i + " < " + (loopCount + 1));

    // Post the receive before sending the request
    postRecv.execute();

    // Send PING
    sendBuf.asIntBuffer().put(0, i);
    sendBuf.clear();
    postSend.execute();
    System.out.println("send\t" + sendBuf.asIntBuffer().get(0));
    wc = endpoint.getWcEvents().take();

    // Recv PONG
    wc = endpoint.getWcEvents().take();
    System.out.println("recv\t" + recvBuf.asIntBuffer().get(0));
    recvBuf.clear();
}

I tried it. Unfortunately, the freezing is still there.

I decided to do a simpler test - the client only sends messages, and the server only receives them. That test also leads to a deadlock.

/* Client */
for (int i = 0; i < loopCount + 1; i++) {
    System.err.println("RDMA: " + i + " < " + (loopCount + 1));
    // Send PING
    sendBuf.asIntBuffer().put(0, i);
    sendBuf.clear();
    System.out.println("send\t" + sendBuf.asIntBuffer().get(0));
    postSend.execute();
    endpoint.getWcEvents().take();
}

/* Server */
for (int i = 0; i < loopCount + 1; i++) {
    System.err.println("RDMA: " + i + " < " + (loopCount + 1));
    // Recv PING
    postRecv.execute();
    clientEndpoint.getWcEvents().take();
    System.out.println("recv\t" + recvBuf.asIntBuffer().get(0));
    recvBuf.clear();
}

I've tried invoking clientEndpoint.postRecv(clientEndpoint.getWrList_recv()) instead of using the postRecv variable every time in the loop, and invoking free() after execution, but none of those attempts help - the freeze is still there.

@faramir I just watched this talk about rxe / Soft-RoCE. They had issues similar to yours, where the state machine in rxe could simply get stuck. Long story short:

  • Try Soft-RoCE with the latest kernel.
  • If you're using virtualization, make sure you have configured KVM / openvswitch (VirtualBox may not work).
  • Try to compile this kernel module: https://github.com/larrystevenwise/krping - and run its ping with different message sizes.
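For reference, on a recent kernel the soft device can be set up with iproute2's rdma tool (a sketch only; the device name rxe0 and interface eth0 are examples, and older setups used rxe_cfg from rdma-core instead):

```shell
sudo modprobe rdma_rxe                        # load the Soft-RoCE driver
sudo rdma link add rxe0 type rxe netdev eth0  # bind a soft device to the NIC
rdma link show                                # rxe0 should appear as ACTIVE
ibv_devices                                   # and be visible to libibverbs
```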

Is there any size limit?

When I run RDMAvsTcpBenchmarkClient -a IP -k 3000 -s 16384, I get:
RDMA result:
Total time: 1363.831266 ms
Bidirectional bandwidth: 0.06712907722339899 Gb/s
Bidirectional average latency: 0.454610422 ms
TCP result:
Total time: 1800.555879 ms
Bidirectional bandwidth: 0.05084692757541461 Gb/s
Bidirectional average latency: 0.600185293 ms
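As a sanity check on those numbers (my own arithmetic, not benchmark code): the reported bidirectional average latency is simply the total time divided by the iteration count.

```java
import java.util.Locale;

// Verifies that the benchmark's reported average latency equals
// total time / iteration count for the run quoted above.
public class LatencyCheck {
    public static void main(String[] args) {
        double totalMs = 1363.831266; // RDMA "Total time" reported above
        int iterations = 3000;        // -k 3000
        System.out.printf(Locale.ROOT, "average latency: %.9f ms%n", totalMs / iterations);
        // prints: average latency: 0.454610422 ms
    }
}
```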

When I run RDMAvsTcpBenchmarkClient -a IP -k 3000 -s 32768, it blocks at RDMAvsTcpBenchmarkClient::client channel set up.

RDMAvsTcpBenchmarkServer -a IP -k 3000 -s 32768 also blocks at RDMAvsTcpBenchmarkClient::client channel set up.

PS: Linux 5.0.2, Soft-RoCE