maidsafe/qp2p

Seeing dropped packets. includes test case.

Closed this issue · 2 comments

We are seeing a problem in brb_node_qp2p where it sometimes drops a packet and BRB process halts. (code here)

I was able to reproduce this with a simple test case which you can run. Just clone and build qp2p_experiments repo.

I am unsure whether the problem exists in qp2p lib or in our usage of it.

You will see in the test case, that we are using send_uni, and are calling connect_to() and new_endpoint() each time we want to send a packet to our peer.

We listen for connections and packets like so:

    while let Some(mut msgs) = conns.next().await {
        println!("Received a connection from {}", msgs.remote_addr());
        while let Some(msg) = msgs.next().await {
            ...
            println!("Got message: {}", msg_str);
        }
        println!("done with msgs.");
    }

After a random-ish number of packets exchanged (always under 5000 so far), this loop fails. In particular, msgs.next() returns None immediately after a connection is received, so the interior of inner while loop never executes.

It seems as though a connection is made and then immediately closed. but.... why?

So I guess the question would be: is this correct behavior of qp2p, and if so, how do we code this loop better?

Here is an example run/output:

Node A

$ cargo run --bin node_uni create
Listening for messages on 127.0.0.1:10000
...
Received a connection from 127.0.0.1:48286
Got message: 343
Sent message: 344 to: 127.0.0.1:10001
done with msgs.  cnt = 344
Received a connection from 127.0.0.1:38339
Got message: 345
Sent message: 346 to: 127.0.0.1:10001
done with msgs.  cnt = 346
Received a connection from 127.0.0.1:43818
Got message: 347
Sent message: 348 to: 127.0.0.1:10001
done with msgs.  cnt = 348
Received a connection from 127.0.0.1:51354
done with msgs.  cnt = 348

Node B

$ cargo run --bin node_uni
...
Received a connection from 127.0.0.1:54098
Got message: 344
Sent message: 345 to: 127.0.0.1:10000
done with msgs.  cnt = 345
Received a connection from 127.0.0.1:32864
Got message: 346
Sent message: 347 to: 127.0.0.1:10000
done with msgs.  cnt = 347
Received a connection from 127.0.0.1:55820
Got message: 348
Sent message: 349 to: 127.0.0.1:10000
done with msgs.  cnt = 349

Notice that on both nodes, the normal pattern is: Received, Got, Sent, Done. However for Node A, the final sequence before it halts is: Received, Done.

btw. running both nodes with RUST_LOG="qp2p=warn,quinn=warn", there are no warnings or errors logged. Running with RUST_LOG="qp2p=trace,quinn=trace" dumps a ton of trace info but no smoking gun for me.

Initially, I was not able to reproduce this on my Linux machine. However, I could reproduce it on Windows.
The issue was that after sending the message the connection was being dropped before the the receiver reads the message. So at the receiver end the peer got ApplicationClosed instead of the message bytes.

The change in dan-da/qp2p_experiments#1 fixes this issue.

This was a bug in the use of the library and not the library itself, so closing this.