ikatson/rqbit

Unstable test tests::e2e::test_e2e_download

Closed this issue · 8 comments

This test seems to be unstable in some runs it stucks for minutes, then ended with this error:

thread 'tests::e2e::test_e2e_download' panicked at crates/librqbit/src/tests/e2e.rs:220:27:
called `Result::unwrap()` on an `Err` value: Elapsed(())

so it looks on timeout to finish torrent download ...

When running with logging enabled, looks like it stucks early in download:

2024-08-18T10:33:12.915377Z  INFO stats_printer: librqbit::tests::e2e: progress_percent="0.10%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 0, seen: 1, dead: 1, not_needed: 0, steals: 0 }"

Below is log from one failed test (failure rate seems to be something like 30% or so, so it was relatively easy to replicate)
test_log.txt.zip

The problem was in listen ports racing with each other and no "good" server being able to listen. Some servers are intentionally "bad" for chaos / recovery testing. As visible from your log there was only 1 out of 128 servers that was able to spin up

Hmm, no it still bugged out on continuous testing, looking again

Merged another fix in #196, let's monitor

Tried a few more times, it's working much better, however there's still a rare issue where it couldn't download to the end, but it wasn't exposed previously. Probably related to #149

I've tested latest main, problem still appears, but less often, say around 10% of runs, behavior, while previously here were only 1 live connection, now there is plenty of connected peers, but they disconnect before torrent download is finished, see this extract from log:

2024-08-18T19:26:43.770000+0200  INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 1, live: 27, seen: 128, dead: 100, not_needed: 0, steals: 144 }"                                    ┤
│2024-08-18T19:26:43.877000+0200  INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 1, connecting: 0, live: 27, seen: 128, dead: 100, not_needed: 0, steals: 144 }"                                    ┤
│2024-08-18T19:26:43.957000+0200  INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 18, seen: 128, dead: 110, not_needed: 0, steals: 145 }"                                    ┤
│2024-08-18T19:26:44.081000+0200  INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 15, seen: 128, dead: 113, not_needed: 0, steals: 146 }"                                    ┤
│2024-08-18T19:26:44.159000+0200  INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 11, seen: 128, dead: 117, not_needed: 0, steals: 147 }"                                    ┤
│2024-08-18T19:26:44.264000+0200  INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 12, seen: 128, dead: 116, not_needed: 0, steals: 147 }"                                    ┤
│2024-08-18T19:26:44.350000+0200  INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 4, seen: 128, dead: 124, not_needed: 0, steals: 148 }"                                     ┤
│2024-08-18T19:26:44.449000+0200  INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 3, seen: 128, dead: 125, not_needed: 0, steals: 148 }"                                     ┤
│2024-08-18T19:26:44.548000+0200  INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 0, seen: 128, dead: 128, not_needed: 0, steals: 148 }" 

Here is full log:

test_log2.txt.zip

Here is a script to replicate test failure:

#!/bin/bash

TEST_NO=$1

for i in {1..100}; do
echo -n Running test $i
START=$SECONDS
RUST_LOG=debug RUST_BACKTRACE=1 cargo test -- test_e2e_download --nocapture > /tmp/test_log${TEST_NO}.txt 2>&1
FAIL_CODE=$?
if [ $FAIL_CODE -ne 0 ]; then
    echo " - Failed in $(( SECONDS - START )) secs with code $FAIL_CODE"
    exit $FAIL_CODE
    fi
echo " - Finished in $(( SECONDS - START )) secs"
done

Yeah that was the one I mentioned.
The script is almost the same one I used lol :)

I used this one to run until it fails

set -o pipefail
while RUST_LOG=debug cargo test test_e2e_download -- --nocapture 2>&1 | tee /tmp/test-log; do :; done

I've merged a bunch of changes, including fixing this. It required changing some core logic, so new bugs might show up. But I think I nailed down why this particular issue was happening, described here in detail #198.

So closing this for now, feel free to reopen / open other ones if new bugs show up