Unstable test tests::e2e::test_e2e_download
Closed this issue · 8 comments
This test seems to be unstable in some runs it stucks for minutes, then ended with this error:
thread 'tests::e2e::test_e2e_download' panicked at crates/librqbit/src/tests/e2e.rs:220:27:
called `Result::unwrap()` on an `Err` value: Elapsed(())
so it looks on timeout to finish torrent download ...
When running with logging enabled, looks like it stucks early in download:
2024-08-18T10:33:12.915377Z INFO stats_printer: librqbit::tests::e2e: progress_percent="0.10%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 0, seen: 1, dead: 1, not_needed: 0, steals: 0 }"
Below is log from one failed test (failure rate seems to be something like 30% or so, so it was relatively easy to replicate)
test_log.txt.zip
The problem was in listen ports racing with each other and no "good" server being able to listen. Some servers are intentionally "bad" for chaos / recovery testing. As visible from your log there was only 1 out of 128 servers that was able to spin up
Hmm, no it still bugged out on continuous testing, looking again
Tried a few more times, it's working much better, however there's still a rare issue where it couldn't download to the end, but it wasn't exposed previously. Probably related to #149
I've tested latest main, problem still appears, but less often, say around 10% of runs, behavior, while previously here were only 1 live connection, now there is plenty of connected peers, but they disconnect before torrent download is finished, see this extract from log:
2024-08-18T19:26:43.770000+0200 INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 1, live: 27, seen: 128, dead: 100, not_needed: 0, steals: 144 }" ┤
│2024-08-18T19:26:43.877000+0200 INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 1, connecting: 0, live: 27, seen: 128, dead: 100, not_needed: 0, steals: 144 }" ┤
│2024-08-18T19:26:43.957000+0200 INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 18, seen: 128, dead: 110, not_needed: 0, steals: 145 }" ┤
│2024-08-18T19:26:44.081000+0200 INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 15, seen: 128, dead: 113, not_needed: 0, steals: 146 }" ┤
│2024-08-18T19:26:44.159000+0200 INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 11, seen: 128, dead: 117, not_needed: 0, steals: 147 }" ┤
│2024-08-18T19:26:44.264000+0200 INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 12, seen: 128, dead: 116, not_needed: 0, steals: 147 }" ┤
│2024-08-18T19:26:44.350000+0200 INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 4, seen: 128, dead: 124, not_needed: 0, steals: 148 }" ┤
│2024-08-18T19:26:44.449000+0200 INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 3, seen: 128, dead: 125, not_needed: 0, steals: 148 }" ┤
│2024-08-18T19:26:44.548000+0200 INFO stats_printer: librqbit::tests::e2e: progress_percent="89.30%" peers="AggregatePeerStats { queued: 0, connecting: 0, live: 0, seen: 128, dead: 128, not_needed: 0, steals: 148 }"
Here is full log:
Here is a script to replicate test failure:
#!/bin/bash
TEST_NO=$1
for i in {1..100}; do
echo -n Running test $i
START=$SECONDS
RUST_LOG=debug RUST_BACKTRACE=1 cargo test -- test_e2e_download --nocapture > /tmp/test_log${TEST_NO}.txt 2>&1
FAIL_CODE=$?
if [ $FAIL_CODE -ne 0 ]; then
echo " - Failed in $(( SECONDS - START )) secs with code $FAIL_CODE"
exit $FAIL_CODE
fi
echo " - Finished in $(( SECONDS - START )) secs"
done
Yeah that was the one I mentioned.
The script is almost the same one I used lol :)
I used this one to run until it fails
set -o pipefail
while RUST_LOG=debug cargo test test_e2e_download -- --nocapture 2>&1 | tee /tmp/test-log; do :; done
I've merged a bunch of changes, including fixing this. It required changing some core logic, so new bugs might show up. But I think I nailed down why this particular issue was happening, described here in detail #198.
So closing this for now, feel free to reopen / open other ones if new bugs show up