mdaffin/loopdev

Race in tests? loop device count before and after mismatch


Hi,

We are getting a test failure on arm64 in Ubuntu mantic with rust-loopdev 0.4.0-3[1]:

864s ---- detach_a_backing_file_default stdout ----
864s thread 'detach_a_backing_file_default' panicked at 'assertion failed: `(left == right)`
864s left: `0`,
864s right: `1`: there should be no loopback devices mounted', tests/integration_test.rs:176:5

I added some print statements and ran it in an arm64 VM (which normally has 3 loop devices in use already), and it looks like the loop device count is changing under our feet:

---- detach_a_backing_file_with_offset stdout ----
XXX num_devices_at_start=4
XXX list_device(None).len()=4
XXX start list_device(None): [LoopDeviceOutput { name: "/dev/loop1", size_limit: Some(0), offset: Some(0), back_file: Some("/var/lib/snapd/snaps/lxd_25116.snap") }, LoopDeviceOutput { name: "/dev/loop2", size_limit: Some(0), offset: Some(0), back_file: Some("/var/lib/snapd/snaps/snapd_19459.snap") }, LoopDeviceOutput { name: "/dev/loop0", size_limit: Some(0), offset: Some(0), back_file: Some("/var/lib/snapd/snaps/core22_821.snap") }], len=3, var=4
XXX end list_device(None): [LoopDeviceOutput { name: "/dev/loop1", size_limit: Some(0), offset: Some(0), back_file: Some("/var/lib/snapd/snaps/lxd_25116.snap") }, LoopDeviceOutput { name: "/dev/loop2", size_limit: Some(0), offset: Some(0), back_file: Some("/var/lib/snapd/snaps/snapd_19459.snap") }, LoopDeviceOutput { name: "/dev/loop0", size_limit: Some(0), offset: Some(0), back_file: Some("/var/lib/snapd/snaps/core22_821.snap") }], num_devices_at_start=4
thread 'detach_a_backing_file_with_offset' panicked at 'assertion failed: `(left == right)`
  left: `3`,
 right: `4`: there should be no loopback devices mounted', tests/integration_test.rs:180:5

This is the changed function:

fn detach_a_backing_file(offset: u64, sizelimit: u64, file_size: i64) {
    let num_devices_at_start = list_device(None).len();
    println!("XXX num_devices_at_start={}", num_devices_at_start);
    println!("XXX list_device(None).len()={}", list_device(None).len());
    let _lock = setup();

    println!("XXX start list_device(None): {:?}, len={}, var={}", list_device(None), list_device(None).len(), num_devices_at_start);
    {
        let file = create_backing_file(file_size);
        attach_file(
            "/dev/loop3",
            file.to_path_buf().to_str().unwrap(),
            offset,
            sizelimit,
        );

        let ld0 = LoopDevice::open("/dev/loop3")
            .expect("should be able to open the created loopback device");

        ld0.detach()
            .expect("should not error detaching the backing file from the loopdev");

        file.close().expect("should delete the temp backing file");
    };

    std::thread::sleep(std::time::Duration::from_millis(500));

    println!("XXX end list_device(None): {:?}, num_devices_at_start={}", list_device(None), num_devices_at_start);
    assert_eq!(
        list_device(None).len(),
        num_devices_at_start,
        "there should be no loopback devices mounted"
    );
    detach_all();
}

Looks like list_device(None) right after the setup() call already returns one fewer loop device. Which one? I don't know, because if I print list_device(None) right at the start of the test, the failure doesn't happen anymore ;)

I know this is not the latest version of rust-loopdev, and I checked the git log to see if anything there could be a fix for this, but my Rust knowledge is close to zero.

Here is a full test run showing the failure[2].

In my test VM, I have to run the tests multiple times (but not many) until I see the failure. It doesn't matter whether it has one or two CPUs.

  1. https://code.launchpad.net/~git-ubuntu-import/ubuntu/+source/rust-loopdev/+git/rust-loopdev/+ref/applied/ubuntu/devel:
  2. https://autopkgtest.ubuntu.com/results/autopkgtest-mantic/mantic/arm64/r/rust-loopdev/20230814_100137_214a4@/log.gz

Hi!

I am not @mdaffin but I can suggest two things.

  1. Some tests should not be run as part of packaging, because they are too flaky. For loopback devices, so much depends on the environment in which the tests are run that they may fail for reasons completely unrelated to the package you are testing. For this reason, when I'm packaging I will sometimes omit tests. You could consider doing that, too.

  2. The symptoms you describe may have to do with that let _lock ... call. Would anything change if it were moved before the list_device() call? That lock exists solely to prevent tests from interacting with each other. Maybe that test is interacting with some other test before it acquires the lock. This is only a guess.

I tried moving the lock to the beginning of the test, but then another type of error showed up.

The other thing I tried was running the tests with -- --test-threads=1, but that didn't change a thing, which was surprising to me. I did that by appending it to the test command in the DEP-8 control file, like this:

Test-Command: /usr/share/cargo/bin/cargo-auto-test loopdev 0.4.0 --all-targets --no-default-features -- --test-threads=1

In the test log[1], I could confirm that the /tmp/blabla binary that was being run did have --test-threads=1 appended to it (search for "panicked"):

(...)
2091s      Running `/tmp/tmp.RKG9wiYIkO/target/aarch64-unknown-linux-gnu/debug/deps/integration_test-d78a2f903ffb1306 --test-threads=1`
(...)
2092s test detach_a_backing_file_default ... FAILED
(...)
2092s ---- detach_a_backing_file_default stdout ----
2092s thread 'detach_a_backing_file_default' panicked at 'assertion failed: `(left == right)`
2092s   left: `0`,
2092s  right: `1`: there should be no loopback devices mounted', tests/integration_test.rs:176:5
  1. https://autopkgtest.ubuntu.com/results/autopkgtest-mantic-ahasenack-plusoneweek/mantic/arm64/r/rust-loopdev/20230815_181144_e65d4@/log.gz

Setting the test threads to 1 can't hurt. Moving the lock up was useful because it shows consistency in how you get the error: right is 1 more than left, regardless. You could get really primitive and bump the sleep value from half a second to, e.g., 10 seconds and see what happens.

I tried bumping that sleep to 10s, but it still fails across multiple runs. The 10s delay is visible between some of the test output lines:

1590s     Finished test [unoptimized + debuginfo] target(s) in 3m 04s
1590s      Running `/tmp/tmp.9HtpXUL6QB/target/aarch64-unknown-linux-gnu/debug/deps/loopdev-cb50880c722e0d59 --test-threads=1`
1590s 
1590s running 5 tests
1590s test bindings::bindgen_test_layout___kernel_fd_set ... ok
1590s test bindings::bindgen_test_layout___kernel_fsid_t ... ok
1590s test bindings::bindgen_test_layout_loop_config ... ok
1590s test bindings::bindgen_test_layout_loop_info ... ok
1590s test bindings::bindgen_test_layout_loop_info64 ... ok
1590s 
1590s test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
1590s 
1590s      Running `/tmp/tmp.9HtpXUL6QB/target/aarch64-unknown-linux-gnu/debug/deps/integration_test-7b7939bb67c218da --test-threads=1`
1590s 
1590s running 13 tests
1590s test attach_a_backing_file_default ... ok
1590s test attach_a_backing_file_with_offset ... ok
1590s test attach_a_backing_file_with_offset_overflow ... ok
1590s test attach_a_backing_file_with_offset_sizelimit ... ok
1590s test attach_a_backing_file_with_sizelimit ... ok
1590s test attach_a_backing_file_with_sizelimit_overflow ... ok
1600s test detach_a_backing_file_default ... FAILED
1600s test detach_a_backing_file_with_offset ... FAILED
1600s test detach_a_backing_file_with_offset_overflow ... FAILED
1600s test detach_a_backing_file_with_offset_sizelimit ... FAILED
1600s test detach_a_backing_file_with_sizelimit ... FAILED
1600s test detach_a_backing_file_with_sizelimit_overflow ... FAILED
1600s test get_next_free_device ... FAILED