mitsuhiko/fragile

`thread_id` overflow protection is insufficient

Opened this issue · 4 comments

thread_id::next() uses NonZeroUsize::new() to guard against overflow. However, its current behavior of simply raising a panic is insufficient for soundness: if the panic is ignored (through catch_unwind(), or because it unwinds a different thread), the next call to thread_id::next() will return a duplicate value. This is feasible on 32-bit targets, since it only requires spawning 4.3 billion threads, which don't need to run simultaneously. To demonstrate, the following program causes a segmentation fault by sending a &Cell to another thread and reading a partially written value:

// cargo +nightly build --release --target i686-unknown-linux-gnu
use fragile::{thread_id, Fragile};
use std::{cell::Cell, hint, thread};

fn main() {
    let empty = String::new();
    let s = Cell::new("Hello, world!");
    let fragile = Fragile::new(&s);

    // simulate usize::MAX - 9 iterations of the loop
    thread_id::set(usize::MAX - 8);

    for _ in 0..10 {
        let _ = thread::spawn(|| {
            Fragile::new(());
        })
        .join();
    }

    thread::scope(|scope| {
        scope.spawn(|| {
            let s = fragile.into_inner();
            loop {
                hint::black_box(s);
                let v = s.get();
                hint::black_box(s);
                if v.as_ptr() == empty.as_ptr() && v.len() > 0 {
                    println!("{v}");
                }
            }
        });

        loop {
            s.set(&empty);
            hint::black_box(&s);
            s.set("Hello, world!");
            hint::black_box(&s);
        }
    });
}

For this example, I added a helper method thread_id::set() that simply sets the value of COUNTER to its argument. The same overflow can also be reached by spawning all the threads manually, but that would take 60 hours on my machine:

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::thread;
    use std::time::Duration;

    static COUNT: AtomicU64 = AtomicU64::new(0);
    let limit = usize::MAX as u64 + 1;
    for i in 0..limit {
        if i % 0x10000 == 0 {
            println!("{i:#x}");
        }
        thread::spawn(|| {
            Fragile::new(());
            COUNT.fetch_add(1, Ordering::Release);
        });
    }
    while COUNT.load(Ordering::Acquire) < limit {
        thread::sleep(Duration::from_millis(10));
    }
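The set() helper above is not part of the published crate; a minimal sketch of what it could look like, assuming COUNTER is a stand-in for the crate's private atomic id counter:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Assumed stand-in for the crate's private id counter.
static COUNTER: AtomicUsize = AtomicUsize::new(0);

// Hypothetical test-only helper: jump the counter to an arbitrary value
// so the overflow can be provoked without spawning billions of threads.
pub fn set(value: usize) {
    COUNTER.store(value, Ordering::SeqCst);
}
```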

There are a few different solutions to this problem that I can think of:

  • Use the standard ThreadId.
  • Make COUNTER an AtomicU64.
  • Load the value of COUNTER and check it for overflow before incrementing it. (This is a bit tricky; one might take inspiration from the standard ThreadId.)
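The third option could be sketched roughly as follows, assuming COUNTER stays an AtomicUsize (names are placeholders, not the crate's actual internals; the standard ThreadId uses the same compare-exchange pattern):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static COUNTER: AtomicUsize = AtomicUsize::new(0);

/// Hands out ids 1, 2, 3, ... and aborts the process on exhaustion
/// instead of unwinding, so the failure cannot be swallowed by
/// catch_unwind() or a thread boundary.
fn next() -> usize {
    let mut last = COUNTER.load(Ordering::Relaxed);
    loop {
        let Some(id) = last.checked_add(1) else {
            // A panic could be caught, after which the wrapped counter
            // would hand out duplicate ids; aborting is the sound choice.
            std::process::abort();
        };
        match COUNTER.compare_exchange_weak(last, id, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return id,
            Err(observed) => last = observed,
        }
    }
}
```

Checking before incrementing means the counter can never silently wrap: a concurrent caller either wins the compare-exchange or retries with the freshly observed value.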

This is not entirely wrong. You are correct that this is a theoretical problem, but I think the provided level of protection is sufficient for real-world applications. I don't think it's worth fixing, because the only way this could become an issue in practice is by intentionally abusing it until it finally succeeds.

The Rust stdlib has a similar issue with the Any type, where the hash of the type can in theory collide (rust-lang/rust#10389).

I just measured the program on my main desktop, and it would only take 10 hours to spawn 4.3 billion threads.

I don't think it's worth fixing though because the only way that this could be an issue in practice is that you are intentionally trying to abuse it until it finally succeeds.

The risk here is that any sufficiently long-lived 32-bit program that continually spawns and joins threads, and catches panics at the thread boundary (e.g., a web server), loses the full Send protection of Fragile. As an example of a use case this could impact, I can imagine such a program holding a shared array of Fragiles and iterating through it until it finds one that is_valid() for the current thread (even though this design would be far from optimal).

Is there any particular reason that this crate needs a 32-bit thread ID, instead of just making it 64-bit like the standard ThreadId?
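For scale: a 64-bit counter bumped once per nanosecond would take roughly 580 years to wrap, so a plain fetch_add stays sound for any realistic workload. A minimal sketch (names are placeholders, not the crate's actual internals):

```rust
use std::num::NonZeroU64;
use std::sync::atomic::{AtomicU64, Ordering};

static COUNTER: AtomicU64 = AtomicU64::new(0);

// fetch_add returns the previous value, so adding 1 yields the fresh,
// nonzero id; overflowing a u64 by spawning threads is not realistic.
fn next() -> NonZeroU64 {
    NonZeroU64::new(COUNTER.fetch_add(1, Ordering::Relaxed) + 1)
        .expect("64-bit thread-id counter overflowed")
}
```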

Your example assumes that all of these fragile objects are actually hanging around, which would require a lot of memory. When the objects and threads are recycled, you're quite unlikely to run into a collision under real-world circumstances.

For what it's worth, I'm not opposed to this change (see the open PR #27), but I'm not particularly convinced that it is actually a problem. There is not much of a downside to making the change, so I might just do it with the next release, but I don't believe this is a reason for me to yank the old releases once the change goes out.

My apologies, I did not mean to suggest that anything so drastic as yanking the old versions should be done. I was just trying to argue in favor of fixing this issue at some point instead of letting it stand; I had not noticed the open PR.