pthread should use atomic load/store for access of contended fields

Sorry if this is the wrong repo for this issue, I guess it is more related to LLVM but I am finding it through analysis of wasi-libc and WebAssembly threads proposal specifically, and perhaps the change should be scoped to just libc.

I have been working on threads support for wazero, and for example run tests with this simple use of pthread mutex

https://github.com/tetratelabs/wazero/blob/main/experimental/testdata/pthread.c

One cool thing about wazero being a Go runtime is that the Go race detector can be applied. When trying to run that code concurrently from multiple host threads, each with an instantiated wasm module accessing an imported shared memory (the code in main does not reflect this yet, I am trying in https://github.com/tetratelabs/wazero/compare/main...anuraaga:wazero:mod-per-routine?expand=1), races are shown. It's hard to find the actual C code that causes a race, but through some pattern matching, I found for example this code during locking

wasi-libc/libc-top-half/musl/src/thread/pthread_mutex_timedlock.c

Line 77 in a6f8713

while ((r=__pthread_mutex_trylock(m)) == EBUSY) {

trylock is a CAS while r = m->_m_lock is a load of a volatile field. I see the compiled Wasm uses i32.atomic.rmw.cmpxchg for the CAS, but just i32.load, not an atomic load for the volatile read. I can't say I'm an expert on this subject, but my guess is that a non-atomic load is used because on normal hardware, a 32-bit load is atomic vs other atomic instructions, there's no such thing as a non-atomic load per se. However, WebAssembly can be run in many environments, and in the case of wazero's interpreter mode (compiler mode has not been implemented yet and would likely not exhibit this race), CAS operations are not implemented with atomic instructions (Go does not provide the APIs to do so for all the instructions in the thread proposal) and uses locks instead.

cmpxchg locks
https://github.com/tetratelabs/wazero/blob/90eba1b81cacd0a5c1b4351d4d5e73c57d13c4b1/internal/engine/interpreter/interpreter.go#L4198

atomic.load locks
https://github.com/tetratelabs/wazero/blob/main/internal/engine/interpreter/interpreter.go#L3974

normal load doesn't lock and will race with any cmpxchg operation
https://github.com/tetratelabs/wazero/blob/main/internal/engine/interpreter/interpreter.go#L755

Because wasm32-wasi-threads will be executed on arbitrary machines, does it make sense to use atomic.load for volatile read instead of normal one? While I understand that C volatile is not supposed to be a guarantee of atomicity, I wonder if in practice code like in musl does expect it. Or perhaps wasi-libc's fork of musl needs to be made more precise by explicitly specifying atomic load, because it cannot assume a fixed set of supported architectures like musl does. Sorry if this is completely off the mark.

I don't know why musl uses non-atomic loads for these mutex fields. It may be that the code is written to support pre-C11 compilers which don't have <stdatomic.h>, and it's using non-atomic loads as a replacement for C11's memory_order_relaxed, and just assuming that C compilers won't tear aligned 32-bit loads on any reasonable computer, so it'll compile into the intended machine instructions. It uses non-atomic loads for other things too, such as _m_type, which is non-volatile, so I'd guess making volatile loads atomic wouldn't completely eliminate the problem you're seeing.

As I understand it, Wasm semantics don't guarantee than a non-atomic i32.load is tear-free. So if I understand musl's assumption of a "reasonable computer" here, it's not satisfied under wasm.

If this is the case, then one way to fix it would be to modify the musl code to add _Atomic qualifiers to the _m_lock, m_type and any other affected fields.

Thanks @sunfishcode - I agree that musl is probably making an assumption about "reasonable computer"s. And indeed, probably changing volatile operations to atomic is overkill, in the end that's not how C defines volatile.

Glad to see that it could be a simple addition of qualifiers to certain fields - I don't know pthread well enough to find all the fields that may need to be tweaked but hoping someone will be able to understand it better. As a reminder, mutex is this weird union+macro combo, hope the Atomic qualifiers will work on it :D

wasi-libc/libc-top-half/musl/include/alltypes.h.in

Line 109 in a6f8713

    
           TYPEDEF struct { union { int __i[sizeof(long)==8?10:6]; volatile int __vi[sizeof(long)==8?10:6]; volatile void *volatile __p[sizeof(long)==8?5:6]; } __u; } pthread_mutex_t;

wasi-libc/libc-top-half/musl/src/internal/pthread_impl.h

Line 95 in a6f8713

#define _m_lock __u.__vi[1]

Yes, musl code is written in C99 so it not surprising they don't use C11 atomics.

Its does look like the assumption is that loads (and stores?) to volatile are being assumed to be atomic. Notice that musl's atomic.h and atomic_arch.h declares all of kind of atomic opts with the a_ prefix such as a_cas and a_store but there is no a_load..

Are the detected races actually causing bugs in real programs? Having run basically the same musl code under emscripten in browsers for several years now, I don't think we have seen these issue show up in practice.

I don't expect any issues when implementing threads with wazero's compiler backend, which I suspect is closer to how browsers execute wasm, by translating into native instructions. The issue is just in interpreter mode, where Go doesn't provide the primitives needed to implement both atomics and tear-free loads, at least without making those loads also atomic via locking. AFAIK the current interpreter with is a perfectly valid Wasm implementation and it'd be nice if wasi-libc could be tweaked to work with it. If it's not practical, maybe we just have to live with a difference between support for interpreter and compiler though.

That sounds reasonable. If you come up with a proposed musl patch, perhaps you could also sent it to emscripten's fork of musl too?

Even though the wasm spec doesn't guarantee that an aligned i32.load is tear-free, there are probably no production wasm engines where an aligned i32.load uses multiple load instructions to load the value, so it's likely that everyone using musl on wasm with threads has this bug and it just isn't surfaced.

Here's a musl commit discussing musl's use of volatile and its non-use of atomic and the absence of a_load:

https://git.musl-libc.org/cgit/musl/commit/?id=56fbaa3bbe73f12af2bfbbcf2adb196e6f9fe264

I'm very surprised by this, because musl is usually very fastidious about these kinds of details. I would have assumed musl would have an a_load that uses C11 atomics when compiled by a C11 compiler, and only does the non-atomic thing it it's being compiled by a C99 compiler. I'll do some more investigation, but I expect I'll report this to upstream Musl.

Until then, what should we do? Options include:

add _Atomic to these fields, which should fix the bug, but could be a significant performance hit on some architectures?
go through all of Musl and rewrite all uses of these fields to use C11 atomic-load functions?
ignore the problem for now and wait to see what upstream Musl does?

Musl's atomic macros typically work via inline assembly. While its true they could use C11 atomics where available, there is no need move to C11 just to implement a_load. The existing C99+asm mechanism should work fine, no?

Yes, you can implement a_load with C11 APIs or asm, but right now the code doesn't use a_load, so if we take that path we have to go through all of Musl and make it use a_load where it needs to.

Agreed, although auditing all the access locations and adding a_load macros seem more in keeping with the musl coding style, and might even be upstream-able?

Could it also be that not all accesses to a given variable need/want to be atomic.. or is it undefined behaviour to perform non-atomic reads/writes of a given location (even when they don't race)?

Finally, I do wonder how much this change might slow things down, but I guess we should measure and see.

Yes, I think there might be problems in the implementation.

When I was using pthread_mutex_lock / pthread_mutex_unlock with WASI threads on WAMR, the thread sanitizer was complaining about unsafe concurrent load/store operations. When I tried implementing lock and unlock manually using wasm primitives, the problem went away, suggesting that it was coming from the wasi-libc implementation.

I asked about this in WebAssembly/threads, and was pointed to this table which states that Wasm non-atomic integer aligned loads are non-tearing up to 32 bits.

This means I was mistaken above. So, Musl's code is still surprising to me, but for the purposes of this issue, it's no worse on wasm than on other platforms. We could theoretically have a problem if Musl ever does any f32/f64 or i64 loads and relying on them not tearing, but I'd guess it doesn't do those things.

For people running race detectors, I think this means that the compiled code we have today is correct, and it'll be necessary to teach these detectors that wasm i32.load and similar can observe values stored on other threads without synchronization and it's not necessarily a bug.

(Following the links back here) if I'm understanding correctly, is the current musl code doing something like this?

while (true) {
  lock_status = non_atomic_load
  if (lock_status = held by someone else) {
    continue
  } else {
    try to acquire the lock with a CAS
    break if successful
  }
}

This is a common technique for optimising a spin-lock for the contended case (deferring the expensive CAS until you observe the lock has become free). I'd also expect non-Wasm thread sanitisers to potentially label this as a data-race, since from the C11 point of view it's technically undefined behaviour (instead the initial lock status check would need to be a relaxed atomic), but it should be safe to compile to an aligned Wasm i32 load because of the tear-free properties @sunfishcode noted above.

Actually, I need to add an additional caveat here. I believe it's technically permitted for a Wasm compiler to hoist the non-atomic load out of the loop, potentially leading to a deadlock. It appears that musl intends that this is prevented by making the lock volatile. It's commonly accepted wisdom that accesses to volatiles can't be hoisted out of loops by a compiler, but this isn't currently being preserved by the Wasm compilation - the access is being compiled to something in Wasm that can be hoisted when the Wasm is itself compiled.

I don't know if there's any Wasm implementation that's smart enough to do this hoist in practice, because it would need to do some fairly subtle reasoning about the loop terminating iff the CAS succeeds.

Thanks for all the info, great to see people with more practical experience with C here.

since from the C11 point of view it's technically undefined behaviour (instead the initial lock status check would need to be a relaxed atomic), but it should be safe to compile

If this is a situation where it is a race, but a safe race, then perhaps there's nothing we can do but live with race warnings for musl code being executed with the wazero interpreter or wamr (from what I understand of @eloparco's comment).