jrouwe/JoltPhysics

Hitting 'crash' in Emscripten/WASM usage

Closed this issue · 10 comments

std::terminate is getting called when there are enough objects or enough load. This is running in Chrome.

With the default Jolt thread pool example (JPH::JobSystemThreadPool):

```
abort @ gui_client.js:3250
_abort @ gui_client.js:7737
$abort_message @ gui_client.wasm:0x561564
$demangling_terminate_handler() @ gui_client.wasm:0x561af0
$std::__terminate(void (*)()) @ gui_client.wasm:0x561cc2
$std::terminate() @ gui_client.wasm:0x561c9a
$std::__2::condition_variable::wait(std::__2::unique_lock<std::__2::mutex>&) @ gui_client.wasm:0x546613
$JPH::Semaphore::Acquire(unsigned int) @ gui_client.wasm:0x3e4bd8
$JPH::JobSystemWithBarrier::BarrierImpl::Wait() @ gui_client.wasm:0x3e3c8c
$JPH::JobSystemWithBarrier::WaitForJobs(JPH::JobSystem::Barrier*) @ gui_client.wasm:0x3e3f59
$JPH::PhysicsSystem::Update(float, int, int, JPH::TempAllocator*, JPH::JobSystem*) @ gui_client.wasm:0x48cceb
$PhysicsWorld::think(double) @ gui_client.wasm:0x145667
$GUIClient::timerEvent(MouseCursorState const&) @ gui_client.wasm:0xf6d2c
$doOneMainLoopIter() @ gui_client.wasm:0x167764
callUserCallback @ gui_client.js:7628
runIter @ gui_client.js:7937
Browser_mainLoop_runner @ gui_client.js:7852
requestAnimationFrame (async)
requestAnimationFrame @ gui_client.js:8172
Browser_mainLoop_scheduler_rAF @ gui_client.js:7762
Browser_mainLoop_runner @ gui_client.js:7855
requestAnimationFrame (async)
```

With my custom thread pool:
```
13:30:28.818 gui_client.js:2912 Uncaught RuntimeError: unreachable
$__trap @ gui_client.wasm:0x563858
___trap @ gui_client.js:13958
abort @ gui_client.js:3255
_abort @ gui_client.js:7737
$abort_message @ gui_client.wasm:0x56173d
$demangling_terminate_handler() @ gui_client.wasm:0x561cc9
$std::__terminate(void (*)()) @ gui_client.wasm:0x561e9b
$std::terminate() @ gui_client.wasm:0x561e73
$std::__2::condition_variable::wait(std::__2::unique_lock<std::__2::mutex>&) @ gui_client.wasm:0x5467ec
$JPH::Semaphore::Acquire(unsigned int) @ gui_client.wasm:0x3e4db1
$JPH::JobSystemWithBarrier::BarrierImpl::Wait() @ gui_client.wasm:0x3e3e65
$JPH::JobSystemWithBarrier::WaitForJobs(JPH::JobSystem::Barrier*) @ gui_client.wasm:0x3e4132
$JPH::PhysicsSystem::Update(float, int, int, JPH::TempAllocator*, JPH::JobSystem*) @ gui_client.wasm:0x48cec4
$PhysicsWorld::think(double) @ gui_client.wasm:0x145f89
$GUIClient::timerEvent(MouseCursorState const&) @ gui_client.wasm:0xf74cb
$doOneMainLoopIter() @ gui_client.wasm:0x168db9
callUserCallback @ gui_client.js:7628
runIter @ gui_client.js:7937
Browser_mainLoop_runner @ gui_client.js:7852
requestAnimationFrame (async)
requestAnimationFrame @ gui_client.js:8172
Browser_mainLoop_scheduler_rAF @ gui_client.js:7762
```

OK, I think this is what is happening (from https://en.cppreference.com/w/cpp/thread/condition_variable/wait):

> If these functions fail to meet the postconditions (lock.owns_lock()==true and lock.mutex() is locked by the calling thread), std::terminate is called. For example, this could happen if relocking the mutex throws an exception.

But of course in the Jolt code the lock is acquired immediately above the wait code, so I'm not sure what is going wrong.

```cpp
std::unique_lock lock(mLock);
mCount -= (int)inNumber;
mWaitVariable.wait(lock, [this]() { return mCount >= 0; });
```

Hello,

I've not seen this crash before and it also doesn't make sense to me for the same reasons you mention. I'm guessing that it is not happening in the native version of your app or you would have created an issue sooner.

You could try commenting out the #ifdef JPH_PLATFORM_WINDOWS bits in the Semaphore class so that your native Windows app runs the same code, and check whether you can repro the crash that way.
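For context, the idea is that the Semaphore class uses a native semaphore on Windows and falls back to a std::condition_variable elsewhere, so disabling the Windows branch forces the condition-variable path on a native Windows build too. A paraphrased sketch of that structure (based on the snippet quoted earlier, not Jolt's verbatim source):

```cpp
#include <condition_variable>
#include <mutex>

// Paraphrased sketch of a dual-path semaphore: the JPH_PLATFORM_WINDOWS
// branches are the ones being suggested for removal so that the generic
// std::condition_variable path runs on native Windows as well.
class Semaphore
{
public:
    void Release(unsigned int inNumber = 1)
    {
#ifdef JPH_PLATFORM_WINDOWS
        // Native Win32 semaphore path (omitted in this sketch)
#else
        std::lock_guard lock(mLock);
        mCount += (int)inNumber;
        if (inNumber > 1)
            mWaitVariable.notify_all();
        else
            mWaitVariable.notify_one();
#endif
    }

    void Acquire(unsigned int inNumber = 1)
    {
#ifdef JPH_PLATFORM_WINDOWS
        // Native Win32 semaphore path (omitted in this sketch)
#else
        // This is the wait that std::terminate is being raised from in
        // the WASM build, per the stack traces above
        std::unique_lock lock(mLock);
        mCount -= (int)inNumber;
        mWaitVariable.wait(lock, [this]() { return mCount >= 0; });
#endif
    }

private:
    std::mutex mLock;
    std::condition_variable mWaitVariable;
    int mCount = 0;
};
```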

In any case, I don't see any related bugs in the Emscripten issue tracker.

Maybe you can try another browser like Firefox (not Safari because that has issues with multithreading)?

Yeah, it's currently just happening in Chrome.
Firefox fails for other reasons (OpenGL calls fail once memory usage hits 2 GB), but it seems to execute this code OK.

Do you have any more info about the Safari multithreading issues?

I will try the JPH_PLATFORM_WINDOWS semaphore change on native.

Using condition_variable mWaitVariable on native Windows works fine (as expected).

> Do you have any more info about the Safari multithreading issues?

See for example jrouwe/JoltPhysics.js#110 and #577

Luckily the issue seems to have been fixed in a recent development build of Safari.

> Using condition_variable mWaitVariable on native Windows works fine (as expected).

A workaround that we could try is to make Linux/WASM use pthread semaphores. Maybe those will work (at least they won't call std::terminate).
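For what it's worth, a pthread-semaphore-based replacement could look something like the sketch below. This is hypothetical code, not part of Jolt; the point is that sem_wait cannot throw, so the std::terminate path in condition_variable::wait disappears entirely.

```cpp
#include <semaphore.h>

// Hypothetical semaphore built on POSIX sem_t instead of
// std::condition_variable. sem_wait/sem_post are plain C calls
// and never throw, so they cannot trigger std::terminate.
class PosixSemaphore
{
public:
    explicit PosixSemaphore(unsigned int inInitialCount = 0)
    {
        // Second argument 0 = shared between threads of one process only
        sem_init(&mSemaphore, 0, inInitialCount);
    }
    ~PosixSemaphore() { sem_destroy(&mSemaphore); }

    void Release(unsigned int inNumber = 1)
    {
        for (unsigned int i = 0; i < inNumber; ++i)
            sem_post(&mSemaphore);
    }

    void Acquire(unsigned int inNumber = 1)
    {
        for (unsigned int i = 0; i < inNumber; ++i)
            sem_wait(&mSemaphore);
    }

    // Non-blocking variant: returns true if a count was available
    bool TryAcquire()
    {
        return sem_trywait(&mSemaphore) == 0;
    }

private:
    sem_t mSemaphore;
};
```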

I'm hitting similar crashes in my non-Jolt code. So even if I run Jolt with the single-threaded task manager, I still need to solve the issue. I think it's most likely miscompilation or misexecution of atomic variables, resulting in double-frees of reference-counted objects.
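To illustrate the suspected failure mode: reference counting typically hinges on a single atomic decrement like the one below, and if that read-modify-write misbehaves (for example, operating on a stale value after the heap buffer moves), two owners can both observe the count reaching zero and free the object twice. A minimal sketch of the pattern, hypothetical and not the actual code from my project:

```cpp
#include <atomic>

// Minimal reference-counting pattern. Correctness depends entirely on
// fetch_sub being an atomic read-modify-write: exactly one caller may
// ever see the old value 1 and take responsibility for deletion.
struct RefCounted
{
    std::atomic<int> mRefCount{1}; // creator holds the first reference

    void AddRef()
    {
        // relaxed is sufficient for an increment: it only needs atomicity
        mRefCount.fetch_add(1, std::memory_order_relaxed);
    }

    // Returns true when the caller dropped the last reference and
    // must delete the object
    bool DecRef()
    {
        // acq_rel: the thread that sees the count hit zero must observe
        // all writes made by threads that released their reference earlier
        return mRefCount.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```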

If you figure out what it is then I'd be curious to hear about it.

I have narrowed it down a little. It seems to be a combination of memory growth and atomics: when the total memory allocated to the WASM program grows (the heap takes the form of a single large buffer, which gets reallocated), some atomic variable is read or written incorrectly, leading to the 'crash'.

That makes sense, reallocation and atomics don't go together very well. But then you should be able to work around it by disabling memory growth (-s ALLOW_MEMORY_GROWTH=0) and making sure that the initial memory block is large enough (-s TOTAL_MEMORY=XXX), both parameters for emcc.
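Concretely, the suggested emcc invocation would look something like the following. The source/output file names and the memory size are placeholders for illustration; note that TOTAL_MEMORY is an older alias for what newer Emscripten releases call INITIAL_MEMORY.

```shell
# Disable heap growth so the WASM memory buffer is never reallocated,
# and reserve a fixed-size heap up front (512 MB here as an example).
emcc main.cpp -o gui_client.js \
  -pthread \
  -s ALLOW_MEMORY_GROWTH=0 \
  -s TOTAL_MEMORY=536870912
```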

Yes, that is how I am working around it, and how I worked out that this is likely the issue.
Please feel free to close the issue if you want, as this seems to be another Emscripten/WASM bug.