Random race condition during `engine->flushAndWait()` in offscreen rendering mode
elias-Mimetrik opened this issue · 42 comments
Describe the bug
Rendering in headless mode crashes randomly, especially when more than a single renderer is used.
To Reproduce
Steps to reproduce the behavior:
I construct the renderer components as in this thread (#3318); I know this should work given the comment at the end.
This is my code to render a frame:
static void callback(void* buffer, size_t size, void* user) {
    cv::Mat image = cv::Mat(500, 500, CV_8UC4, static_cast<uint8_t*>(buffer)).clone();
    image.convertTo(image, CV_8U);
    cv::cvtColor(image, image, cv::COLOR_RGBA2RGB);
    cv::imwrite("aaa.png", image);
    delete[] static_cast<uint8_t*>(buffer);
}
...
// Offscreen rendering:
size_t size = 500 * 500 * 4;
uint8_t* buffer = new uint8_t[size];
filament::backend::PixelBufferDescriptor pd(buffer, size,
    filament::backend::PixelBufferDescriptor::PixelDataFormat::RGBA,
    filament::backend::PixelBufferDescriptor::PixelDataType::UBYTE,
    callback, nullptr);
this->filament_renderer->beginFrame(this->swap_chain);
this->filament_renderer->render(this->view);
this->filament_renderer->readPixels(0, 0, 500, 500, std::move(pd));
this->filament_renderer->endFrame();
this->engine->flushAndWait(); // <<<<<<<<<<<<<<<<<<<<<<<<< Exception here!
Basically, when I initialise two renderer objects (std::unique_ptr<mimetrik::occlusioncamera::Renderer>), the first time I render an image using one of them, it crashes at the this->engine->flushAndWait(); call. I cannot understand why this is happening. Are we limited to using a single Renderer?
Expected behavior
I expect a frame to be rendered successfully without any exceptions.
Desktop (please complete the following information):
- OS: Ubuntu 24.04
- GPU: RTX 4080
- Backend: Vulkan
Smartphone (please complete the following information):
- Device: [e.g. Pixel 2]
- OS: [e.g. Android Pie 9.0]
Additional context
Add any other context about the problem here.
what's the crash?
I see a stack, but that doesn't tell me what the crash is? Is it a segfault? An assert?
When you say you're using two renderers... are you using them from different threads?
Yes, they are created in the same thread.
Two renderers are defined as properties in a class, and initialised sequentially in the constructor. This all works okay. When rendering, the first one calls the render frame function (the exact function above). This rendering process executes okay without exceptions.
Then the second one calls the same function. I am doing this in sequence, no threading and nothing in parallel. It crashes at the flushAndWait() command for the second renderer. We think it's a race condition.
Update 1:
I just updated my Filament to the latest version, 1.63.0, and the code above started to give the same exception at flushAndWait() even when I'm using a single renderer!
Update 2:
I reverted back to the version I was using before, 1.59.1, and even a single renderer crashes, despite exactly the same code working an hour ago and despite clearing the build directory! This crash is completely random!
Are you able to repro with the GL backend?
Surprisingly, this issue does not happen with the OpenGL backend (single renderer), but the rendered frames are always black (this is the reason I'm not using it in the first place).
When I use two renderers with the OpenGL backend, it crashes at the same line.
Are you sure you don't have another thread that's destroying the Engine while it's waiting on flushAndWait?
I'm not doing any threading at all, other than the threads created by filament internally.
Also, I tried adding a sleep of more than one second at the end of the rendering, but it would still crash while waiting!
I have created a standalone working renderer. The rendering is done here: https://github.com/elias-Mimetrik/simple_renderer_filament/blob/master/source/Renderer.cpp#L347
The engine is destroyed in the destructor on L195.
Note that the single-renderer code crashes at random (yesterday it was crashing, but not today). The two renderers crash every time (they just need commenting out in the DataGenerator class).
Hi,
I've been battling unsuccessfully with readPixels() as well over the past few days. I'm on Windows (latest VS/MSVC), so this may provide some additional insights. I've tried running @elias-Mimetrik's mini-example on my machine. I think there are two situations when the bug / potential race condition can be triggered:
- If I render only one frame (and then call readPixels()), and then exit the program, it crashes about 10-20% of the time. And it crashes when returning from main:
You can see here that the main thread is returning from main() and calling destructors, and then it looks to be waiting. Meanwhile the Engine thread is still running (presumably executing the readPixels() callback) and accessing memory that is no longer valid.
If I re-run the app 10 times, it crashes about 1-2 times, and runs through fine the other 8-9 times.
Same crash with the OpenGL backend, about 1-2 times out of 10:
I was previously able to manually trigger the crash (or a similar one?) by calling this->engine->flushAndWait(); after endFrame(), and then it would always crash (100% of the time), but today this actually works and doesn't seem to crash - at least so far, as if this is a race condition, it's probably just a matter of when it will crash again, not if.
- If I render more than one frame in a loop (say, 5), then it sometimes crashes on the second frame.
The crash is again in the Engine thread on the loadFromCpu operation, but this time the main thread is not destructing anything - it is rendering another frame or doing something else (sometimes the main thread is executing the readPixels callback, which also runs on the main thread, when the Engine thread crashes on loadFromCpu).
I'm not sure about the logic of Filament internals but it looks to me like there might be a race condition where the engine thread (executing the readPixels instructions?) is trying to read something, while the main thread has already "moved on" and invalidated that memory.
This is very unpredictable - on some days, it seems to crash reliably every single time on the second frame, on some days, it occurs 0 times out of 50...
It seems like the Filament frame_generator example does not crash (or at least hasn't crashed yet), but I can't see a difference between what it is doing and what I am doing (which is also the same as @elias-Mimetrik's example); I copied the code and the sequence of calls from the frame_generator sample. Perhaps there's something in the FilamentApp code somewhere that causes a wait or different timing/threading behaviour which prevents the crash (or makes it much less likely), as opposed to my & @elias-Mimetrik's simpler code. Or maybe there is a subtle difference and our code has a bug that neither of us has spotted?
Over the past few days I've also tried to get the readPixels() code to run in a blocking way, i.e. to find a way to wait for the callback to finish before rendering the next frame. The readPixels() docs suggest this should be possible:
filament/filament/include/filament/Renderer.h
Line 473 in cccf237
However, I've tried various approaches - from passing an std::atomic<bool> into the callback via user data, to flushAndWait(), to using the atomic together with this_thread::sleep_for(...) and engine->pumpMessageQueues() - and I haven't found a way to use the Fence class; all the approaches still crashed with the same errors as above.
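For context, the kind of code I've been trying looks roughly like the sketch below (only a sketch of my own workaround attempt, not an official Filament pattern; buffer, size, renderer and engine are as in the snippets above, and the atomic flag passed through the user-data pointer plus the polling loop are my own additions; needs <atomic>, <chrono>, <thread>):

// Sketch: signal completion through the user-data pointer, then wait on the
// main thread after flushAndWait(), pumping user callbacks until the flag flips.
std::atomic<bool> read_done{false};

filament::backend::PixelBufferDescriptor pd(buffer, size,
    filament::backend::PixelBufferDescriptor::PixelDataFormat::RGBA,
    filament::backend::PixelBufferDescriptor::PixelDataType::UBYTE,
    [](void* b, size_t, void* user) {
        // ... copy the pixels out of `b` here ...
        delete[] static_cast<uint8_t*>(b);
        static_cast<std::atomic<bool>*>(user)->store(true);
    },
    &read_done);

// beginFrame()/render() as in the earlier snippet...
renderer->readPixels(0, 0, 500, 500, std::move(pd));
renderer->endFrame();
engine->flushAndWait();
while (!read_done.load()) {                    // still crashed intermittently for me
    engine->pumpMessageQueues();               // delivers pending user callbacks
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}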
@pixelflinger Is there any chance you could help with some simple code for this that should work? Then I could try whether it works on my machine, and if it doesn't, we might have uncovered a Filament bug on Windows. That would be much appreciated. And if there isn't a bug, then there is some working code that will probably also help other people - I've seen quite a few people struggle with offscreen rendering / readPixels. And as a final question, if I may: is there a better way to do offscreen rendering than with readPixels()?
I'm using the latest Filament version 1.63.0, from the precompiled .zip from GitHub Releases.
Thanks a lot for any help.
This is a lot of good information. Thank you. Will take a look soon.
@patrikhuber I'm not understanding the bug yet, but I think you can work around it by destroying your filament::Renderer yourself instead of letting the Engine clean it up (this would be a lot cleaner anyway).
I can tell that's what's happening because of the stack: ~FRenderer is called from FEngine::destroy.
@pixelflinger Hmm, I'm already destroying things manually, if that's what you mean: this is in the destructor of my own (not Filament) Renderer class:
this->engine->destroy(this->scene);
this->engine->destroy(this->view);
this->engine->destroy(this->swap_chain);
this->engine->destroy(this->filament_renderer);
this->engine->destroy(&this->engine);
I can't delete the filament::Renderer more directly (e.g. delete this->filament_renderer; or this->filament_renderer->~Renderer();) because the destructor is inaccessible. Is there any other way to delete it, or is that what you meant?
I think we're seeing that ~FRenderer is called from FEngine::destroy in the call stack because the renderer is being destroyed "by" the engine by this line: this->engine->destroy(this->filament_renderer).
Also this would only solve crash 1), but not crash 2), which occurs while the main thread is running (and not destructing anything).
So the FEngine shouldn't be destroying the FRenderer if it's already been destroyed like you're showing here. So something is fishy.
@patrikhuber are you calling Engine::flushAndWait() at any point during shutdown? Can you try to do an Engine::flushAndWait() before destroying the engine?
@patrikhuber @elias-Mimetrik and none of these crashes happen if you don't call readPixels?
@patrikhuber are you calling Engine::flushAndWait() at any point during shutdown? Can you try to do an Engine::flushAndWait() before destroying the engine?
No, I wasn't calling Engine::flushAndWait() anywhere. I've added it now here:
But it just crashed again, same issue (5 times it ran fine, then one crash):
So the FEngine shouldn't be destroying the FRenderer if it's already been destroyed like you're showing here. So something is fishy.
I can't imagine you being wrong on this, but I think you might be (or there is a misunderstanding). The call stack is showing that FEngine is destroying the FRenderer because that is exactly what this line does: this->engine->destroy(this->filament_renderer);. And I just verified in the debugger that that line results in this call stack:
So this per se doesn't seem fishy to me? (But I'm probably missing something as I lack your vast experience with Filament.)
Oh! Okay, you're right, I did misunderstand. So what's happening is that destroying the renderer has the side effect of waiting (which is why we're always there when it crashes), and during that wait, the backend thread is crashing. I bet that if you put a sleep(10) instead of destroying the renderer you'd see the same crash (can you try?).
If I'm right, I'd say this looks like a use-after-free. But then, how do we detect that?
Can you confirm that this is definitely related to calling readPixels?
I've commented out the readPixels() line:
And it hasn't crashed in 100 runs so far. I do think it's related to the readPixels call.
I've put the readPixels() call back in, and added std::this_thread::sleep_for(std::chrono::seconds(10)); instead of the line that destroys the renderer (you did mean 10 seconds, not milliseconds, right?). I'll report back in a bit whether that still crashes - will take a while, as each run now obviously takes 10 seconds.
Yes I meant 10s.
I've created a similar test locally and I can't make it crash (but I'm on macOS).
your code looks okay as well.
It hasn't crashed so far with the 10 second wait. My guess would be that sleeping before destroying the renderer perhaps actually properly allows the rendering thread to finish the readPixels / rendering operations, whereas maybe the wait in the destroy call is not long enough or at that point some resources have already been freed? It's a bit odd though.
Also, an issue with the destruction would only explain half the crashes: it would not explain why it also crashes sometimes on the second frame when rendering multiple frames, where it equally looks like the backend thread is trying to access invalid memory, even though the main() loop is still running at that point and no destroys/destructors are being called.
Do you want to send me your test code? I can run that and see if it crashes for me.
It has just crashed with the 10-second wait as well - at about the same frequency (3 out of 30, i.e. roughly 1 in 10).
This is maybe a lot to ask, but would you be able to create a standalone sample (like in filament/samples) and upload it as a PR? Depending on what I find, I can turn it into a unit test. I'm staring at the code and I can't find anything obvious, so if I could repro locally, that would probably be enough to find the culprit.
Can you try something else: remove all calls to delete and free from the readPixels callback.
We know the problem is linked to readPixels, and readPixels's callback is "doing stuff"; it could be that it's doing something "bad" that frees memory it's not supposed to free, or something like that.
remove all calls to delete and free from the readPixels callback.
And it just crashed again, exactly the same error / callstack.
This is maybe a lot to ask, but would you be able to create a standalone sample (like in filament/samples) and upload it as a PR? Depending on what I find, I can turn it into a unit test. I'm staring at the code and I can't find anything obvious, so if I could repro locally, that would probably be enough to find the culprit.
Okay! I think this is the only thing we can do, if you can't see anything obvious wrong with the code.
Is it okay if it's a standalone sample, not depending on / using the FilamentApp scaffold?
This gave me another idea: I've removed everything from the callback, and I'll see if that crashes too.
Yes, this is great info. A sample without the FilamentApp is probably even better.
Okay! I'll work on this tomorrow, it's past 1am here now :-).
Thanks a lot for your help so far and for looking into it.
@pixelflinger Repro now pushed as a Filament sample in #9113 - and a few more observations in the PR description.
The bad news is that on macOS I can't reproduce it at all. Not even with the readPixels and no matter which backend I use. This is very strange.
@pixelflinger This may be a silly question, but could I ask whether you've run the example in a debugger or just on the command line? On the command line it does not actually produce a visible error/crash; it just exits with a non-zero exit code, and most shells don't show the exit code by default. Sorry, I should've communicated that more clearly. I've improved the sample now slightly and added a "Destructed everything successfully" message at the end of the Renderer destructor. If this is not printed to the terminal, it means there was a crash and a non-zero exit code. If you run it with lldb you should see the crash though (if it does reproduce on macOS), and the exception and the full call stacks etc.
We've done a few further tests:
- Ran the sample on a third computer (desktop PC, Ubuntu) that has only an NVIDIA GPU (and no Intel iGPU+NVIDIA, like my laptop does). It still crashed reliably.
- Ran the sample on a fourth computer (old desktop PC, Windows 10) that has only an Intel iGPU (no NVIDIA or other GPU). I didn't try the Vulkan backend on this machine but the OpenGL backend again crashed reliably in the same place.
Assuming you did run the repro in the debugger or checked the exit code and it definitely doesn't crash, it may indeed be that this doesn't repro on macOS. Given it has repro'ed on four different Linux and Windows machines now, do you have any way of testing this on a Linux or Windows machine?
Unfortunately still no dice on macOS even with the debugger. We'll look at it on linux, but it'll have to be next week.
Thanks. I should have said this earlier, but keep in mind that on Ubuntu it crashes even without calling readPixels():
bool should_draw = this->filament_renderer->beginFrame(this->swap_chain);
if (should_draw) {
    this->filament_renderer->render(this->view);
    // this->filament_renderer->readPixels(0, 0, 500, 500, std::move(pd));
    this->filament_renderer->endFrame(); // <<<<<<<< crashes after this line
}
Unfortunately still no dice on macOS even with the debugger. We'll look at it on linux, but it'll have to be next week.
Ok, thank you!
@patrikhuber, your example #9113 allocates data on the stack which is then passed to Filament to set as a buffer. This is not allowed, since we don't make a copy of the data. You should use the callback on the buffer descriptor to know when it's OK to deallocate any data on the heap.
This is the cause of the crash in #9113 (I did verify that putting these vars into the Renderer class fixes the crash).
But the readPixels race condition crash might still remain.
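In other words, anything handed to Filament in a BufferDescriptor must stay valid until the descriptor's callback fires. As a rough sketch (the attribute name, the counts, and the vertex_buffer/engine variables here are only illustrative, not taken from the repro):

// Sketch only: allocate the vertex data on the heap and free it in the
// BufferDescriptor callback, once Filament has consumed it.
float* positions = new float[vertex_count * 3];   // not a stack array or a local std::vector
// ... fill positions ...

vertex_buffer->setBufferAt(*engine, 0,
    filament::VertexBuffer::BufferDescriptor(
        positions, vertex_count * 3 * sizeof(float),
        [](void* buffer, size_t, void*) {
            delete[] static_cast<float*>(buffer);  // safe to release only here
        }));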
Hi @poweifeng,
Apologies for the delayed reply, it was a busy week with another deadline. Thanks a lot for having a look at the issue. Oh man, what a silly mistake! I looked at that part of the code and was even questioning the missing delete for the vertices/indices/tbn allocated with new, but I completely missed that the vertex colours are allocated with an std::vector. I immediately knew where it was when I got your reply. Thank you for spotting that, it feels so silly now.
To be honest I don't think there is a race condition with readPixels - this mistake explains all the crash conditions I've seen perfectly. Everything works now.
Could I ask two remaining questions please?
- Does engine->flushAndWait() guarantee to wait for the execution of the readPixels() callback? So is engine->flushAndWait() the correct way to wait for readPixels(), and if I write into a buffer in the PixelBufferDescriptor callback, is that buffer then guaranteed to be valid and filled after the flushAndWait()? The readPixels() docs say that one should use a Fence to use readPixels() in a blocking way, but I couldn't figure out how to use a Fence to do that.
- Is it a good & sufficient approach to delete the allocated vertex/index/tbn/colour data with just a BufferDescriptor callback and a delete, like so?
this->vertex_buffer->setBufferAt(*this->engine, 0,
    filament::VertexBuffer::BufferDescriptor(
        vertices_f,
        this->vertex_buffer->getVertexCount() * sizeof(vertices_f[0]),
        [](void* buffer, size_t size, void* user) { delete buffer; }));
The Filament sample apps use an additional unique-ptr style State wrapper type for this:
filament/libs/filamentapp/src/MeshAssimp.cpp
Lines 550 to 551 in 8f5f007
Is there a reason for this?
Thank you very much, and thanks a lot again for all your help looking into this.
@poweifeng If you have a moment, it would be much appreciated if you could resolve my final two questions. Other than that, I'm happy to close this issue. Is engine->flushAndWait() the correct way to wait for readPixels(), and why does the MeshAssimp.cpp sample use a State wrapper class over a simple delete buffer; lambda for the BufferDescriptors?
Thanks a lot again.
@poweifeng If you have a moment, it would be much appreciated if you could resolve my final two questions. Other than that, I'm happy to close this issue. Is engine->flushAndWait() the correct way to wait for readPixels(),
If you want to have a point in your code where you can be sure that the readPixels callback has been called, then flushAndWait() is the way to do that.
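Concretely, the sequence would be something like this sketch (error handling omitted; renderer, view, swap_chain, engine and the PixelBufferDescriptor pd are as in your earlier snippets):

// Sketch: once flushAndWait() returns, the PixelBufferDescriptor callback has
// already been invoked, so anything it produced (e.g. a copied image) is valid.
if (renderer->beginFrame(swap_chain)) {
    renderer->render(view);
    renderer->readPixels(0, 0, 500, 500, std::move(pd));
    renderer->endFrame();
}
engine->flushAndWait();   // the readPixels callback has run by the time this returns
// ... safe to use whatever the callback wrote here ...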
why does the MeshAssimp.cpp sample use a State wrapper class over a simple delete buffer; lambda for the BufferDescriptors?
I'm not the author of this part, but it is just convenient syntactic sugar. You could, of course, do it differently, but I find this implementation pretty easy to read.
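For illustration, a State-style wrapper in that spirit might look like the sketch below (this is not the exact MeshAssimp code; the type and variable names are made up for the example). A nice side effect is that the data is freed through a pointer of the correct type rather than through the raw void* handed to the callback:

// Sketch of a State-style wrapper similar in spirit to MeshAssimp's:
// the wrapper owns the typed container, and the static callback deletes the
// wrapper once Filament is done with the buffer.
template <typename T>
struct BufferState {
    std::vector<T> data;
    static void free(void*, size_t, void* user) {
        delete static_cast<BufferState*>(user);
    }
};

// Usage (assuming `engine` is a filament::Engine* and `vertex_positions`
// is a std::vector<float> built elsewhere):
auto* state = new BufferState<float>{std::move(vertex_positions)};
vertex_buffer->setBufferAt(*engine, 0,
    filament::VertexBuffer::BufferDescriptor(
        state->data.data(), state->data.size() * sizeof(float),
        &BufferState<float>::free, state));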
@poweifeng Thank you very much for the reply and for your help earlier!
There definitely doesn't look to be an issue with Filament, so this issue can be closed.
Thank you both for your help @poweifeng @pixelflinger.