Chronial/foo_chronflow

Crash on Startup on Vista / fb2k 1.4.4 / ChronFlow 0.5.0

Opened this issue · 16 comments

See also my comment on the hydrogenaudio forum

Unfortunately on my Vista installation of fb2k 1.4.4, chronflow 0.5.0, fb2k crashes as soon as it starts (the UI comes up for a moment, then closes and the crash report window appears). The dumps point to ChronFlow being involved. I have a ChronFlow panel in my layout. When I remove ChronFlow 0.5.0 using the safe mode, fb2k starts just fine. I also tried removing the foo_chronflow.dll.cfg file from my profile, but it didn't fix the error.

I've since verified that the old version 0.4.5 works with fb2k 1.4.4 (so removing 0.5.0 and reinstalling 0.4.5 allows fb2k to start normally and ChronFlow works as usual), and it used to work with 1.4.1, from which I upgraded today. So it must be a regression in 0.5.0.

I'm happy to try a fix, but I'm not set up to build it myself, so I'd need binaries / an fb2k component archive.

failure_00000002.txt

failure_00000002.dmp.txt

Cheers,
Flexx

That is a really odd crash. In case you're interested: The crash log points to an OpenGL helper library I'm using, more specifically a part that handles something about monitors.

So my first question would be whether you happen to have an unusual monitor setup. Next, did you insert multiple foo_chronflow panels into your UI, or only one?

My strongest guess is that this is a bad interaction with some other component. But the only good candidate I found in your list is foo_vis_shpeck – could you try disabling that and checking if that has any effect?

And last but not least, here's a special build for you: foo_chronflow.zip. This shouldn't crash, but you should get an error instead. What does it say?

Hmm... looks interesting. My monitor setup is a pretty standard dual head setup, fb2k runs on the secondary monitor, so in that array the monitor fb2k renders on would come second at i == 1.

So let's try the naive stuff first: Obviously, the first thing I'll try is weather moving it to the primary monitor changes the behaviour:

  1. Check reproducibility by installing 0.5.0 again - yup, happens.
  2. Deinstall ChronFlow in safe mode
  3. Move window to primary monitor, and check it comes up in the same position upon restarting.
  4. Close fb2k, install 0.4.5 via the fb2k-component ZIP, confirm, apply, restart, no crash, works fine.
  5. Update to 0.5.0 via the preferences, apply and consent to restart.
  6. same issue

So it's not dependent on which monitor the window opens up on. So far so bad. :|

  1. Recycle to working config, window on primary monitor, with 0.4.5 working reliably.
  2. Disable the second monitor in the configuration, i. e. don't expand the desktop to it.
  3. Update to 0.5.0.
  4. It crashes again.

That was a even a bit surprising. I hoped that since the window position is irrelevant, that at least the number of actual monitors would change things, like not causing the array index to go out of bounds, but it seems that's either not the problem, or it's caused by the first monitor, and doesn't really need the help of another being there. Now windows counts monitors 1-based, so _glfw.monitors[0] would be monitor #1. But oh well, that's how it's been all along.

  1. So I am back to 2 monitors, with 0.4.5 working.

Upon closer inspection though, the problem is not the array index but rather the _GLFWmonitor* pointer being taken from that array. If that points where it shouldn't, it would explain the access violation seen in the dump. So the trap is probably set up where _gflw.monitors gets initialized with a bogus pointer, the dereference is really just the trap snapping at the touch.

  1. So let's try to remove Shpeck, still with ChronFlow 0.4.5 installed. Works.
  2. Upgrade to 0.5.0 again, aaand: Bummer. Crashes like before :(
  3. Reinstall Speck and 0.4.5, works again.

Finally,

  1. I installed the version from the ZIP archive you provided and it mumbles:
    foo_chronflow error:
    Engine died with exception:
    C:\home\coding\foobar2000\foo_chronflow\TextDisplay.cpp(174)\foo_chronflow.dll!677EAB1F: (caller: 677E9DBD) Exception(1) tid(1b84) 88982F80

Hmm... That's not really making a light bulb appear over my head. ;)

Are you using GFLW only starting with 0.5.0? If so, this is likely a bug stemming from it, as I said the pointer dereference in line 79 of the file you linked will cause an access violation if the pointer is incorrect. I doubt that you can do much about it, GFLW will likely self-initialize without much you could do here. What's odd is that the error seems to occur in the void terminate(void) method, which is commented with // Terminate the library. Could it be that this termination is called twice, so the memory has been freed and then upon a second call another dereference of the same pointer will cause the access violation since the memory doesn't belong to the process any longer (as it's been freed before)?

Do you ever call this method directly? Can you trace if you call it indirectly and if so, how often?

Anyway I doubt that it's a problem involving some other module unless you are only now using GFLW, possibly now causing dual initialization and/or termination and/or unless it's related to something else that's new in 0.5.0, since with 0.4.5 the problem doesn't occur with all other modules being the same. I'll back up my configuration and try what happens if I run a plain fb2k with only ChronFlow added and will report what I see.

Thanks for your quick responses BTW, much appreciated. The work week has started, so I hope I can keep up with you testing stuff... ;)

So deinstalling everything but the bundled plug-ins and ChronFlow didn't change the behaviour, 0.4.5 works, 0.5.0 crashes, the snapshot you provided prints a similar error. I'll attach new dumps in case that helps due to less clutter from other loaded modules.

failure_00000013.txt

failure_00000013.dmp.txt

This looks suspicious: EngineWindow.cpp#L30. Can the count appear to be equal to 0 to different threads or EngineWindow instances or the like causing gflwTerminate() to be called repeatedly, causing the pointers to be dereferenced into freed memory on the second invocation? My C/C++ times are long ago, so I am sorry, I cannot quite pinpoint the issue, but using invocation counts to make something a singleton could be a problem, especially when the code is invoked concurrently. Also is the GLFW engine supposed to be a singleton throughout a process? In that case I'd expect there to be some factory that either returns the global instance, initializing it only when required. But again, I am just shooting from the hip here, I don't know the fb2k SDK, nor GLFW, nor did I really study your code enough.

Looks like you're properly calling the factory, but I don't really see a need for the invocation counter, since glfwInit may safely be called repeatedly. However there are some constraints on glfwTerminate, such as not calling it from callbacks, only on the main thread, and it only being called when all contexts in all other threads are no longer current. In other words: Call this once, during process termination on the main thread. And, the method is called automatically if glfwInit returns GLFW_FALSE, which may be the case here, I assume.

I think you should check for initialization errors using glfwInit() == GLFW_FALSE instead of !glfwInit(). And you need to consider that if any invocation of glfwInit() returns GLFW_FALSE, that glfwTerminate will already have been called, and thus it need not, maybe even must not be called again. Thus, before throwing an exception cased by glfwInit() returning GLFW_FALSE the count would actually need to be decremented, or simply reset to 0.

Even better, I think would be to simply call glfwInit() from the constructor unconditionally, as it will simply return GLFW_TRUE anyway if already initialized, and then call glfwTerminate in a shutdown hook from fb2k if the last call to glfwInit() returned GLFW_TRUE. Since, once properly initialized, subsequent glfwInit calls continue to return GLFW_TRUE, whereas the case where GLFW cannot (ever) be initialized, it will persistently return GLFW_FALSE (thus remembering the result of the last invocation will suffice) and invoke glfwTerminate automatically, thus making it superfluous (if not dangerous) to be invoked in a shutdown hook...

In that case the access violation might be prevented (if I am right about multiple invocations causing it in the first place), but GLFW may not initialize either. But that's another story. 0.4.5 works, so I assume there is a way GL works on my dated system. :)

Next, did you insert multiple foo_chronflow panels into your UI, or only one?

Almost oversaw that question... I've only got one ChronFlow panel in my layout.

Are you using GFLW only starting with 0.5.0?

No, but I'm using a newer version and linking it statically instead of using a dll.

Could it be that this termination is called twice, so the memory has been freed and then upon a second call another dereference of the same pointer will cause the access violation since the memory doesn't belong to the process any longer (as it's been freed before)?

Not really. As you saw this is only called on window destruction, which should only happen once since a window is also only created once. But it should also be fine to call the function repeatedly. For the monitors this is easy to see: terminate() sets monitorCount to 0, so the monitor loop won't even be entered.

This looks suspicious: EngineWindow.cpp#L30. Can the count appear to be equal to 0 to different threads or EngineWindow instances or the like causing gflwTerminate() to be called repeatedly, causing the pointers to be dereferenced into freed memory on the second invocation?

No, enginewindows are only ever created and destroyed in the mainthread.

I'm still really unsure as to why glfwTerminate could fail. But it's not actually necessary to call it at all – the build I gave you simply has that call removed :).

I installed the version from the ZIP archive you provided and it mumbles:
foo_chronflow error:
Engine died with exception:
C:\home\coding\foobar2000\foo_chronflow\TextDisplay.cpp(174)\foo_chronflow.dll!677EAB1F: (caller: 677E9DBD) Exception(1) tid(1b84) 88982F80

Hmm... That's not really making a light bulb appear over my head. ;)

I also got confused by the many numbers at first, but it actually points to the line that is failing:

THROW_IF_FAILED(d2Factory->CreateWicBitmapRenderTarget(
bitmap.get(), D2D1::RenderTargetProperties(), &renderTarget));

And it turns out that CreateWicBitmapRenderTarget is not supported on Vista per default. But it is supported if you install the platform update for windows vista. You might want to try installing that and see if this fixes the issue.

Hi Chronial, thanks for the analysis. I've got the platform update installed (not just now, but tried a reinstall), and it won't solve the issue. But I don't think you have a chance to do much about it. I'll keep using 0.4.5 for the time being, and it's time to get off Vista anyway in the near future. It won't be Windows 10 and thus Microsoft breaking my install, or deleting my files, so I'll probably be moving on to Linux for my main workstation. Too many things won't work on Vista any more, like the new Picard 2.0.4. (It crashes on startup, too.) You might want to officially drop support for Vista, though, pointing to this issue, in case there's another guy like me still using it. ;)

I'm still really unsure as to why glfwTerminate could fail.

As I said earlier, there is a race condition if two threads call it concurrently, since in the code

free(_glfw.monitors);
    _glfw.monitors = NULL;
_glfw.monitorCount = 0;

the memory allocated to _glfw.monitors is freed before the reference is set to null and the count is set to zero, so at least theoretically, if another invocation occurs and reaches the loop between the free call and the counter being zeroed, an illegal access can occur. A terminate call happens implicitly if the initialization fails, and then calling it again might cause the issue.

I think, in any event, they should restructure their code to first set _gflw.monitorCount = 0 and then free(_glfw.moitors) and then set the reference to NULL, eventually guarding / synchronizing the latter two against concurrent invocations. Or doing that for the entire glfwTerminate method.

I also didn't study the GLFW code enough to see if perhaps _gflw.monitors might become non-zero again at some other point, if something (maybe some other plug-in or one of it's constructors of some of it's classes) implicitly sets these again. But anyway, as long as you are on the main thread as you say, and no one else messes with it on some other thread, and you check to call glfwTerminate only if gflwInit returned GLFW_TRUE, this seems to be more of an upstream bug than your problem.

The one thing I am curious, and I think you should be, too, is whether the crash/illegal access could be triggered in the same way by other failures than mine. _glfw.init throws because of my incompatible OS, but any other exception within _glfw.init might cause the same problem on other machines. What sucks (as these are then really hard to track down) is if it's actually timing dependent... That it reliably crashes on my machine suggests otherwise though, it does seem to be running through the same code paths stably, but you can't be sure.

Generally, as a user, I would very much prefer a starting application with the error logged, over a crash-on-startup, so unless you decide to leave out the gflwTerminate call in your next release entirely, you should ensure it cannot cause the access violation from occuring. Personally, BTW, I'd like to see the exception also appear in the fb2k console, not (just) in your window, since you might not even get the window up and running, or the text to be output (albeit less likely to fail). I can be your guinea pig if you like me to try other implementations...

What I'd suggest is that you inspect the whole module for runtime exceptions being thrown explicitly by your own code, and ensuring that they are caught at the latest in your entry point (i. e. the method lowest on the stack that is under your control) and then terminating / disabling / or simply "neutering" your plug-in by logging to the console, that you couldn't initialize and doing nothing further. You must not allow your plug-in to take out the whole process by allowing an uncaught exception to escape further up on the stack beyond your control. The same applies for any external routines you call that may cause runtime exceptions. It's your responsibility to catch and log these errors, then terminate / deactivate gracefully, and to not let them escape beyond your control. I am not sure if access_violation is catchable though? My C++ times are to long gone. ;)

BTW, reading through this, and the link it points to in the 2018-12-31 comment, the actual reason for the access violation might even be the Nvidia driver version I run, which is "the latest" but as outdated as my OS is...

You might want to officially drop support for Vista, though, pointing to this issue, in case there's another guy like me still using it. ;)

I am not 100% convinced that the plugin can't work on Vista yet – I'll wait for a second confirmation.

As I said earlier, there is a race condition if two threads call it concurrently,

I am aware of that, but the only thread this function is ever called from is the main thread.

The one thing I am curious, and I think you should be, too, is whether the crash/illegal access could be triggered in the same way by other failures than mine. _glfw.init throws because of my incompatible OS,

glwf.init actually seems works fine. It is a completely different part of the code that fails for you. There seem to be two separate issues here – one is kind of expected because of your OS (the initial failure), the other is not (the crash within glfw).

Generally, as a user, I would very much prefer a starting application with the error logged, over a crash-on-startup, so unless you decide to leave out the gflwTerminate call in your next release entirely, you should ensure it cannot cause the access violation from occuring.

I can not easily prevent that. The only way to prevent it is to actually go bughunting within gflw which is hard for me, as I can't reproduce the error. I tried running the plugin in a Vista VM, but it just reports the lack of a proper GPU, as it should – without any crasing.

Personally, BTW, I'd like to see the exception also appear in the fb2k console, not (just) in your window, since you might not even get the window up and running, or the text to be output (albeit less likely to fail). I can be your guinea pig if you like me to try other implementations...

If you can't open basic win32 windows, then you have way bigger problems than my component :). But logging to the console is still a good idea – that makes it easier to copy the errors.

What I'd suggest is that you inspect the whole module for runtime exceptions being thrown explicitly by your own code, and ensuring that they are caught at the latest in your entry point (i. e. the method lowest on the stack that is under your control) and then terminating / disabling / or simply "neutering" your plug-in by logging to the console, that you couldn't initialize and doing nothing further. You must not allow your plug-in to take out the whole process by allowing an uncaught exception to escape further up on the stack beyond your control. The same applies for any external routines you call that may cause runtime exceptions.

That is already the case.

It's your responsibility to catch and log these errors, then terminate / deactivate gracefully, and to not let them escape beyond your control. I am not sure if access_violation is catchable though? My C++ times are to long gone. ;)

Access violations are not trivially catchable. I think there might be a way to do that, but that is also not a good idea, because most of all they are not recoverable. An access violation tells you that the process is in an invalid state and the only sensible course of action is to terminate the process – which foobar is already doing.

Thanks a lot for your help with this. If you are willing to test one more version, you could please give this a try: foo_chronflow.zip. I removed the glfwTerminate, so it shouldn't crash for you and you should get a more human-friendly error message :).

Oh, and could you please run dxdiag and see what it says about DirectDraw on the Display Tab?

I managed to get Vista running in my VM, and I think I found the issue with Vista. This version works for me on Vista: foo_chronflow.zip. Care to give it try?

Yup, looks good! Congratulations!

Just looking at your commit... Oh really!? LOL. I wonder if mankind will ever agree on which side to drive our cars on and where to put the green subpixel and such. ;)

// Use GBR for Vista support

Did you mean to write "BGR" there?

Sorry I didn't have time to check it early, I had a couple of buy workweeks and didn't even start my PC at all...

Regarding the GLFW crash on the terminate call, I guess that's now simply circumvented, of course. Which is nice for you and me, still I wonder if we should let upstream know about it, and whether there is some smaller snippet of code that might reproduce the problem. Since you're right: of course, an Access Violation is a bug in their code in any event when they're on the top of the stack when it occurs, and yup, I remember now that they told us that they can't / shouldn't be caught...

Yup, looks good! Congratulations!

Nice to hear :).

Did you mean to write "BGR" there?

I did, thx.

Regarding the GLFW crash on the terminate call, I guess that's now simply circumvented, of course. Which is nice for you and me, still I wonder if we should let upstream know about it, and whether there is some smaller snippet of code that might reproduce the problem.

I can't really do anything about this, as I can't reproduce the problem on my end. If you want to help me investigate further, you could give this version a try:
foo_chronflow.zip