cloudfoundry/hwc

Stack overflow crashes from hwc.exe

amhuber opened this issue ยท 33 comments

We're experiencing random but fairly frequent (5-10 a day) crashes in the hwc.exe file. I've capture some of the details from the Windows event logs at https://gist.github.com/amhuber/d4c59ca3e42d19b7b44fe9288ea9cae8. These are apps that have been running for some time using the binary_buildpack and are seeing the crashes after moving to hwc_buildpack. From the CF side the errors look like:

App instance exited with guid efefaa8f-1b71-4654-9e5c-2186845cae0f payload: {"instance"=>"402b9783-eb18-46d4-542d-8f1c660ad285", "index"=>2, "reason"=>"CRASHED", "exit_description"=>"2 error(s) occurred:\n\n* Exited with status -1073741571\n* cancelled", "crash_count"=>1, "crash_timestamp"=>1496251511896170785, "version"=>"c127b9a6-8658-43db-afa2-2c8211cea06a"}

Any ideas on what we can do to help troubleshoot the overflow errors?

Aaron

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/146408413

The labels on this github issue will be updated when the story is started.

Yeah, this error message is quite terrible, and it's a known issue. This occurs when the process reaches the memory limit on the container. Can you first try to cf push the app with more memory?

We checked that already, the memory quota is set to 2G and the container averages around 500MB in memory. According to the CF metrics the size about 7 seconds before the instance crashed was:

cpu=1.6666824668164522, memory=498319360, disk=91588608

That works out to 475MB, so nowhere near the limit.

Aaron

@amhuber If you update to the latest .NET 4.x version do you still see this happen? Looks like something is happening in clr.dll from the Event Log dump.

Some of the crashes are in kernel32 as well. I think we have .NET 4.5.1 on the servers right now, what version are you testing with?

@amhuber We run everything on the latest Windows with all recommended updates and the latest .NET 4.x that comes in Windows updates.

To clarify, are the crashes happening on pushes or longer running processes?

I had to go check, we're actually running 4.6.1 on the servers but otherwise fully patched via Windows Update. Unless you install a newer version manually I think you'll stick at 4.5.1 on Windows 2012 R2 which is probably what you have.

Anyway, we think perhaps at the time the app was crashing the quota might actually have been 512MB which means it probably was the out of memory error. We'll make sure the quota is higher and monitor over the next day or two.

I'll close this as soon as we've confirmed and my apologies in advance if this turned out to be just that. Either way, a better error would be helpful. :-)

@amhuber Yeah, we explicitly install .NET 4.6.x as well on our BOSH deployed cells.

Good to know though, hopefully that was actually the issue!

It would be interesting to see how other runtimes handle this and if we can detect this nicely from hwc.exe which is in Go. If its an OOM error, the actual HWC .NET this code calls is trying to allocate memory is throwing it. The error message should really be coming from the Containerizer, telling why the JobObject was killed.

A while back we did add more logging to garden-windows to try to get messages out when quotas were hit.

I guess, what is your version of garden-windows? (Also definitely consider moving to the bosh release)

We're still on the MSI packages while we work on our Windows stemcells. Good to hear that better error messages may already be in place.

We just confirmed we're still seeing the same crashes even with the memory limit set to 2G and the app only consuming around 500MB. We'll continue troubleshooting and see if we can get a memory dump of the crashing processes.

@amhuber any troubleshooting updates?

We are still able to reproduce the issue, and in addition we have one other case where an ASP.NET app that works under WebAppServer crashes with hwc.exe. In the second case hwc.exe crashes the first time the app is hit.

We're still trying to debug to see if we can get any useful info. We've attached WinDbg to the processes and are looking at the analysis. In both cases Windows is saying that there is an access violation when something is trying to access memory that it shouldn't.

Is there anything that would be useful to help you determine where the issue is? We can send a memory dump of the crashes but I don't know if that helps determine where the issue is.

Adding @jmprice who is working on this as well.

@amhuber yes, the memory dumps may be useful. Could you send one?

Yes, we're working on getting a full memory dump for both issues. Any preferences for how to get the large files to you?

@amhuber anywhere that we can publicly access the memory dumps. We use S3 when sharing the files.

sneal commented

@awmartin I've been able to create a reproducible case that demonstrates the difference in behavior between hwc.exe and webappserver.exe using a newish PCF environment (PEZ) and a Microsoft sample app. The stackoverflow exception that happens under hwc happens immediately and always.

There are two branches so you can easily checkout the hwc vs the webappserver scenarios:

Pushing the master branch it uses the hwc buildpack and causes the application to fail on first page load. A 502 gateway error is returned to the browser and a StackoverflowException is seen in the PCF logs. Bumping up the memory for the app which will sometimes resolve SO exceptions had no effect. I tried 4G, 8G, and finally 14G.

Pushing the webappserver branch it uses the binary buildpack and causes the application to fail on first page load. Instead of a gateway error, the browser is redirected to the error page however another error is generated and the browser is stuck in a continuous redirect loop until the browser catches it. The PCF logs are much more interesting in this scenario, here they contain a proper SQL connection exception.

The app logs for both branches are located in this gist.

sneal commented

@sunjayBhatia @awmartin I have an easy repro that works locally outside Cloud Foundry.

  1. Clone the wingtips repo and build with VisualStudio 2017
  2. Delete all the files in the App_Data folder
  3. Copy the hwc.exe to the root of the wingtips website dir on your dev workstation
  4. Set a PORT env var and start hwc.exe
  5. Load the website hosted in hwc ๐Ÿ’ฅ

Here's the output:

PS C:\Users\vagrant\Documents\src\WingtipToys\WingtipToys> $env:PORT=8040
PS C:\Users\vagrant\Documents\src\WingtipToys\WingtipToys> .\hwc.exe
Server Started for af4957af-2e7e-4508-b8c3-ec2225b042c8

Process is terminated due to StackOverflowException.

Running the app using the same steps but with the WebAppServer.exe works just fine, here's that output:

PS C:\Users\vagrant\Documents\src\WingtipToys\WingtipToys> .\WebAppServer.exe
The use of WebAppServer.exe is deprecated. Please use the hwc-buildpack instead.
PORT == 8040
2017-07-14 00:37:07Z|INFO|Port:8040
2017-07-14 00:37:07Z|INFO|Webroot:C:\Users\vagrant\Documents\src\WingtipToys\WingtipToys
2017-07-14 00:37:07Z|INFO|Starting web server instance...
Server Started.... press CTRL + C to stop
Server shutting down, please wait...

Running the app via IISExpress also works just fine.

@sneal awesome! We will take a look at that ๐Ÿ‘๐Ÿฝ

sneal commented

@sunjayBhatia I found the culprit, the stack size of the golang compiled hwc.exe has a stack size of 128k, the old WebAppServer.exe has a stack size of 1024k.

You can clearly see the difference by running dumpbin /headers <exe> on each executable. You can also bump up the stack space of the hwc.exe by running editbin /STACK:1048576 hwc.exe, with this change the hwc.exe no longer crashes and the web app starts up properly.

PR forthcoming.

@mhoran was right! Thanks @sneal that's awesome

sneal commented

@sunjayBhatia See the linked golang issue, but it looks like there's no easy fix for this. It looks like our current set of options are:

  1. Run a post build step with editbin, but that requires VS tools and Windows.
  2. Use CGO so the linker uses a larger stack size, but that would require the MinGW-w64 compiler.
  3. Revert back to using the WebAppServer.exe.

Thoughts?

@sneal Thanks for doing all this research. We're going to go with option 2. It doesn't seem too difficult to install the MinGW compiler and cross-compile it on linux.

sneal commented

@mdelillo Excellent. If that doesn't pan out there's some open source PE file editors available, like this MIT licensed cross platform Python module.

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/148938569

The labels on this github issue will be updated when the story is started.

^ the original issue is still valid, this second one is an error, please use the first tracker issue to track this story

It appears using CGO on its own will not work. The compilation works fine, however the resulting binary cannot run HWC correctly. We suspect this is due to the stack commit size being too large, causing us to hit this bug: https://blogs.msdn.microsoft.com/dsvc/2012/12/30/unexpected-out-of-memory-exception-in-net-4-5-applications-that-uses-custom-stack-commit-size/

It also looks like CGO hard codes the values for reserved / commited stack size and so we can't easily override them

sneal commented

@sesmith177 Is this fixed? Your comment makes me think it's not.

@sneal Shortly after making that comment, we discovered that setting GO_EXTLINK_ENABLED=1 would cause cgo to use the linker that comes with mingw as opposed to go's linker. By using this linker, the stack size is set appropriately:

PS C:\Users\Administrator> dumpbin /headers Z:\workspace\hwc-rel.exe | findstr stack
          200000 size of stack reserve
            1000 size of stack commit

The recently released 4.0.0 should work correctly

Here's the CL for the cgo change.

https://go-review.googlesource.com/c/49331/

Once it's in, we should try and rebuild from that change, to see if it solves the problem.

@mxplusb I think that CL might run into .NET OutOfMemoryException bug linked above, as the stack commit size is still large (0x00200000 - 0x2000)

Yeah, that's what I am thinking as well, but it's directly related to the challenges we're seeing, so it would be good to test against.

The develop branch of the hwc-buildpack now has the fixed version of HWC: cloudfoundry/hwc-buildpack@2069172