dotnet/diagnostics

dotnet-stack hangs when trying to get the stack frames of a stuck process

jhudsoncedaron opened this issue · 3 comments

If I had enough information to file this as a bug report, I would. It feels very much like a bug, but it might be a bug in the runtime or somewhere else. Either way, this behavior is very bad and very unexpected.

Background:

We have a network listener process that has been getting stuck every week or so. The process runs on our server and receives (encrypted) data from a process on the customer's server. Our own internal status check on the stuck process also gets stuck, and the symptoms make no sense from the application code's perspective. (Thankfully this process doesn't use async code, so the stack traces ought to make sense.)

So I said OK, let's get a stack trace next time. We looked up how to do this, found dotnet-stack, copied the standalone binary (from https://aka.ms/dotnet-stack/win-x64, downloaded a week and a half ago) to the server (a Server Core machine), and waited for the process to get stuck again.

It got stuck, as expected. I then ran dotnet-stack report --process-id 4860, and dotnet-stack itself got stuck. It got stuck so badly that ^C didn't return me to the command prompt. I tried a second time, running dotnet-stack report --process-id 4860 > stack.txt and leaving it running with the Remote Desktop window in the background. After waiting at least 14 minutes, I found it was still stuck, but this time ^C did return me to the command prompt. As expected, the output file was empty.

The target process is an x64 .NET 8 process; its working memory was 63 MB.

We have a full memory dump of the process; the managed runtime is deadlocked.

Summary:

It's possible for dotnet-stack to get stuck trying to dump stack from a stuck process. This seems like it should not occur.

Environment:

Windows Server Core: probably Server Core 2022, but might be 2019
Hosting Environment: Azure (Central)
dotnet-stack: win-x64 standalone binary
target process: .NET 8 win-x64 process; shipped framework-included (dotnet publish -r win-x64)

Reproducibility:

At this rate I get one attempt a week.

Stuck-ness does not appear to be data-related. On restarting the process it recovers where it left off, successfully processing the very message it hung up in the middle of.

It's possible for dotnet-stack to get stuck trying to dump stack from a stuck process. This seems like it should not occur.

I think you talked about two different kinds of 'stuck':

  1. When you first ran dotnet-stack, it sounded like you were waiting for it to print a stack trace to the console and it wasn't doing so. dotnet-stack is a cooperative tool: it sends the .NET runtime a message over a named pipe and then waits to receive a reply. A dedicated thread inside the runtime is expected to process and reply to these messages, but if the process was in a sufficiently bad state then dotnet-stack might never get a reply. So, depending on the state of the process, this part may not be a bug, just a consequence of the tool being cooperative rather than preemptive. If you want something that can get the state of the process more reliably, even when the runtime's private message-reply thread is blocked, a debugger is a good choice.

  2. When dotnet-stack didn't print anything for a while, you pressed Ctrl-C, which you said also wasn't responding. I simulated a non-responsive target process, and I think I have reproduced the part where Ctrl-C doesn't abort. I'm investigating a fix for that.
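Both kinds of "stuck" can be modeled with a minimal Python sketch. The first is a cooperative request/reply where the responder never answers. The second is a client whose wait should stay bounded and cancellable.

This is not dotnet-stack's code and does not implement the .NET diagnostics IPC protocol. A local socket pair stands in for the named pipe. The names fake_target, request_stack, and demo are hypothetical. The client waits for the reply in short slices, so a deadline or a cancellation flag (for example, one set by a Ctrl-C handler) can end the wait even when the target never replies.

```python
import socket
import threading
import time
from typing import Optional


def fake_target(conn: socket.socket, stuck: bool) -> None:
    """Hypothetical stand-in for the runtime's diagnostic reply thread.

    Reads one request and replies only if the process is healthy.
    When `stuck` is True it returns without replying; the caller keeps
    the socket open, so the client sees neither data nor EOF.
    """
    request = conn.recv(1024)
    if not stuck:
        conn.sendall(b"stack for: " + request)


def request_stack(conn: socket.socket, timeout: float,
                  cancel: threading.Event,
                  poll: float = 0.1) -> Optional[bytes]:
    """Send a request and wait for the reply in short time slices.

    Returns the reply, or None if the deadline passes or `cancel` is
    set. Polling in slices (instead of one unbounded recv) lets the
    client notice cancellation and give up on an unresponsive target.
    """
    conn.sendall(b"thread-stacks")
    conn.settimeout(poll)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if cancel.is_set():
            return None  # e.g. the user pressed Ctrl-C
        try:
            return conn.recv(1024)
        except socket.timeout:
            continue  # no reply yet; re-check cancel and deadline
    return None  # target never answered in time


def demo(stuck: bool, timeout: float = 0.5) -> Optional[bytes]:
    """Run one request against a healthy or stuck fake target."""
    client, target = socket.socketpair()
    worker = threading.Thread(target=fake_target,
                              args=(target, stuck), daemon=True)
    worker.start()
    try:
        return request_stack(client, timeout, threading.Event())
    finally:
        worker.join()
        client.close()
        target.close()


if __name__ == "__main__":
    print(demo(stuck=False))  # b'stack for: thread-stacks'
    print(demo(stuck=True))   # None, after about 0.5 s
```

With stuck=True, the fake target reads the request but never replies, so request_stack returns None once the deadline passes instead of hanging. A flag set by a Ctrl-C handler would end the wait the same way. Bounding the wait does not make a deadlocked target answer; it only keeps the client responsive.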

"If the process was in a sufficiently bad state then dotnet-stack might never get a reply": it turns out you are correct. The process in question was deadlocked in GC (which is a separate discussion thread).

At least the unresponsive ^C can be fixed.


Yep, the PR I just submitted should handle that part of the issue. Thanks for letting us know!