microsoft/ProcDump-for-Linux

Dump creation on exception fails for .NET-based systemd service running under a dedicated account

i-to opened this issue · 13 comments

i-to commented

Expected behavior

Dump collected when OutOfMemoryException is thrown.

Actual behavior

[10:10:09 - DEBUG]: WaitForProfilerCompletion: Received status F in /__w/1/s/ProcDump-for-Linux/src/Monitor.cpp, at line 1928
[10:10:09 - DEBUG]: WaitForProfilerCompletion: Received dump length 0 in /__w/1/s/ProcDump-for-Linux/src/Monitor.cpp, at line 1932
[10:10:09 - DEBUG]: WaitForProfilerCompletion: Received dump path  in /__w/1/s/ProcDump-for-Linux/src/Monitor.cpp, at line 1959
[10:10:09 - ERROR]: Exception monitoring failed.

Steps to reproduce the behavior

Target process runs as a systemd service under a dedicated account, and I run the following command under my normal account:

sudo procdump -e -n 1 -f "OutOfMemoryException" -log stdout 5119 ~/dump

System information (e.g., distro, kernel version, etc.)

  • Ubuntu 22.04 on a virtual machine, no containers
  • .NET 8.0.7 from Microsoft repository
  • ProcDump 3.3.0

Additional information

Dumping works as expected on the same machine when I use a small test application, running it either as a normal process (without sudo), or as a systemd service under my user account.

Here is the relevant excerpt from journalctl log for the failing case:

Aug 02 10:10:09 <machine-name> <target-service-name>[12087]: [createdump] Target process is alive
Aug 02 10:10:09 <machine-name> <target-service-name>[5119]: [createdump] The process or container does not have permissions or access: open(/proc/5119/mem) FAILED Permission denied (13)
Aug 02 10:10:09 <machine-name> <target-service-name>[5119]: [createdump] Failure took 1ms

Looks like ProcDump loads one of its components into the target process and then tries to invoke createdump directly from there. AFAIK, createdump must be run with root privileges, so I'm wondering why it works in other cases and only fails in this one.

Also, perhaps the diagnostics could be improved: exception monitoring itself seems to work, and it is the subsequent dump creation that fails.

Please let me know if I can help by providing any additional information.

Thanks for reporting this. I'm working on some other issues at the moment, and it will be a bit before I can investigate. One thing to keep in mind is that it's super hard to guarantee that activities like creating a dump have the memory they need to succeed when the system is out of memory. Errors related to this scenario can sometimes surface as seemingly random errors. In the exact same scenario, running under the same users, does it succeed in creating a dump of the process while memory consumption is high but not yet at an out-of-memory condition?

i-to commented

Thanks for the hint. However, none of the processes in this report is in a real out-of-memory condition. For now, I have just inserted code that periodically throws and catches OutOfMemoryException in order to troubleshoot the dumping process.

Could you send the following procdump log - /var/tmp/procdumpprofiler.log?

i-to commented

I tried a few times, but this log is only written upon successful execution on a test application, and here's an example: procdumpprofiler.log.

In the problematic case, when the dump fails, nothing is written to this log. Perhaps it is not flushed on the error path?

i-to commented

I looked a bit into the logging problem. Unfortunately, I couldn't find a proper fix quickly, but as a workaround I redirected the logs to stdout and collected them through the systemd journal. Here's the log for the failing case: journalctl.log. Hope that helps.

Thanks, that helps. It looks like procdump (in the context of the target application) intercepts the exception just fine. For some reason though, when invoking the .NET runtime's createdump, createdump fails with an access-denied error when trying to access the target process (which happens to be itself). Just to be clear: in the repro steps, which user does the target process run under as a systemd service?

If you are able to (you can use Sysmon for Linux for that or other tools), can you find out which user createdump is started as? It should match the user of the target process.

i-to commented

Yes, I confirmed that createdump is started under the same user that runs the service. In this case it is a system account, specifically created to run the service. In particular, it doesn't own many of its /proc filesystem entries (they are owned by root), which explains the problem:

~$ ls -l /proc/5119/mem
-rw------- 1 root root 0 Aug  1 22:19 /proc/5119/mem
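The same check can be scripted. The following is a minimal Python sketch, not part of ProcDump; `proc_mem_status` and its `proc_root` parameter are hypothetical names introduced only for this illustration. It reports who owns `/proc/<pid>/mem` and whether the calling user can open it, which is the open() that createdump reported as failing.

```python
import os
import stat


def proc_mem_status(pid, proc_root="/proc"):
    """Report the owner of <proc_root>/<pid>/mem and whether we can open it.

    proc_root is a hypothetical parameter; it exists only so the function
    can be exercised against a fake directory tree.
    """
    path = os.path.join(proc_root, str(pid), "mem")
    info = os.stat(path)
    status = {
        "path": path,
        "owner_uid": info.st_uid,
        "mode": stat.filemode(info.st_mode),
        "readable": True,
        "error": None,
    }
    try:
        # Opening the file is the step that failed with "Permission denied"
        # in the journal excerpt.
        os.close(os.open(path, os.O_RDONLY))
    except OSError as exc:
        status["readable"] = False
        status["error"] = exc.strerror
    return status


if __name__ == "__main__":
    import sys

    # "self" resolves to the calling process under /proc.
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    print(proc_mem_status(pid))
```

For a meaningful result the script has to run as the same account as the service. Even then it is only an approximation when pointed at another process, because the kernel applies additional ptrace-style access checks on top of the file ownership.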

Actually, I'm a bit surprised that createdump is called from the debuggee process and not from ProcDump itself. It is even stated here that createdump needs to be run as root, and there's a link to a documentation task, which is closed, but I still cannot find this in the documentation, so I'm a bit lost...

Ah, got it. I'm glad you figured out the root cause. In terms of createdump being invoked from the target process, the .NET runtime exposes a diagnostics pipe that supports a set of commands (one of which is to create a dump). Clients simply invoke that command on the diagnostics pipe and the runtime decides the best way to honor it.
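To make the "command on the diagnostics pipe" idea concrete, here is a hedged Python sketch that only builds the bytes of a dump request. The layout (14-byte magic, 16-bit total size, command set, command id, reserved field, then a length-prefixed UTF-16 path followed by two 32-bit integers) follows the runtime's published diagnostics IPC protocol as understood at the time of writing; the constants should be verified against the current specification. `build_generate_core_dump_request` is a hypothetical helper name, and nothing in this thread confirms that ProcDump's profiler sends exactly this message.

```python
import struct

MAGIC = b"DOTNET_IPC_V1\x00"   # 14-byte magic (assumed from the IPC spec)
DUMP_COMMAND_SET = 0x01        # assumed id of the "Dump" command set
GENERATE_CORE_DUMP = 0x01      # assumed id of the "GenerateCoreDump" command


def build_generate_core_dump_request(dump_path, dump_type=2, diagnostics=0):
    """Build the raw bytes of a dump request; it is not sent anywhere.

    dump_type=2 is assumed to mean "with heap"; check the spec before use.
    """
    # Strings are a 32-bit character count (including the terminating NUL)
    # followed by little-endian UTF-16 code units.
    encoded = (dump_path + "\x00").encode("utf-16-le")
    payload = struct.pack("<I", len(encoded) // 2) + encoded
    payload += struct.pack("<II", dump_type, diagnostics)

    # Header: magic, total packet size, command set, command id, reserved.
    size = len(MAGIC) + 6 + len(payload)
    header = MAGIC + struct.pack(
        "<HBBH", size, DUMP_COMMAND_SET, GENERATE_CORE_DUMP, 0
    )
    return header + payload


if __name__ == "__main__":
    packet = build_generate_core_dump_request("/tmp/core")
    print(len(packet), packet[:20].hex())
```

The point of the sketch is that the client only names the command and its arguments. The runtime inside the target process decides how to produce the dump, which is why createdump ends up running with the target's credentials rather than the caller's.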

i-to commented

Thanks for the explanation, now I have the complete picture of the problem. So, what's the plan for this issue, would you consider fixing it or do you see such configuration as an unsupported case?

I've labeled this with 'feature request'. One way to address this is for the profiler in the target process to call back to procdump and have procdump explicitly invoke createdump (assuming you started procdump with sudo). In the case of a true out of memory exception there are no guarantees that this will work though, just best effort. What is the scenario where you need to deny access to proc/{pid}/mem where {pid} is the process itself?
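The callback approach described above could look roughly like the following Python sketch. It is an illustration of the idea only, not ProcDump code: the line-based `DUMP <pid>` message, the function names, and the injected `run_dump` callable (which in a real monitor would invoke createdump with the monitor's privileges) are all hypothetical.

```python
import socket


def send_dump_request(sock, pid):
    """Target side: ask the privileged monitor to dump process `pid`."""
    sock.sendall(f"DUMP {pid}\n".encode("ascii"))


def read_reply(sock):
    """Target side: return True if the monitor reported success."""
    return sock.recv(16).strip() == b"OK"


def serve_one_request(sock, run_dump):
    """Monitor side: read one request and run the dump on the caller's behalf.

    run_dump is a hypothetical callable taking a pid and returning a truthy
    value on success. It runs with the monitor's privileges, not the target's.
    """
    line = sock.recv(64).decode("ascii", errors="replace").strip()
    verb, _, arg = line.partition(" ")
    ok = verb == "DUMP" and arg.isdigit() and bool(run_dump(int(arg)))
    sock.sendall(b"OK\n" if ok else b"FAIL\n")
    return ok


if __name__ == "__main__":
    # A socket pair stands in for whatever channel the two sides would share.
    target_end, monitor_end = socket.socketpair()
    send_dump_request(target_end, 5119)
    serve_one_request(
        monitor_end, run_dump=lambda pid: print("would dump", pid) or True
    )
    print("dump succeeded:", read_reply(target_end))
```

A real implementation would also need to authenticate the requester (for example with peer credentials on a Unix socket), since the monitor would be running as root and acting on requests from a less privileged process.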

i-to commented

The problem itself is not specific to out-of-memory exceptions and can happen in any other kind of tracing too. And even when a real OOM is thrown, the entire system is not necessarily out of memory: for example, a memory limit may be imposed on the CLR, or a large object may fail to allocate due to LOH fragmentation. Dumping would help in all those situations.

What is the scenario where you need to deny access to proc/{pid}/mem where {pid} is the process itself?

It was not intentionally set up that way. Maybe it's simply the default behavior for system accounts (which are meant to be more restricted than normal user accounts), or some specifics of how the service is configured... We'll try to find out why, and whether we can easily change that.
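One thing that may be worth ruling out, offered as an assumption rather than a diagnosis of this service: according to proc(5), the files under /proc/<pid> are owned by root:root when the process's "dumpable" attribute has a value other than 1, which can happen, for example, after a process changes its effective user ID. The Python sketch below reads that attribute for the calling process through prctl(PR_GET_DUMPABLE); `get_dumpable` is a hypothetical helper name.

```python
import ctypes

PR_GET_DUMPABLE = 3  # value from <linux/prctl.h>


def get_dumpable():
    """Return this process's dumpable attribute (0, 1 or 2).

    Returns None when prctl is not available (for example, not on Linux)
    or when the call fails.
    """
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        prctl = libc.prctl
    except (OSError, AttributeError, TypeError):
        return None
    value = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0)
    return value if value >= 0 else None


if __name__ == "__main__":
    # 1 means dumpable; any other value would be consistent with
    # root-owned /proc/<pid> entries.
    print("dumpable:", get_dumpable())
```

The attribute is per process, so this only reports on the process that runs the script. Starting it under the same unit settings as the real service (same account and the same security-related options) would show whether that configuration by itself yields a non-dumpable process.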

Absolutely, the general problem is that the target process doesn't have access to its own memory map. I was using the specific example of OOM exception where there is no guarantee of success.