async-profiler/async-profiler

clone() syscall infinitely restarts because of SIGPROF signals

advancedxy opened this issue · 7 comments

When profiling a Spark application with large memory and subprocess execution(launch a subprocess in the JVM or native library side), the whole process was hanging at fork forever.

After some debugging, I believe it's similar with https://bugzilla.redhat.com/show_bug.cgi?id=645528 .
And the workaround is simple: increase interval to 20ms.

You can add this to the README or I can send a pr for this

Interesting. Can you reproduce the issue on a reduced test case? (to make sure this is exactly the problem you referred to).
A paragraph in the troubleshooting section will be helpful then.
Thanks.

Let's setup a minimal reproduce case first then. I will post back when I get one.

Let's setup a minimal reproduce case first then. I will post back when I get one.

Sorry, I tried to setup a minimal reproduce case in JVM only, however the scenario I described cannot be reproduced.

I reproduced the case by simplify my real workload, it's indeed hangs at clone. I cannot post my workload here, but the important debug process can be shared:

After I found the thread hanging forever, I use strace -p $lwpid to generate following output

--- SIGPROF (Profiling timer expired) @ 0 (0) ---
read(52, "w\322\3\0\0\0\0\0", 8)        = 8
gettid()                                = 32558
ioctl(52, 0x2403, 0)                    = 0
ioctl(52, 0x2402, 0x1)                  = 0
rt_sigreturn(0x34)                      = 56
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7feae49b19bc) = ? ERESTARTNOINTR (To be restarted)
--- SIGPROF (Profiling timer expired) @ 0 (0) ---
read(52, "\345\323\3\0\0\0\0\0", 8)     = 8
gettid()                                = 32558
ioctl(52, 0x2403, 0)                    = 0
ioctl(52, 0x2402, 0x1)                  = 0
rt_sigreturn(0x34)                      = 56
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7feae49b19bc) = ? ERESTARTNOINTR (To be restarted)
--- SIGPROF (Profiling timer expired) @ 0 (0) ---
read(52, "l\321\3\0\0\0\0\0", 8)        = 8
gettid()                                = 32558
ioctl(52, 0x2403, 0)                    = 0
ioctl(52, 0x2402, 0x1)                  = 0
rt_sigreturn(0x34)                      = 56
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7feae49b19bc) = ? ERESTARTNOINTR (To be restarted)

It looks that the clone is repeatedly interrupted by SIGPROF

Thank you for a great analysis! Feel free to add a README paragraph or leave it to me if you prefer.

I will send a PR for for the README then.

Let me close this one. Thanks again for pointing out this issue.

Jongy commented

I was skimming through vmprof-python's code today, and I found this solution that they've implemented for the same issue. I suppose it is something we could add in async-profiler as well (as an opt-in feature).