clone() syscall infinitely restarts because of SIGPROF signals
advancedxy opened this issue · 7 comments
When profiling a Spark application with large memory and subprocess execution(launch a subprocess in the JVM or native library side), the whole process was hanging at fork forever.
After some debugging, I believe it's similar with https://bugzilla.redhat.com/show_bug.cgi?id=645528 .
And the workaround is simple: increase interval to 20ms.
You can add this to the README or I can send a pr for this
Interesting. Can you reproduce the issue on a reduced test case? (to make sure this is exactly the problem you referred to).
A paragraph in the troubleshooting section will be helpful then.
Thanks.
Let's setup a minimal reproduce case first then. I will post back when I get one.
Let's setup a minimal reproduce case first then. I will post back when I get one.
Sorry, I tried to setup a minimal reproduce case in JVM only, however the scenario I described cannot be reproduced.
I reproduced the case by simplify my real workload, it's indeed hangs at clone. I cannot post my workload here, but the important debug process can be shared:
After I found the thread hanging forever, I use strace -p $lwpid
to generate following output
--- SIGPROF (Profiling timer expired) @ 0 (0) ---
read(52, "w\322\3\0\0\0\0\0", 8) = 8
gettid() = 32558
ioctl(52, 0x2403, 0) = 0
ioctl(52, 0x2402, 0x1) = 0
rt_sigreturn(0x34) = 56
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7feae49b19bc) = ? ERESTARTNOINTR (To be restarted)
--- SIGPROF (Profiling timer expired) @ 0 (0) ---
read(52, "\345\323\3\0\0\0\0\0", 8) = 8
gettid() = 32558
ioctl(52, 0x2403, 0) = 0
ioctl(52, 0x2402, 0x1) = 0
rt_sigreturn(0x34) = 56
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7feae49b19bc) = ? ERESTARTNOINTR (To be restarted)
--- SIGPROF (Profiling timer expired) @ 0 (0) ---
read(52, "l\321\3\0\0\0\0\0", 8) = 8
gettid() = 32558
ioctl(52, 0x2403, 0) = 0
ioctl(52, 0x2402, 0x1) = 0
rt_sigreturn(0x34) = 56
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7feae49b19bc) = ? ERESTARTNOINTR (To be restarted)
It looks that the clone
is repeatedly interrupted by SIGPROF
Thank you for a great analysis! Feel free to add a README paragraph or leave it to me if you prefer.
I will send a PR for for the README then.
Let me close this one. Thanks again for pointing out this issue.
I was skimming through vmprof-python's code today, and I found this solution that they've implemented for the same issue. I suppose it is something we could add in async-profiler as well (as an opt-in feature).