Fatal "Unix.Unix_error "No child process" waitpid" on Recording.finish_recording
shoffmeister opened this issue · 4 comments
- What happened?
Running `./magic-trace attach -pid 2463`, where 2463 is a peculiar process, then pressing Ctrl+C yields a fatal error:
```
[root@fedora]# ./magic-trace attach -pid 2463 -trace-kernel-only
[ Attached. Press Ctrl-C to stop recording. ]
^C[ Got signal, detaching... ]
[ Snapshot taken. ]
(monitor.ml.Error
 (Unix.Unix_error "No child process" waitpid
  "((mode (WNOHANG)) (pid 21989))")
 ("Raised at Core_unix.improve in file \"core_unix/src/core_unix.ml\", line 45, characters 4-43"
  "Called from Core_unix.wait_gen in file \"core_unix/src/core_unix.ml\", line 923, characters 4-246"
  "Called from Core_unix.wait_nohang in file \"core_unix/src/core_unix.ml\" (inlined), line 960, characters 2-46"
  "Called from Async_unix__Unix_syscalls.Wait.Kind.waitpid_nohang in file \"src/unix_syscalls.ml\", line 506, characters 16-38"
  "Called from Async_unix__Unix_syscalls.Wait.deferred_wait in file \"src/unix_syscalls.ml\" (inlined), line 572, characters 10-39"
  "Called from Async_unix__Unix_syscalls.Wait.waitpid in file \"src/unix_syscalls.ml\", line 579, characters 20-51"
  "Called from Magic_trace_lib__Perf_tool_backend.Recording.finish_recording in file \"src/perf_tool_backend.ml\", line 257, characters 49-78"
  "Called from Magic_trace_lib__Trace.Make_commands.detach.(fun) in file \"src/trace.ml\", line 290, characters 36-80"
  "Called from Async_kernel__Deferred0.bind.(fun) in file \"src/deferred0.ml\", line 54, characters 64-69"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 167, characters 6-47"
  "Caught by monitor Monitor.protect"))
```
- How can we, magic-trace developers, reproduce the issue you're seeing?
In my case, I am trying to trace pid 2463, which happens to be the top-level (X11) Xorg process of a Linux desktop environment with sddm as the session manager, i.e.
```
root 2459    1  0 19:07 ?    00:00:00 /usr/bin/sddm
root 2463 2459 28 19:07 tty2 01:14:26  \_ /usr/libexec/Xorg -nolisten tcp -logverbose 9999 -background none -seat seat0 vt2 -auth /var/run/sddm/c59a1b0d-29ae-4e06-930c-60e4e8f080e4 -noreset -displayfd 15
root 2561 2459  0 19:07 ?    00:00:00  \_ /usr/libexec/sddm-helper --socket /tmp/sddm-authf4e93b32-d111-4352-a2fe-2cd98952d78d --id 1 --start /usr/bin/startplasma-x11 --user stefan
me   2595 2561  0 19:07 ?    00:00:00      \_ /usr/bin/startplasma-x11
me   2803 2595  0 19:07 ?    00:00:00          \_ /usr/bin/ssh-agent /bin/sh -c exec -l /bin/bash -c "/usr/bin/startplasma-x11"
```
- What version of perf userspace tools are you running?
```
perf --version
perf version 5.17.6
```
FWIW, `./magic-trace version` straight off GitHub:

```
((build_time(1970-01-01 01:00:00.000000+01:00))(x_library_inlining false)(portable_int63 true)(dynlinkable_code false)(ocaml_version"")(executable_path"")(build_system""))
v1.0.2
```
- What CPU do you have?
```
cat /proc/cpuinfo
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
```

(That's an 8-core mobile CPU, Tiger Lake.)
I am really trying to trace that very process in order to discover more about its (IMHO) abnormal (and unwanted) CPU usage. My hope is that this yields some tell-tale symbols on the stack to continue my investigation.
Alas, ... dead.
It seems as if there are two issues at hand: the root cause of my problems seems to be `perf` doing a SIGSEGV stunt. The defect in magic-trace is that it (most probably) is not robust against `perf` dying, hence the (far too late) trigger on the monitor waiting for a `perf` process which is already dead. I'll try to get a handle on why `perf` dies.
```
Sun 2022-05-22 23:09:00 CEST 21753 0 0 SIGSEGV present /usr/bin/perf 700.3K
Sun 2022-05-22 23:10:49 CEST 21914 0 0 SIGSEGV present /usr/bin/perf 699.6K
Sun 2022-05-22 23:12:06 CEST 21989 0 0 SIGSEGV present /usr/bin/perf 701.7K
```
FWIW, the problem in my `perf` is https://lore.kernel.org/linux-perf-users/f0add43b-3de5-20c5-22c4-70aff4af959f@scylladb.com/ — essentially, the Fedora 36 build strategy combined with the kernel tools implementation causes an infinite recursion inside `perf`.
Still, making magic-trace more robust against failure of `perf` would be nice. :)
And torvalds/linux@0ae065a seems to fix that just in time for upstream kernel 5.18.
Thanks for this report, it was a fun read. I agree that we should have a better message if perf dies unexpectedly, I'll look into it.