osandov/drgn

Segfault in `drgn_program_kernel_core_dump_cache_crashed_thread()`

Hello.

Opening an s390x vmcore (RHEL 7 kernel 3.10.0-1160.80.1.el7.s390x) on an x86_64 machine results in a segfault:

#0  0x00007f888fe69188 in drgn_program_kernel_core_dump_cache_crashed_thread (prog=<optimized out>) at ../../libdrgn/program.c:1337
1337    ../../libdrgn/program.c: No such file or directory.
[Current thread is 1 (Thread 0x7f889160c740 (LWP 42295))]
(gdb) bt
#0  0x00007f888fe69188 in drgn_program_kernel_core_dump_cache_crashed_thread (prog=<optimized out>) at ../../libdrgn/program.c:1337
#1  drgn_program_crashed_thread (prog=0x55b851cc24f0, ret=ret@entry=0x7ffd58015fd0) at ../../libdrgn/program.c:1385
#2  0x00007f888fe24e35 in Program_crashed_thread (self=<optimized out>) at ../../libdrgn/python/program.c:890
#3  0x00007f889120ad7e in ?? () from /usr/lib/libpython3.11.so.1.0
#4  0x00007f88911f20d7 in PyObject_Vectorcall () from /usr/lib/libpython3.11.so.1.0
#5  0x00007f88911e4379 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#6  0x00007f889120a9e0 in _PyFunction_Vectorcall () from /usr/lib/libpython3.11.so.1.0
#7  0x00007f88912127b0 in ?? () from /usr/lib/libpython3.11.so.1.0
#8  0x00007f88911d9e33 in _PyObject_MakeTpCall () from /usr/lib/libpython3.11.so.1.0
#9  0x00007f88911e4379 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#10 0x00007f889129e9aa in ?? () from /usr/lib/libpython3.11.so.1.0
#11 0x00007f889129e3bc in PyEval_EvalCode () from /usr/lib/libpython3.11.so.1.0
#12 0x00007f88912b52f2 in ?? () from /usr/lib/libpython3.11.so.1.0
#13 0x00007f88911f2eba in ?? () from /usr/lib/libpython3.11.so.1.0
#14 0x00007f88911f20d7 in PyObject_Vectorcall () from /usr/lib/libpython3.11.so.1.0
#15 0x00007f88911e4379 in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.11.so.1.0
#16 0x00007f889120a9e0 in _PyFunction_Vectorcall () from /usr/lib/libpython3.11.so.1.0
#17 0x00007f88912c7d47 in ?? () from /usr/lib/libpython3.11.so.1.0
#18 0x00007f88912c7635 in Py_RunMain () from /usr/lib/libpython3.11.so.1.0
#19 0x00007f8891290c3b in Py_BytesMain () from /usr/lib/libpython3.11.so.1.0
#20 0x00007f8890e39850 in ?? () from /usr/lib/libc.so.6
#21 0x00007f8890e3990a in __libc_start_main () from /usr/lib/libc.so.6
#22 0x000055b85126a045 in _start ()
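
For reference, frame #2 (`Program_crashed_thread`) is the Python binding behind `prog.crashed_thread()`, so a script along the following lines should reach the same code path. This is only a minimal sketch, not the actual script that crashed: the function name `describe_crashed_thread` and both path arguments are placeholders chosen for illustration.

```python
def describe_crashed_thread(vmcore_path, vmlinux_path):
    """Open a kernel core dump and return (tid, comm) of the crashed thread.

    Both arguments are placeholder paths: the vmcore itself and a vmlinux
    with debug info that matches the dumped kernel.
    """
    # Imported lazily so this file can be loaded even where drgn is absent.
    import drgn

    prog = drgn.Program()
    prog.set_core_dump(vmcore_path)
    prog.load_debug_info([vmlinux_path])

    # This is the call that shows up as Program_crashed_thread in the backtrace.
    thread = prog.crashed_thread()

    # For a kernel program, thread.object is the thread's task_struct pointer.
    task = thread.object
    return thread.tid, task.comm.string_().decode()
```

Calling it with the s390x vmcore and the matching debug vmlinux would be expected to hit the segfault on an affected build, and to return the thread ID and command name otherwise.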

It used to work just fine up until very recently. Reverting e2e2ebc fixes the issue.

Please check.

Thanks.

I just pushed a fix to https://github.com/osandov/drgn/tree/crashed-thread-from-cpu-curr. Could you please test that? In summary, before e2e2ebc, we would get the correct stack trace but an incorrect task_struct from prog.crashed_thread().object. s390x puts bogus PIDs in the crash dump metadata. After e2e2ebc, those bogus PIDs exposed existing code with a missing error check, and that is what caused the crash. My fix handles this s390x quirk and adds the missing error handling as well.
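
To make the failure mode concrete, here is a toy model in plain Python. It is not drgn code, and every name in it (`ToyKernel`, `crashed_task_by_pid`, `crashed_task_by_cpu`) is hypothetical. It assumes, going by the branch name, that the fix looks up the crashed task through the task that was current on the crashed CPU rather than through the PID recorded in the dump metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    pid: int
    comm: str


class ToyKernel:
    """Hypothetical stand-in for a dumped kernel: tasks plus per-CPU current task."""

    def __init__(self, tasks: List[Task], cpu_curr: Dict[int, Task]):
        self.tasks_by_pid = {task.pid: task for task in tasks}
        self.cpu_curr = cpu_curr

    def find_task(self, pid: int) -> Optional[Task]:
        # Returns None when the PID is unknown, e.g. a bogus PID from the dump.
        return self.tasks_by_pid.get(pid)


def crashed_task_by_pid(kernel: ToyKernel, note_pid: int) -> Optional[Task]:
    """PID-based lookup: trusts the PID stored in the crash dump metadata.

    With a bogus PID this yields None; a caller that uses the result
    without checking it is the toy analogue of the missing error check.
    """
    return kernel.find_task(note_pid)


def crashed_task_by_cpu(kernel: ToyKernel, crashed_cpu: int) -> Task:
    """CPU-based lookup: take the task that was current on the crashed CPU.

    The failure case is reported explicitly instead of being passed along.
    """
    task = kernel.cpu_curr.get(crashed_cpu)
    if task is None:
        raise LookupError(f"no current task recorded for CPU {crashed_cpu}")
    return task
```

With a kernel whose CPU 0 was running PID 4242, `crashed_task_by_pid(kernel, -1)` returns `None` for the bogus PID, while `crashed_task_by_cpu(kernel, 0)` still finds the right task and raises `LookupError` for a CPU that has no recorded current task.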

While building the fix:

CC       python/_drgn_la-stack_trace.lo
../../libdrgn/program.c: In function 'drgn_program_find_thread_kernel_cpu_curr':
../../libdrgn/program.c:1306:15: warning: implicit declaration of function 'linux_helper_cpu_curr'; did you mean 'linux_helper_pid_task'? [-Wimplicit-function-declaration]
1306 |         err = linux_helper_cpu_curr(&thread->object, cpu);
     |               ^~~~~~~~~~~~~~~~~~~~~
     |               linux_helper_pid_task
../../libdrgn/program.c:1306:13: warning: assignment to 'struct drgn_error *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
1306 |         err = linux_helper_cpu_curr(&thread->object, cpu);
     |             ^
CC       python/_drgn_la-symbol.lo

It compiles, but when I try to run it I get:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/onatalen/work/src/crush/crush/crush.py", line 13, in <module>
    from drgn import FaultError, MissingDebugInfoError, Program
  File "/usr/lib/python3.11/site-packages/drgn/__init__.py", line 48, in <module>
    from _drgn import (
ImportError: /usr/lib/python3.11/site-packages/_drgn.cpython-311-x86_64-linux-gnu.so: undefined symbol: linux_helper_cpu_curr

Ah, OK, I guess I have to pick cc0994a too. A moment…

Right, cc0994a + 7cb3e99 fixes the issue for me. Thanks for looking into it.

Thanks for testing!