Segfault within bash when excluded
dev-zero opened this issue · 11 comments
I built Darshain without MPI, within a container. When running bash with darshan LD_PRELOADed and excluded, every system(...)
call seems to segfault:
5997e87a8f78:/darshan # LD_PRELOAD=/darshan/lib/libdarshan.so DARSHAN_APP_EXCLUDE=bash DARSHAN_ENABLE_NONMPI=1 /bin/bash
5997e87a8f78:/darshan # foo=bar
5997e87a8f78:/darshan # echo $foo
bar
5997e87a8f78:/darshan # env
Segmentation fault
5997e87a8f78:/darshan # ls
Segmentation fault
5997e87a8f78:/darshan #
The use case is that I want to profile other apps contained within containers and called by bash scripts. Running the containers with -e LD_PRELOAD=...
after having copied darshan from my builder container is the easiest way to get there.
Can you generate any sort of backtrace to get clues on where the crash is occurring?
I just tried this out quickly on my laptop and have no problems using non-MPI mode the way you are above. I.e. using LD_PRELOAD I tried excluding bash
as you do, and I don't get any segfaults (and I get Darshan logs for everything except bash
as expected).
It seems that when launching the shell, Darshan crashes for any subshell launched:
❯ ls cores
core.bash.469 core.bash.471 core.bash.473 core.bash.475 core.bash.477 core.bash.479 core.bash.481 core.bash.483
core.bash.470 core.bash.472 core.bash.474 core.bash.476 core.bash.478 core.bash.480 core.bash.482
The backtrace for all the cores starts with this:
Core was generated by `/bin/bash'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 darshan_core_fork_child_cb () at darshan-core.c:2188
2188 parent_pid = __darshan_core->pid;
(gdb) bt
#0 darshan_core_fork_child_cb () at darshan-core.c:2188
#1 0x00007f867fea9158 in __run_fork_handlers (who=atfork_run_child, do_locking=false) at register-atfork.c:135
#2 0x00007f867fef2ad4 in __libc_fork () at ../sysdeps/nptl/fork.c:129
#3 0x000055ef3cec0ced in make_child (command=command@entry=0x0, async_p=0) at jobs.c:1967
#4 0x000055ef3ce9f474 in command_substitute (string=0x55ef3e1307c0 "declare -p COLUMNS 2> /dev/null", quoted=1) at subst.c:6213
...
For example the last one (trying to run ls
):
9943fe2864fe:/darshan # gdb /bin/bash /cores/core.bash.483
GNU gdb (GDB; SUSE Linux Enterprise 15) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /bin/bash...
Reading symbols from /usr/lib/debug/usr/bin/bash-4.4-150400.25.22.x86_64.debug...
[New LWP 483]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/bin/bash'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 darshan_core_fork_child_cb () at darshan-core.c:2188
2188 parent_pid = __darshan_core->pid;
(gdb) bt
#0 darshan_core_fork_child_cb () at darshan-core.c:2188
#1 0x00007f867fea9158 in __run_fork_handlers (who=atfork_run_child, do_locking=false) at register-atfork.c:135
#2 0x00007f867fef2ad4 in __libc_fork () at ../sysdeps/nptl/fork.c:129
#3 0x000055ef3cec0ced in make_child (command=0x55ef3e12d250 "ls", async_p=async_p@entry=0) at jobs.c:1967
#4 0x000055ef3ce85754 in execute_disk_command (cmdflags=0, fds_to_close=0x55ef3e130110, async=0, pipe_out=-1, pipe_in=-1, command_line=0x55ef3e1277f0 "ls", redirects=0x0, words=0x55ef3e12d110) at execute_cmd.c:5189
#5 execute_simple_command (simple_command=<optimized out>, pipe_in=-1, pipe_out=-1, async=0, fds_to_close=0x55ef3e130110) at execute_cmd.c:4396
#6 0x000055ef3ce8355d in execute_command_internal (command=0x55ef3e12eae0, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_close=0x55ef3e130110) at execute_cmd.c:811
#7 0x000055ef3ce846d1 in execute_command (command=0x55ef3e12eae0) at execute_cmd.c:405
#8 0x000055ef3ce24366 in reader_loop () at eval.c:180
#9 0x000055ef3ce78acc in main (argc=1, argv=0x7ffefe32c5d8, env=0x7ffefe32c5e8) at shell.c:794
Thanks for the details. Looks like there could be a Darshan bug here related specifically to processes that are calling fork()
. It may be that I didn't run anything that was forking so I didn't hit the issue.
Let me investigate more and see if I can get a fix for this.
It may be a bit more complicated. Running:
9943fe2864fe:/darshan # LD_PRELOAD=/darshan/lib/libdarshan.so DARSHAN_APP_EXCLUDE=bash DARSHAN_ENABLE_NONMPI=1 /bin/bash -c 'ls'
bin include lib share
9943fe2864fe:/darshan # LD_PRELOAD=/darshan/lib/libdarshan.so DARSHAN_APP_EXCLUDE=bash DARSHAN_ENABLE_NONMPI=1 /bin/bash
9943fe2864fe:/darshan # ls
Segmentation fault (core dumped)
9943fe2864fe:/darshan # exit
So, only when bash
is running interactively.
9943fe2864fe:/darshan # LD_PRELOAD=/darshan/lib/libdarshan.so DARSHAN_APP_EXCLUDE=bash DARSHAN_ENABLE_NONMPI=1 /bin/bash -ic 'ls'
bin include lib share
... seems to work, but I get core dumps for the subshells (but not for ls
itself).
Could you perhaps provide more details on how to setup the container you used and how you installed Darshan? I cannot seem to get a reproducer for this on my laptop, using the different bash
commands you used above, as well as trying to force the behavior using applications I know call fork()
.
Sure, the following is the Dockerfile I am using to build a Darshan container:
#syntax=docker/dockerfile:1.4
FROM opensuse/leap:15 as builder
RUN zypper up
RUN zypper --non-interactive install -t pattern \
devel_basis
RUN zypper --non-interactive install \
python3 \
bzip2 \
libbz2-devel
WORKDIR /darshan.build
COPY ./darshan/ ./
RUN set -ex ; \
./prepare.sh ; \
./configure \
--prefix=/darshan \
--with-log-path=/darshan/logs \
--with-jobid-env=PBS_JOBID \
--without-mpi CC=gcc \
; \
make ; \
make install
FROM opensuse/leap:15
RUN zypper --non-interactive install \
python3 \
python3-numpy \
python3-matplotlib \
perl-Encode \
perl-libwww-perl \
gnuplot \
which \
texlive-latex \
texlive-epstopdf \
texlive-psnfss \
texlive-fancyhdr \
texlive-lastpage \
texlive-subfigure \
texlive-multirow \
texlive-threeparttable \
texlive-dvips \
texlive-charter \
&& zypper clean -a
WORKDIR /darshan
COPY --from=builder --link /darshan ./
RUN mkdir /darshan/logs
ENV PATH="$PATH:/darshan/bin"
and then to build and run:
gh repo clone darshan-hpc/darshan
docker build -t darsan
# drop the `colima ssh` if on Linux or replace with equivalent from your setup
echo '/cores/core.%e.%p' | colima ssh sudo tee /proc/sys/kernel/core_pattern
docker run -it --rm --ulimit core=-1 -v ./cores:/cores -v ./darshan:/darshan.build darshan /bin/bash
For a layman like me it looks like a race condition, and when checking darshan_core_initialize
it looks like there are some codepaths where the fork handler is setup, but the main global object is not initialised, like this one:
darshan/darshan-runtime/lib/darshan-core.c
Lines 335 to 343 in f95e9a2
On the other hand should that have been triggered long time ago if that were the case.
Sorry, no other clues from my side.
A couple of updates:
- I can reproduce the crash using your Dockerfile and suggested commands. I'll see if I can find any sort of race or other bug that could lead to it.
- Ignoring that, I can't seem to generate a Darshan log for anything in the container. Have you had any success with that? I haven't had much of a chance to debug why yet, but it kind of looks like Darshan shutdown never gets called (it does get initialized I confirmed). I don't know what would cause that -- Darshan shutdown is initiated as a destructor for when the Darshan shared library is unloaded.
I did find and fix a bug that is waiting in a PR now, if you're interested in trying it out: #922
I think this bug was not actually a race, but just that Darshan was not properly handling the case where an app that is excluded (via APP_EXCLUDE) is calling fork()
. Darshan has handlers that are called on fork()
, and they expected some internal Darshan library state to be set that is clearly not set in the case where an app is excluded.
Thanks for reporting this! And let me know if you have other issues (like the other issue I mention above about not generating any Darshan logs).
Sorry for the late reply and thanks for the fix. I just tested it and the crash is gone.
As for the logs: assuming you mean the Darshan "trace" logs for the application, then yes, I was able to generate them for the application within a container and got the complete report out of it.