Segfaults are difficult to debug
hut8 opened this issue · 14 comments
I'm on the x86-64 track right now. This platform is really great, so first, thanks so much for making it. I developed my solution locally and tested it successfully: all the tests pass. I tried make clean && make just in case, and everything works fine. When I upload my solution, all I get is:
Many thanks!
Hi and welcome to Exercism!
Thanks for opening an issue.
- If you are suggesting a new feature or an improvement to Exercism, please take a read of this post, which will likely result in a faster response.
- If you are reporting a bug in the website, thank you! We are getting a lot of reports at the moment (which is great), but we triage and reply as soon as we can.
- If you are requesting support, someone will help shortly.
- For everything else, we will reply or triage your issue to the right repository soon.
Perhaps this could also just be documentation that's added for x86-64. I'm open to suggestions. Having the core files available for download would be very awesome but also more difficult. Here's what I've done so far to help myself:
- Clone https://github.com/exercism/x86-64-assembly-test-runner
- On the host, run:
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
- cd into x86-64-assembly-test-runner
- Add
cp -r * "$3"
into run.sh in order to get the source files and binaries out of the container and into the host's output directory. The source files would be the same, so you could just copy those from your source directory, but the test runner uses Alpine, so its binary will be very different from one you would produce yourself!
- Run
docker build -t test-runner .
- Adapt docker-run.sh's command (just run the below)
docker run --init \
--ulimit core=-1 \
--mount type=bind,source=/tmp/,target=/tmp/ \
-v $HOME/exercism/x86-64-assembly/two-fer:/mnt/exercism-iteration \
-v $HOME/exercism/output:/output \
test-runner two-fer /mnt/exercism-iteration/ /output/
The --init flag ensures proper signal handling, --ulimit core=-1 makes sure cores get created regardless of size, and the --mount is necessary so that the core gets written somewhere the host can access. Other than that, it's the same as the regular command.
I basically copied this straight from https://ddanilov.me/how-to-configure-core-dump-in-docker-container.
- The above command creates the core dump file in the host's /tmp.
cp /tmp/core.tests.* $HOME/exercism/output/
- Install gdb if necessary, then:
cd $HOME/exercism/output
gdb tests core.tests.*
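As an aside on why the core file shows up as /tmp/core.tests.<pid>: in the core_pattern set earlier, %e stands for (roughly) the executable name and %p for the PID of the crashing process. The helper below is only an illustrative sketch; expand_core_pattern is a made-up name, and it handles just those two specifiers.

```shell
# Hypothetical helper (not part of the test runner): expand the %e and %p
# specifiers of a core_pattern to show which file name the kernel would use.
# Assumes the executable name contains no characters special to sed.
expand_core_pattern() {
    pattern=$1   # e.g. /tmp/core.%e.%p
    exe=$2       # executable name, e.g. tests
    pid=$3       # PID of the crashing process
    printf '%s\n' "$pattern" | sed -e "s/%e/$exe/g" -e "s/%p/$pid/g"
}

expand_core_pattern '/tmp/core.%e.%p' tests 32
# prints /tmp/core.tests.32
```

This is why the cp and gdb commands above use the glob core.tests.*: the PID part changes on every run.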
Another method (probably simpler) is adding gdb in the Dockerfile, changing the Makefile to directly invoke gdb on the tests binary, and adding --cap-add=SYS_PTRACE --security-opt seccomp=unconfined when running docker run, in addition to the above arguments, so that GDB is allowed to do certain things like disabling address space randomization. I'm still experimenting with the best way to debug in the container.
If I also add musl-dbg, I can now see what's going on in ld-musl:
(gdb) run
Starting program: /mnt/exercism-iteration/tests
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7fc408c in do_relocs (dso=dso@entry=0x7ffff7ffd8a0 <app>, rel=0x555555554508, rel_size=216, stride=stride@entry=3) at ldso/dynlink.c:445
445 ldso/dynlink.c: No such file or directory.
I am somewhat surprised nobody else has run into this issue! It's strange to me that this doesn't even seem to be in my code; it's segfaulting in the dynamic linker:
(gdb) bt
#0 0x00007ffff7fc408c in do_relocs (dso=dso@entry=0x7ffff7ffd8a0 <app>, rel=0x555555554508, rel_size=216, stride=stride@entry=3) at ldso/dynlink.c:445
#1 0x00007ffff7fc4bde in reloc_all (p=p@entry=0x7ffff7ffd8a0 <app>) at ldso/dynlink.c:1316
#2 0x00007ffff7fc638a in __dls3 (sp=0x7fffffffecb0) at ldso/dynlink.c:1879
#3 0x00007ffff7fc5ba7 in __dls2b (sp=0x7fffffffecb0) at ldso/dynlink.c:1660
#4 0x00007ffff7fc5b4c in __dls2 (base=<optimized out>, sp=0x7fffffffecb0) at ldso/dynlink.c:1638
#5 0x00007ffff7fc3750 in _dlstart () from /lib/ld-musl-x86_64.so.1
#6 0x0000000000000001 in ?? ()
#7 0x00007fffffffeea3 in ?? ()
#8 0x0000000000000000 in ?? ()
(cc @exercism/x86-64-assembly)
Which exercise is this? Could you provide a link to your solution, e.g., on godbolt (https://godbolt.org/z/Ge9K1ev57), and I'll have a look.
That won't solve the segfault debugging issue, but maybe I'll be able to spot something in the code.
Exercise is two-fer.
Well this is interesting. Small snag that may indicate a bug in my code, or maybe not. They are running nasm 2.14.02 on there, which is the exact version I'm running.
<source>:62: error: impossible combination of address sizes
<source>:62: error: invalid effective address
The line is:
cmp byte [rdi], 0 ; is the byte at *rdi zero?
This works fine locally with nasm at the exact same version, but doesn't work on theirs. I have to add -f elf64 to the flags and then it works.
But anyway, here it is, it compiles now: https://godbolt.org/z/9bvoxEzc5
I added valgrind to the Dockerfile and changed the Makefile to run valgrind tests instead of just tests. Here's the output:
==32== Memcheck, a memory error detector
==32== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==32== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==32== Command: ./tests
==32==
==32==
==32== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==32== Bad permissions for mapped region at address 0x109389
==32== at 0x405708C: do_relocs (dynlink.c:445)
==32== by 0x4057BDD: reloc_all (dynlink.c:1316)
==32== by 0x4059389: __dls3 (dynlink.c:1879)
==32== by 0x4058BA6: __dls2b (dynlink.c:1660)
==32== by 0x4058B4B: __dls2 (dynlink.c:1638)
==32== by 0x405674F: ??? (in /lib/ld-musl-x86_64.so.1)
==32==
==32== HEAP SUMMARY:
==32== in use at exit: 960 bytes in 7 blocks
==32== total heap usage: 7 allocs, 0 frees, 960 bytes allocated
==32==
==32== LEAK SUMMARY:
==32== definitely lost: 0 bytes in 0 blocks
==32== indirectly lost: 0 bytes in 0 blocks
==32== possibly lost: 0 bytes in 0 blocks
==32== still reachable: 48 bytes in 1 blocks
==32== suppressed: 912 bytes in 6 blocks
==32== Rerun with --leak-check=full to see details of leaked memory
==32==
==32== For lists of detected and suppressed errors, rerun with: -s
==32== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
make: *** [Makefile:26: all] Segmentation fault
Now I'm getting somewhere. On a whim, I changed the Dockerfile to point to 3.14.6 and tried rebuilding. Instead of everything building fine and then blowing up at runtime, I get a useful error:
/usr/lib/gcc/x86_64-alpine-linux-musl/10.3.1/../../../../x86_64-alpine-linux-musl/bin/ld: two_fer.o: warning: relocation in read-only section `.text'
/usr/lib/gcc/x86_64-alpine-linux-musl/10.3.1/../../../../x86_64-alpine-linux-musl/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
make: *** [Makefile:31: tests] Error 1
I don't quite understand why this is happening. I googled around, and although I am unsure how to continue if I build a PIE, if I change the cflags and ldflags to replace -pie with -no-pie and -fPIE with -fno-pie, then everything works fine. Why do you want to build a PIE in the first place here? We're not building a shared library, so I'm unsure of the advantage. I found a fix of sorts (unfortunately, I can't change the Makefile on production), but it doesn't make sense that my code won't link properly while other people's code works fine.
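To double-check which kind of binary a given set of flags actually produced, one option is to look at the Type field printed by readelf -h. The filter below is a rough sketch; elf_kind is a made-up name, and it only looks at the first word after "Type:".

```shell
# Hypothetical filter (not part of the test runner): classify a binary from
# the output of `readelf -h`, read on stdin.
# DYN  -> position-independent executable or shared object
# EXEC -> fixed-address (non-PIE) executable
elf_kind() {
    awk '$1 == "Type:" {
             if ($2 == "DYN") print "pie-or-shared"
             else if ($2 == "EXEC") print "non-pie"
             else print "other"
             found = 1
         }
         END { if (!found) print "unknown" }'
}

# Usage, assuming readelf is installed and ./tests exists:
#   readelf -h ./tests | elf_kind
```

Built with -pie the tests binary should report pie-or-shared, and with -no-pie it should report non-pie.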
I decided to force PIE, since that's the default on a lot of systems these days.
You need to use, e.g., lea rsi, [rel msg_pre] when loading your strings from the .data section. Or add default rel at the top of the file, and use lea rsi, [msg_pre].
This is known as RIP-relative addressing. See http://www.nynaeve.net/?p=192 for info on that.
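A minimal way to try the default rel variant in isolation might look like the sketch below. The file name foo_rel.asm, the contents of msg_pre, and the temporary directory are all made up for illustration; they are not taken from the actual two-fer solution. The assemble and link step only runs if nasm and ld happen to be installed.

```shell
# Work in a throwaway directory so nothing in the exercise folder is touched.
workdir=$(mktemp -d)

# A tiny stand-alone program (hypothetical, not the two-fer solution) that
# loads the address of a .data string using RIP-relative addressing.
cat > "$workdir/foo_rel.asm" <<'EOF'
default rel                 ; make [label] operands RIP-relative by default

section .data
msg_pre db "One for ", 0    ; placeholder contents

section .text
global _start
_start:
    lea rsi, [msg_pre]      ; address computed relative to RIP
    mov eax, 60             ; exit(0) via the Linux syscall interface
    xor edi, edi
    syscall
EOF

# Assemble and link as a PIE, but only if the tools are available.
if command -v nasm >/dev/null 2>&1 && command -v ld >/dev/null 2>&1; then
    nasm -felf64 -o "$workdir/foo_rel.o" "$workdir/foo_rel.asm" &&
        ld -pie -o "$workdir/foo_rel" "$workdir/foo_rel.o" ||
        echo "assembling or linking failed" >&2
fi
```

Because the lea operand is RIP-relative, the linker should have no reason to emit the "relocation in read-only section" warning for this file.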
Wow, thank you very much for that. I really appreciate it. I changed the addresses of the strings as you specified and it worked. I learned a lot.
Actually, one other question: why does this work fine with gnu libc but not MUSL? Shouldn't it crash in both?
Okay, I did some digging :p
I used the following example code:
$ cat foo.asm
section .data
msg db "hello, world!", 0
section .text
global _start
_start:
mov rdi, msg
ret
When linking with pie on Alpine I get the following:
$ nasm -felf64 foo.asm
$ ld -pie foo.o
ld: foo.o: warning: relocation in read-only section `.text'
ld: warning: creating DT_TEXTREL in a PIE
Doing the same on Ubuntu yields no output.
However, adding the --warn-shared-textrel flag yields the following output:
$ ld -pie --warn-shared-textrel foo.o
ld: foo.o: warning: relocation in read-only section `.text'
ld: warning: creating a DT_TEXTREL in a shared object
Looking at the following, it seems like relocations in the .text section are not supported in musl: https://bugs.gentoo.org/707660
All musl developers confirms that musl do not support DT_TEXTREL. Now they are discussing how to provide informative error instead of segfault (runtime protection).
I agree that getting a segfault in this case is not a good user experience, and it's really hard to figure out what's going on. I'm not sure what the best solution would be, though. Some options:
- Try to detect this special case somehow, and give a more informative message to the user?
- Disable PIE?
- Switch to an Ubuntu image, where this is allowed?
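For the first option, one possible sketch is to inspect the dynamic section of the linked binary for a TEXTREL entry before running it. The function names and the hint wording below are made up for illustration; nothing like this exists in the test runner today, and it assumes readelf is available in the image.

```shell
# Hypothetical check (not part of the test runner).
# has_textrel reads `readelf -d` output on stdin and succeeds if it mentions
# TEXTREL (either a DT_TEXTREL entry or the TEXTREL flag in DT_FLAGS).
has_textrel() {
    grep -q 'TEXTREL'
}

# Print a friendlier hint instead of letting the binary segfault at load time.
check_binary() {
    if readelf -d "$1" 2>/dev/null | has_textrel; then
        echo "hint: $1 contains text relocations, which musl does not support;" \
             "load addresses with RIP-relative addressing, e.g. lea rsi, [rel msg]"
        return 1
    fi
    return 0
}

# Usage, assuming the linked test binary is ./tests:
#   check_binary ./tests || exit 1
```

Running this between the link step and the test step would turn the silent crash in the dynamic linker into a message that points at the actual cause.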