Segfaults are difficult to debug

Question

hut8 opened this issue 3 years ago · 14 comments

I'm on the x86-64 track right now. This platform is really great, so first, thanks so much for making it. I developed my solution locally, and tested it successfully. All the tests pass. I tried make clean && make just in case, and everything works fine. When I upload my solution, all I get is:

☹️ It says that the core is dumped. If I get the core file, I could figure out what's wrong, but it doesn't look like it's available for download. Without that, I can't debug this because "it works on my machine." Any chance you could make broken artifacts available? Can I at least find out more details about exactly what platform it's compiled on?

Many thanks!

Answer 1 · 2022-04-04T23:46:46.000Z

Hi and welcome to Exercism! 👋

Thanks for opening an issue 🙂

If you are suggesting a new feature or an improvement to Exercism, please take a read of this post, which will likely result in a faster response.
If you are reporting a bug in the website, thank you! We are getting a lot of reports at the moment (which is great), but we triage and reply as soon as we can.
If you are requesting support, someone will help shortly.
For everything else, we will reply or triage your issue to the right repository soon.

Answer 2 · 2022-04-05T00:43:27.000Z

Perhaps this could also just be documentation that's added for x86-64. I'm open to suggestions. Having the core files available for download would be very awesome but also more difficult. Here's what I've done so far to help myself:

Clone https://github.com/exercism/x86-64-assembly-test-runner
On the host, run: echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
cd into x86-64-assembly-test-runner
Add cp -r * "$3" into run.sh in order to get the source files and binaries out of the container and into the host's output directory. The source files would be the same, so you could just copy those from your source directory, but the test runner uses alpine so it will be a very different binary than you would produce yourself!
Run docker build -t test-runner .
Adapt docker-run.sh's command (just run the below)

docker run --init \
    --ulimit core=-1 \
    --mount type=bind,source=/tmp/,target=/tmp/ \
    -v $HOME/exercism/x86-64-assembly/two-fer:/mnt/exercism-iteration \
    -v $HOME/exercism/output:/output \
    test-runner two-fer /mnt/exercism-iteration/ /output/

The --init ensures proper signal handling, --ulimit core=-1 makes sure cores get created regardless of size, and the --mount is necessary so that the core gets written somewhere accessible on the host. Other than that, it's the same as the regular command.

I basically copied this straight from https://ddanilov.me/how-to-configure-core-dump-in-docker-container.

The above command creates the core dump file in the host's /tmp.
cp /tmp/core.tests.* $HOME/exercism/output/
Install gdb if necessary
cd $HOME/exercism/output
gdb tests core.tests.*
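
For a non-interactive backtrace, the last two steps could be wrapped in a small script. This is only a sketch: the helper name newest_core is made up, and it assumes the core file and the tests binary were both copied into $HOME/exercism/output as described above.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified file in directory $1
# whose name starts with prefix $2 (for example "core.tests.").
newest_core() {
    ls -t "$1"/"$2"* 2>/dev/null | head -n 1
}

out="$HOME/exercism/output"
core=$(newest_core "$out" core.tests.)

if [ -n "$core" ] && command -v gdb >/dev/null 2>&1; then
    # -batch makes gdb exit after running the -ex commands; bt prints the backtrace.
    gdb -batch -ex bt "$out/tests" "$core"
else
    echo "no core file found in $out, or gdb is not installed"
fi
```

If several cores have piled up, this only looks at the newest one.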

Answer 3 · 2022-04-05T01:03:19.000Z

Another method (probably simpler) is to add gdb to the Dockerfile, change the Makefile to invoke gdb directly on the tests binary, and pass --cap-add=SYS_PTRACE --security-opt seccomp=unconfined to docker run in addition to the arguments above. Those two options let GDB do certain things, like disabling address space randomization. I'm still experimenting with the best way to debug in the container.
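
Put together with the earlier command, that might look like the sketch below. It assumes the image was rebuilt with gdb installed and is still tagged test-runner. The -it flag is my own addition so that gdb gets an interactive terminal, and DRY_RUN is a hypothetical switch for this example: by default it only prints the command.

```shell
#!/bin/sh
# Sketch: the earlier docker run command plus the ptrace-related flags.
# Assumes the image was rebuilt with gdb and is still tagged "test-runner".
# DRY_RUN is a made-up switch: it defaults to 1 (print only);
# set DRY_RUN=0 to actually start the container.
debug_run() {
    slug=$1
    set -- docker run --init -it \
        --ulimit core=-1 \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        --mount type=bind,source=/tmp/,target=/tmp/ \
        -v "$HOME/exercism/x86-64-assembly/$slug:/mnt/exercism-iteration" \
        -v "$HOME/exercism/output:/output" \
        test-runner "$slug" /mnt/exercism-iteration/ /output/
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

debug_run two-fer
```

Run it with DRY_RUN=0 once the printed command looks right.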

Answer 4 · 2022-04-05T01:10:27.000Z

If I also add musl-dbg I can now see what's going on in ld-musl:

(gdb) run
Starting program: /mnt/exercism-iteration/tests

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7fc408c in do_relocs (dso=dso@entry=0x7ffff7ffd8a0 <app>, rel=0x555555554508, rel_size=216, stride=stride@entry=3) at ldso/dynlink.c:445
445     ldso/dynlink.c: No such file or directory.

I am somewhat surprised nobody else has run into this issue! It's strange to me that this doesn't even seem to be in my code; it's segfaulting in the dynamic linker:

(gdb) bt
#0  0x00007ffff7fc408c in do_relocs (dso=dso@entry=0x7ffff7ffd8a0 <app>, rel=0x555555554508, rel_size=216, stride=stride@entry=3) at ldso/dynlink.c:445
#1  0x00007ffff7fc4bde in reloc_all (p=p@entry=0x7ffff7ffd8a0 <app>) at ldso/dynlink.c:1316
#2  0x00007ffff7fc638a in __dls3 (sp=0x7fffffffecb0) at ldso/dynlink.c:1879
#3  0x00007ffff7fc5ba7 in __dls2b (sp=0x7fffffffecb0) at ldso/dynlink.c:1660
#4  0x00007ffff7fc5b4c in __dls2 (base=<optimized out>, sp=0x7fffffffecb0) at ldso/dynlink.c:1638
#5  0x00007ffff7fc3750 in _dlstart () from /lib/ld-musl-x86_64.so.1
#6  0x0000000000000001 in ?? ()
#7  0x00007fffffffeea3 in ?? ()
#8  0x0000000000000000 in ?? ()

Answer 5 · 2022-04-05T09:36:05.000Z

(cc @exercism/x86-64-assembly)

Answer 6 · 2022-04-05T16:13:53.000Z

Which exercise is this? Could you provide a link to your solution, e.g. on godbolt (https://godbolt.org/z/Ge9K1ev57)? I'll have a look.
That won't solve the segfault debugging issue, but maybe I'll be able to spot something in the code.

Answer 7 · 2022-04-05T21:00:57.000Z

Exercise is two-fer.
Well, this is interesting. I hit a small snag that may or may not indicate a bug in my code. Godbolt is running nasm 2.14.02, which is the exact version I'm running.

<source>:62: error: impossible combination of address sizes
<source>:62: error: invalid effective address

line is:

cmp byte [rdi], 0 ; is the byte at *rdi zero?

This works fine locally with nasm at the exact same version, but doesn't work on theirs. I have to add -f elf64 to the flags and then it works.

But anyway, here it is, it compiles now: https://godbolt.org/z/9bvoxEzc5

Answer 8 · 2022-04-05T21:15:16.000Z

I added valgrind to the Dockerfile and changed the Makefile to run valgrind tests instead of just tests. Here's the output:

==32== Memcheck, a memory error detector
==32== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==32== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==32== Command: ./tests
==32==
==32==
==32== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==32==  Bad permissions for mapped region at address 0x109389
==32==    at 0x405708C: do_relocs (dynlink.c:445)
==32==    by 0x4057BDD: reloc_all (dynlink.c:1316)
==32==    by 0x4059389: __dls3 (dynlink.c:1879)
==32==    by 0x4058BA6: __dls2b (dynlink.c:1660)
==32==    by 0x4058B4B: __dls2 (dynlink.c:1638)
==32==    by 0x405674F: ??? (in /lib/ld-musl-x86_64.so.1)
==32==
==32== HEAP SUMMARY:
==32==     in use at exit: 960 bytes in 7 blocks
==32==   total heap usage: 7 allocs, 0 frees, 960 bytes allocated
==32==
==32== LEAK SUMMARY:
==32==    definitely lost: 0 bytes in 0 blocks
==32==    indirectly lost: 0 bytes in 0 blocks
==32==      possibly lost: 0 bytes in 0 blocks
==32==    still reachable: 48 bytes in 1 blocks
==32==         suppressed: 912 bytes in 6 blocks
==32== Rerun with --leak-check=full to see details of leaked memory
==32==
==32== For lists of detected and suppressed errors, rerun with: -s
==32== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
make: *** [Makefile:26: all] Segmentation fault

Answer 9 · 2022-04-05T21:20:44.000Z

Now I'm getting somewhere. On a whim, I changed the Dockerfile to point to 3.14.6 and tried rebuilding. Instead of everything building fine and then blowing up at runtime, I get a useful error:

/usr/lib/gcc/x86_64-alpine-linux-musl/10.3.1/../../../../x86_64-alpine-linux-musl/bin/ld: two_fer.o: warning: relocation in read-only section `.text'
/usr/lib/gcc/x86_64-alpine-linux-musl/10.3.1/../../../../x86_64-alpine-linux-musl/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
make: *** [Makefile:31: tests] Error 1

I don't quite understand why this is happening. I googled around, and although I'm unsure how to continue if I build a PIE, everything works fine if I change the cflags and ldflags to replace -pie with -no-pie and -fPIE with -fno-pie. Why do you want to build a PIE in the first place here? We're not building a shared library, so I'm unsure of the advantage. So I have a workaround (which, unfortunately, I can't use, since I can't change the Makefile on production), but it doesn't make sense that my code won't link properly while others' solutions work fine.

Answer 10 · 2022-04-05T21:43:26.000Z

I decided to force pie, since that's the default on a lot of systems these days.
You need to use, e.g., lea rsi, [rel msg_pre] when loading your strings from the .data section.
Or add default rel at the top of the file, and use lea rsi, [msg_pre].
This is known as RIP-relative addressing. See http://www.nynaeve.net/?p=192 for info on that.
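
As a rough illustration, here is a minimal standalone file using both forms. The file name rel_demo.asm and the label msg are made up for this sketch. The assemble and link step only runs if nasm and ld are installed. Linked with -pie, the linker should no longer need to create DT_TEXTREL for these loads.

```shell
#!/bin/sh
# Sketch: write a tiny NASM file that uses RIP-relative addressing.
# rel_demo.asm and the label msg are names invented for this example.
cat > rel_demo.asm <<'EOF'
default rel                 ; make [label] operands RIP-relative by default

section .data
msg db "hello, world!", 0

section .text
global _start
_start:
        lea rdi, [msg]      ; RIP-relative because of "default rel"
        lea rsi, [rel msg]  ; explicit form, works without "default rel"
        mov eax, 60         ; Linux x86-64 exit syscall number
        xor edi, edi        ; exit status 0
        syscall
EOF

# Assemble and link as a PIE, but only if the tools are available.
if command -v nasm >/dev/null 2>&1 && command -v ld >/dev/null 2>&1; then
    nasm -felf64 rel_demo.asm -o rel_demo.o && ld -pie rel_demo.o -o rel_demo
fi
```

By contrast, mov rdi, msg embeds the absolute address of msg in the instruction, which is what forces a relocation in .text when building a PIE.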

Answer 11 · 2022-04-05T22:32:20.000Z

Wow, thank you very much for that. I really appreciate it. I changed the addresses of the strings as you specified and it worked. I learned a lot 😄

Answer 12 · 2022-04-05T22:33:07.000Z

Actually, one other question: why does this work fine with gnu libc but not MUSL? Shouldn't it crash in both?

Answer 13 · 2022-04-06T09:21:23.000Z

Actually, one other question: why does this work fine with gnu libc but not MUSL? Shouldn't it crash in both?

Okay, I did some digging :p

I used the following example code:

$ cat foo.asm
section .data
msg db "hello, world!", 0

section .text
global _start
_start:
        mov rdi, msg
        ret

When linking with pie on Alpine I get the following:

$ nasm -felf64 foo.asm
$ ld -pie foo.o
ld: foo.o: warning: relocation in read-only section `.text'
ld: warning: creating DT_TEXTREL in a PIE

Doing the same on Ubuntu yields no output.
However, adding the --warn-shared-textrel flag yields the following output:

$ ld -pie --warn-shared-textrel foo.o
ld: foo.o: warning: relocation in read-only section `.text'
ld: warning: creating a DT_TEXTREL in a shared object

Looking at the following, it seems that relocations in the .text section are not supported in musl: https://bugs.gentoo.org/707660

All musl developers confirms that musl do not support DT_TEXTREL. Now they are discussing how to provide informative error instead of segfault (runtime protection).

Answer 14 · 2022-04-06T09:37:28.000Z

I agree that getting a segfault in this case is not a good user experience, and it's really hard to figure out what's going on. I'm not sure what the best solution would be, though.
Trying to detect this special case somehow, and give a more informative message to the user?
Disable pie?
Switch to an Ubuntu image, where this is allowed?
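
For the first option, one possible approach (just a sketch, not something the test runner does today) would be to look for a TEXTREL entry in the dynamic section of the linked binary before running it. The function names has_textrel and check_binary are made up for this example, and it assumes readelf from binutils is available in the image.

```shell
#!/bin/sh
# Sketch: succeed if `readelf -d` output (read from stdin) mentions TEXTREL.
# has_textrel and check_binary are hypothetical names for this example.
has_textrel() {
    grep -q 'TEXTREL'
}

# Print a hint instead of letting the binary segfault inside the loader.
# If readelf is missing or fails, nothing is detected and this returns 0.
check_binary() {
    if readelf -d "$1" 2>/dev/null | has_textrel; then
        echo "$1: contains text relocations, which musl does not support."
        echo "Hint: use RIP-relative addressing, e.g. lea rsi, [rel label]."
        return 1
    fi
    return 0
}
```

The Makefile could call something like check_binary tests right after linking and fail with the hint instead of a bare segfault.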