biod/sambamba

Segmentation fault with view (random access) on Lustre FS

tcezard opened this issue · 37 comments

Hi,
I should start by saying that our Lustre file system is new and we're only setting things up.

I'm running a sambamba view command that randomly segfaults when the data is on our Lustre file system.
The command I'm using is:

for i in {1..50};
do 
    echo $i
    sambamba view -f bam -L test_10015AT.bed test_10015AT-sort.bam > /dev/null
done

When run on Lustre I get this output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Segmentation fault (core dumped)
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
Segmentation fault (core dumped)
50
Segmentation fault (core dumped)

Here is the gdb output backtrace if that helps:

#0  0x00002aaaab61330b in __memset_sse2 () from /lib64/libc.so.6
#1  0x000000000057ef15 in gc.gc.GC.malloc() ()
#2  0x0000000000433114 in bio.core.bgzf.inputstream.BgzfInputStream.__ctor() ()
#3  0x0000000000434f70 in bio.bam.randomaccessmanager.RandomAccessManager.__T15readsFromChunksS313bio3bam9readrange11withOffsetsTAS3bio4core4bgzf5chunk5ChunkZ.readsFromChunks() ()
#4  0x0000000000435632 in std.algorithm.iteration.__T9MapResultS1483bio3bam19randomaccessmanager19RandomAccessManager47__T8getReadsS313bio3bam9readrange11withOffsetsZ8getReadsMFAS3bio3bam6region9BamRegionZ9__lambda4TS3std5range121__T3ZipTAAS3bio3bam6region9BamRegionTS3std5range63__T6RepeatTC3bio3bam19randomaccessmanager19RandomAccessManagerZ6RepeatZ3ZipZ.MapResult.front() ()
#5  0x0000000000416d86 in std.algorithm.iteration.__T6joinerTS3std9algorithm9iteration306__T9MapResultS1483bio3bam19randomaccessmanager19RandomAccessManager47__T8getReadsS313bio3bam9readrange11withOffsetsZ8getReadsMFAS3bio3bam6region9BamRegionZ9__lambda4TS3std5range121__T3ZipTAAS3bio3bam6region9BamRegionTS3std5range63__T6RepeatTC3bio3bam19randomaccessmanager19RandomAccessManagerZ6RepeatZ3ZipZ9MapResultZ.joiner() ()
#6  0x00000000004b8332 in sambamba.utils.view.alignmentrangeprocessor.BamSerializer.__T7processTS3std9algorithm9iteration356__T6joinerTS3std9algorithm9iteration306__T9MapResultS1483bio3bam19randomaccessmanager19RandomAccessManager47__T8getReadsS313bio3bam9readrange11withOffsetsZ8getReadsMFAS3bio3bam6region9BamRegionZ9__lambda4TS3std5range121__T3ZipTAAS3bio3bam6region9BamRegionTS3std5range63__T6RepeatTC3bio3bam19randomaccessmanager19RandomAccessManagerZ6RepeatZ3ZipZ9MapResultZ6joinerFS3std9algorithm9iteration306__T9MapResultS1483bio3bam19randomaccessmanager19RandomAccessManager47__T8getReadsS313bio3bam9readrange11withOffsetsZ8getReadsMFAS3bio3bam6region9BamRegionZ9__lambda4TS3std5range121__T3ZipTAAS3bio3bam6region9BamRegionTS3std5range63__T6RepeatTC3bio3bam19randomaccessmanager19RandomAccessManagerZ6RepeatZ3ZipZ9MapResultZ6ResultTC3bio3bam6reader9BamReaderZ.process() ()
#7  0x00000000004dd76c in sambamba.view.__T12sambambaMainTC3bio3bam6reader9BamReaderZ.sambambaMain() ()
#8  0x00000000004cb955 in sambamba.view.__T12sambambaMainTC3bio3bam6reader9BamReaderZ.sambambaMain() ()
#9  0x00000000004054e3 in sambamba.view.view_main() ()
#10 0x0000000000588924 in _d_run_main ()
#11 0x00002aaaab5add5d in __libc_start_main () from /lib64/libc.so.6
#12 0x0000000000404969 in _start ()

When I run the same command on the local filesystem there are no segmentation faults.
Is there any known problem with Lustre file system?
I also ran the equivalent samtools view command, but that did not result in segfaults.
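
To put a number on how often the crash occurs, the loop from the report can be extended to count exit statuses. The following is a minimal sketch, not part of sambamba: `count_segfaults` is a hypothetical helper name, and it assumes the shell reports a process killed by SIGSEGV as exit status 139 (128 + signal 11), which is the usual convention.

```shell
#!/bin/sh
# count_segfaults N CMD [ARGS...]   (hypothetical helper)
# Runs CMD N times and prints how many runs ended with exit
# status 139, i.e. 128 + SIGSEGV under the usual shell convention.
count_segfaults() {
    runs=$1
    shift
    i=0
    segv=0
    while [ "$i" -lt "$runs" ]; do
        i=$((i + 1))
        status=0
        # Discard stdout and stderr; only the exit status matters here.
        "$@" > /dev/null 2>&1 || status=$?
        if [ "$status" -eq 139 ]; then
            segv=$((segv + 1))
        fi
    done
    echo "$segv/$runs runs segfaulted"
}

# Example, using the file names from the report:
# count_segfaults 50 sambamba view -f bam -L test_10015AT.bed test_10015AT-sort.bam
```

Running the same call once on Lustre and once on the local filesystem would give two directly comparable rates.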

Hi,
I'm not aware of any inherent issues with Lustre FS, as bcbio-nextgen is often deployed on it.

Could you try to compile the tool from source? It would be helpful to see the backtrace of a debug build. (make sambamba-ldmd2-debug)

I may be running into similar issues using bcbio that we've not been able to reproduce outside of a specific compute environment (Raijin @ NCI Australia). The biggest hurdle to debugging is not being able to build a debug version from source, as LDC is not available here. While I am trying to have it installed globally, would it be possible to get a current sambamba binary with debug flags set?

Looking at the stack trace the problem is an interplay of bgzf and the GC on specific machines. @chapmanb also complained about a similar issue. I recently built sambamba with a newer ldc and it may behave differently. @tcezard and @ohofmann: are you able to identify the machine that segfaults so we can reproduce this issue? I can send you sambamba with and without debug support for testing.

@pjotrp Yes, absolutely -- Brad put together a reproducible example, we just haven't managed to crash it anywhere other than Raijin, and the support team there is still busy trying to install LDC so we can build sambamba from scratch. Happy to give the test version a whirl.

OK, I'll build it and make it available on Thursday or Friday when I have good internet again. I'll also provide an ldc build using http://lists.gnu.org/archive/html/guix-devel/2017-01/msg01322.html so you can build yourself.

Using GNU Guix I have created a relocatable version of sambamba with full debug
information included.

Download the tarball from

http://biogems.info/contrib/genenetwork/guix-sambamba-debug-0.6.5-x86_64.tgz

md5sum is 6eaefc19adcf2dbce60cf18a15faea4a. Unpack the
tarball. Install the software by running the contained install.sh
script with the target dir, e.g.

  ./install.sh $HOME/opt/sambamba-debug

You can find the sambamba binary in the target dir.
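
The checksum comparison can be scripted so that a corrupted download is caught before unpacking. This is a small sketch under the assumption that `md5sum` from GNU coreutils is available; `verify_md5` is a hypothetical helper name, not something shipped with the tarball.

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED   (hypothetical helper)
# Returns 0 if the MD5 digest of FILE equals EXPECTED, 1 otherwise.
# Assumes md5sum from GNU coreutils is on the PATH.
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <file>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
        return 0
    fi
    echo "MISMATCH: $file (got $actual, expected $expected)" >&2
    return 1
}

# Example, using the checksum quoted above:
# verify_md5 guix-sambamba-debug-0.6.5-x86_64.tgz 6eaefc19adcf2dbce60cf18a15faea4a
```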

To test the backtrace you can trigger a segfault simply with

  gdb --args sambamba view
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
sambamba.view.printUsage() () at /home/wrk/izip/git/opensource/D/sambamba/sambamba/view.d:93
warning: Source file is more recent than executable.
93          *p = 'X'; // force an exception
(gdb) bt
#0  sambamba.view.printUsage() () at /home/wrk/izip/git/opensource/D/sambamba/sambamba/view.d:93
#1  0x00000000004a7886 in sambamba.view.view_main(immutable(char)[][]) (args=...) at /home/wrk/izip/git/opensource/D/sambamba/sambamba/view.d:181
#2  0x000000000060fab0 in D main (args=...) at /home/wrk/izip/git/opensource/D/sambamba/main.d:74

Please test. Maybe the problem goes away without optimizations.

I realise the debug output will be less informative for you because the source paths are missing. The quick workaround is to check out the sambamba files in /home/wrk/izip/git/opensource/D/sambamba.

Man, I am challenged. Reopening.

Symbols are in this file http://biogems.info/contrib/genenetwork/sambamba.debug.tgz. If you check out the source dir you should be able to get full debug. Please try. Mail me if you need more instructions.

I managed to deploy this, but of course our reproducible example that caused the core dump no longer reproduces, at least not with this build. I am now running the whole bcbio workflow (somatic WGS) to see if this holds up or if we need to come up with a new test case. More in a bit.

That would be good news. ldc was updated. If it holds up, can you also take a look at performance - we don't want it to degrade ;). In the next step I can update LLVM to latest too.

I haven't fully followed this thread, but just a couple of notes in case they are relevant:

  1. #260 should mean debug symbols are available more easily (in releases, perhaps)
  2. I have experienced segfaults in the past with certain compiler/library combinations that are not "real" segfaults - they hide normal exceptions thrown by sambamba due to issues with exception handling in the library/compiler. This certainly applied to the official release binaries. I think I remember some older compiler/libc incident with stack unwinding or similar and didn't look into this further. What I do know is that we run custom-compiled sambamba in production and no longer see this, as well as getting a much better debugging experience (line numbers and locals). My "blessed" version is LDC 1.0.0 built with LLVM 3.7.1. Correcting this is likely to reveal the real problem if you are experiencing something similar. Of course, it could also be a "real" segfault.
  3. 2 should be made easier by the discussion around standardising/improving release binaries (see end of closed ticket #243).

Sorry if some of this does not apply here.

I am on a slight tangent because I am using GNU Guix to build sambamba with or without symbols. These binaries can now be installed anywhere and do not require Docker - you can try the URLs above. We are trying to track down this particular bug - when that is done I can propose creating releases from Guix (that will also work in Docker). Building from source with ldc is a bit of a challenge, and with Guix we can at least use distribution-agnostic binaries that require no admin rights to install and run.

I'm not following you – nothing I'm aware of requires Docker; I do not run sambamba in Docker. It is being used for CI builds/releases, and with/without symbols is already supported.

What I'm suggesting is that the segfault symptom may be hiding a different underlying issue. For example, if you can trigger a segfault by providing an invalid command line, you may have the same issue I experienced with the official binaries a few months back (again: not Docker-related). You seemed to suggest that this was the case – sambamba view without a filename triggered it. What you should see here is an error message, but for me there was a segfault due to something going wrong in exception handling – outside of sambamba. When I compiled with LDC 1.0.0, itself compiled with LLVM 3.7.1, not Docker related, this problem went away and I got proper exceptions/error messages as intended by sambamba.

If it's already clear to you that there's a "real" segfault - i.e. sambamba itself is dereferencing NULL or similar, then disregard my comments.
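
One quick way to tell the two cases apart is to look at the exit status after feeding the tool an invalid command line. The sketch below relies on the common shell convention that a status above 128 means the process was killed by signal number (status - 128); `describe_exit` is a hypothetical helper name, and a program that itself calls exit with such a value would be misclassified.

```shell
#!/bin/sh
# describe_exit STATUS   (hypothetical helper)
# Interprets STATUS, the value of $? after a command has finished.
# By convention a status above 128 means "killed by signal STATUS - 128".
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    elif [ "$status" -eq 0 ]; then
        echo "exited normally"
    else
        echo "exited with error status $status"
    fi
}

# Example: an invalid command line should end in an error status,
# not in signal 11 (SIGSEGV).
# sambamba view; describe_exit $?
```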

Okay, false alarm. It seems I am still segfaulting depending on which machine I end up on:

/jobfs/local/pbs/mom_priv/jobs/1893858.r-man2.SC: line 12: 25567 Segmentation fault      /home/563/omh563/test/sambamba-0.6.5-5a33d57-14jbgb47fspy1l69hn30p/bin/sambamba sort -N -t 10 -m 1G --tmpdir=txtmp-test-dedup-nsort -o test-dedup-nsort.bam test-dedup.bam

I've not been able to run this with gdb:

$ gdb --args sambamba sort -N -t 8 -m 1G --tmpdir=txtmp-test-

This GDB was configured as "x86_64-redhat-linux-gnu".

Reading symbols from /g/data3/gx8/local/share/bcbio/anaconda/bin/sambamba...(no debugging symbols found)...done.

(gdb) run
Starting program: /g/data3/gx8/local/share/bcbio/anaconda/bin/sambamba sort -N -t 8 -m 1G --tmpdir=txtmp-test-dedup-nsort -o test-dedup-nsort.bam test-dedup.bam
Missing separate debuginfo for /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/ld-linux-x86-64.so.2
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
Missing separate debuginfo for /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/librt.so.1
Missing separate debuginfo for /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libpthread.so.0
[Thread debugging using libthread_db enabled]
Missing separate debuginfo for /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libm.so.6
Missing separate debuginfo for /home/563/omh563/test/gcc-4.9.3-lib-if3ww39qs6267acvl2l9a/lib/libgcc_s.so.1
Missing separate debuginfo for /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libc.so.6
[New Thread 0x7ffff6ede700 (LWP 25202)]
[New Thread 0x7fffebfff700 (LWP 25203)]
[New Thread 0x7fffe7ffe700 (LWP 25204)]
[New Thread 0x7fffd3fff700 (LWP 25205)]
[New Thread 0x7fffcfffe700 (LWP 25206)]
[New Thread 0x7fffcbffd700 (LWP 25207)]
[New Thread 0x7fffb7fff700 (LWP 25208)]
[New Thread 0x7fffb3ffe700 (LWP 25209)]

Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x7fffb3ffe700 (LWP 25209)]
0x00007ffff79c207f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libpthread.so.0

(gdb) bt
#0  0x00007ffff79c207f in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libpthread.so.0
#1  0x000000000069b650 in core.sync.condition.Condition.wait() ()
#2  0x0000000000689ed4 in std.parallelism.TaskPool.pop() ()
#3  0x0000000000689da8 in std.parallelism.TaskPool.executeWorkLoop() ()
#4  0x000000000069764b in thread_entryPoint ()
#5  0x00007ffff79bc434 in start_thread () from /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libpthread.so.0
#6  0x00007ffff6fc6d8d in clone () from /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libc.so.6

This fails right away whereas the production run crash happens after writing the sorted file to disk. I suspect the missing libraries cause issues when trying to debug the threaded run?

They should not cause issues as all libraries are in fact found. What is missing are debug symbols which were in a separate download. Anyway, I think the good news is that we are homing in on the problem which looks a bit like http://www.digitalmars.com/d/archives/digitalmars/D/learn/GC_dead-locking_47223.html

In the next step I'll send you a download of a new sambamba build with debug symbols included, plus instructions. We can then try again and should reach the point where it fails in sambamba itself, dumping stack traces of all threads.

A few things here:

  • You can install the debug symbols as mentioned above; they should be easily available via distribution packages.
  • SIGUSR1/2 are normal with the GC, you should disable them in GDB: handle SIGUSR1 SIGUSR2 nostop noprint

@sambrightman Thanks, the handle command was helpful:

Missing separate debuginfo for /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libpthread.so.0
[Thread debugging using libthread_db enabled]
Missing separate debuginfo for /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libm.so.6
Missing separate debuginfo for /home/563/omh563/test/gcc-4.9.3-lib-if3ww39qs6267acvl2l9a/lib/libgcc_s.so.1
Missing separate debuginfo for /home/563/omh563/test/glibc-2.23-m9vxvhdj691bq1f85lpfl/lib/libc.so.6
[New Thread 0x7ffff6ede700 (LWP 20619)]
[New Thread 0x7ffff2edd700 (LWP 20620)]
[New Thread 0x7fffe7fff700 (LWP 20621)]
[New Thread 0x7fffd3fff700 (LWP 20622)]
[New Thread 0x7fffcfffe700 (LWP 20623)]
[New Thread 0x7fffc3fff700 (LWP 20624)]
[New Thread 0x7fffbfffe700 (LWP 20625)]
[New Thread 0x7fffb3fff700 (LWP 20626)]
[New Thread 0x7fffa7fff700 (LWP 20627)]
[New Thread 0x7fff93fff700 (LWP 20628)]

[Thread 0x7fff93fff700 (LWP 20628) exited]
[Thread 0x7fffd3fff700 (LWP 20622) exited]
[Thread 0x7ffff6ede700 (LWP 20619) exited]
[Thread 0x7fffc3fff700 (LWP 20624) exited]
[Thread 0x7fffa7fff700 (LWP 20627) exited]
[Thread 0x7fffe7fff700 (LWP 20621) exited]
[Thread 0x7fffcfffe700 (LWP 20623) exited]
[Thread 0x7fffbfffe700 (LWP 20625) exited]
[Thread 0x7fffb3fff700 (LWP 20626) exited]
[Thread 0x7ffff2edd700 (LWP 20620) exited]

Program received signal SIGSEGV, Segmentation fault.
0x00000000006a4254 in rt.sections_elf_shared.finiTLSRanges() ()
(gdb)
(gdb) bt
#0  0x00000000006a4254 in rt.sections_elf_shared.finiTLSRanges() ()
#1  0x00000000006a3ebc in rt.tlsgc.destroy() ()
#2  0x0000000000697996 in core.thread.Thread.__dtor() ()
#3  0x00000000006a782e in rt_finalize2 ()
#4  0x00000000006b21aa in gc.gc.Gcx.sweep() ()
#5  0x00000000006b0517 in gc.gc.Gcx.fullcollect() ()
#6  0x00000000006b07b7 in gc.gc.GC.__T9runLockedS56_D2gc2gc2GC18fullCollectNoStackMFNbZ2goFNbPS2gc2gc3GcxZmTPS2gc2gc3GcxZ.runLocked() ()
#7  0x00000000006a01a0 in gc_term ()
#8  0x00000000006a4acb in rt_term ()
#9  0x00000000006a4d49 in _d_run_main ()
#10 0x000000000060fe05 in main ()

@pjotrp - over to you.

dlang/druntime@b22d813 looks familiar, especially since it only contains one instruction. Let me check what is in the runtime.

Yes, the instruction still sits in the ldc runtime we use. I'll create an update.

@ohofmann thanks. I'll send you another one to try soon. Question, have you isolated the machine this happens on, or is it random on the cluster?
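
To help answer that question, each failing run can record the node it ran on. This is a minimal sketch, assuming the job script can be wrapped; `run_and_log` and the log file name are hypothetical, and `uname -n` is used to obtain the node name.

```shell
#!/bin/sh
# run_and_log LOGFILE CMD [ARGS...]   (hypothetical helper)
# Runs CMD; if it fails, appends the node name and exit status to
# LOGFILE. Returns CMD's exit status unchanged.
run_and_log() {
    logfile=$1
    shift
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        # uname -n prints the network node name of the current machine.
        echo "host=$(uname -n) status=$status cmd=$*" >> "$logfile"
    fi
    return "$status"
}

# Example, wrapping the failing command from the job log above:
# run_and_log crashes.log sambamba sort -N -t 10 -m 1G \
#     --tmpdir=txtmp-test-dedup-nsort -o test-dedup-nsort.bam test-dedup.bam
```

Collecting this log over many jobs would show whether the crashes cluster on particular nodes or are spread across the cluster.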

comment on dlang/druntime@b22d813

I built sambamba with the latest ldc and runtime 1.1.0.

Instructions for installing Guix relocatable sambamba and debugging

Fetch

wget http://biogems.info//contrib/genenetwork/kywyw1q5cmblj30yskyrdmpn87059xi4-sambamba-0.6.6-pre1-91096e7-debug-x86_64.tar.bz2

The md5sum should be

ea7e11f38c1983595323ceff40cd4fb3  kywyw1q5cmblj30yskyrdmpn87059xi4-sambamba-0.6.6-pre1-91096e7-debug-x86_64.tar.bz2

unpack

mkdir tmp
cd tmp
tar xvjf ../kywyw1q5cmblj30yskyrdmpn87059xi4-sambamba-0.6.6-pre1-91096e7-debug-x86_64.tar.bz2

run the installer with target dir, e.g.

./install.sh ~/opt/sambamba-debug

Now you should be able to run

~/opt/sambamba-debug/sambamba-0.6.6-pre1-91096e7-kywyw1q5cmblj3/bin/sambamba --version
sambamba 0.6.6-pre1

This version was built with:
 LDC 1.1.0
 using DMD v2.071.2
 using LLVM 3.7.1
 bootstrapped with LDC - the LLVM D compiler (0.17.1)

To run with the debugger, use

gdb --args ~/opt/sambamba-debug/sambamba-0.6.6-pre1-91096e7-kywyw1q5cmblj3/bin/sambamba view --throw-error

it will complain about a CRC, so we need to fetch the original debug file with the command

(gdb) symbol-file gnu/store/kywyw1q5cmblj30yskyrdmpn87059xi4-sambamba-0.6.6-pre1-91096e7/bin/sambamba.debug
Reading symbols from gnu/store/kywyw1q5cmblj30yskyrdmpn87059xi4-sambamba-0.6.6-pre1-91096e7/bin/sambamba.debug...done.

now run

(gdb) handle SIGUSR1 SIGUSR2 nostop noprint
(gdb) run

and you should see

 Program received signal SIGSEGV, Segmentation fault.
 0x000000000051a0c8 in sambamba.view.view_main(immutable(char)[][]) (args=...) at view.d:151

if you see view.d:151 the symbols are loaded correctly (@sambrightman: I cause the segfault)

If we get another segfault, show us all threads with

 (gdb) thread apply all backtrace full

If you want listings you can add the source directory, also included with the installer, e.g.

 (gdb) directory gnu/store/z6c5c9zxvk5glgwd519wkfmi399x5x7h-sambamba-0.6.6-pre1-91096e7-checkout/sambamba

which shows the actual line

  Program received signal SIGSEGV, Segmentation fault.
  0x000000000051a0c8 in sambamba.view.view_main(immutable(char)[][]) (args=...) at view.d:151
  151             *p = 'X'; // force an exception

@ohofmann I hope this version of sambamba+ldc fixes our issue
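
The interactive gdb steps above can also be run unattended, which is convenient inside a batch job. The sketch below is an assumption-laden convenience wrapper, not part of the installer: `write_gdb_script` and `gdb_backtrace` are hypothetical names, and it uses gdb's `-batch`, `-x` and `--args` options. The `symbol-file` line from the instructions could be added as the first command if the symbols are not picked up automatically.

```shell
#!/bin/sh
# write_gdb_script FILE   (hypothetical helper)
# Writes the gdb commands from the instructions above to FILE.
write_gdb_script() {
    cat > "$1" <<'EOF'
handle SIGUSR1 SIGUSR2 nostop noprint
run
thread apply all backtrace full
EOF
}

# gdb_backtrace CMD [ARGS...]   (hypothetical helper)
# Runs CMD under gdb non-interactively; when the program stops
# (for example on SIGSEGV) a full backtrace of all threads is printed.
gdb_backtrace() {
    script=$(mktemp)
    write_gdb_script "$script"
    status=0
    gdb -batch -x "$script" --args "$@" || status=$?
    rm -f "$script"
    return "$status"
}

# Example:
# gdb_backtrace ~/opt/sambamba-debug/sambamba-0.6.6-pre1-91096e7-kywyw1q5cmblj3/bin/sambamba view --throw-error
```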

Thanks @pjotrp -- can you check the install.sh? It fails for me with an Error: File exists @ dir_s_mkdir - /home/563/omh563/debug/glibc-2.23-m9vxvhdj691bq1f85lpf when pointing at an empty directory.

I'll check. Can you run ./install.sh -v -d TARGETDIR so I get the full stack trace? Note that you need to remove the old folder TARGETDIR before installing (it won't overwrite).

Of course:

gnu-install-bin 0.0.1-pre1 Copyright (C) 2017 Pjotr Prins <pjotr.prins@thebird.nl>

DEBUG Message: {:strategy=>:fixed, :show_help=>false, :verbose=>true, :debug=>true, :guix_relocate=>"./installer/bin/guix-relocate", :patchelf=>"./installer/bin/patchelf"}
DEBUG Exec {:strategy=>:fixed, :show_help=>false, :verbose=>true, :debug=>true, :guix_relocate=>"./installer/bin/guix-relocate", :patchelf=>"./installer/bin/patchelf"}
["/home/563/omh563/debug"]
DEBUG Installer: {:strategy=>:fixed, :show_help=>false, :verbose=>true, :debug=>true, :guix_relocate=>"./installer/bin/guix-relocate", :patchelf=>"./installer/bin/patchelf"}
Got target dir /home/563/omh563/debug
Expand target dir to /home/563/omh563/debug and create
Checking directory structure of ./installer/bin
Processing files...
WARNING Symlink gnu/store/0mkxmwcykgz7dknap50wn4nfhh0kl8j4-tzdata-2015g/share/zoneinfo/posix is not valid
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/sncfamh3fjrrdgd950d79g6yml2s6a07-libffi-3.2.1/include points to nothing in the store!

...

WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_MESSAGES/sed.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_MESSAGES/bfd.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_MESSAGES/bash.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_MESSAGES/binutils.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_MESSAGES/findutils.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_MESSAGES/coreutils.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_MESSAGES/grep.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_MESSAGES/tar.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_MESSAGES/opcodes.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/locale/ro/LC_TIME/coreutils.mo points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/svg3k07grccgn6a2k4k7k6hcqqlynx9j-profile/share/awk points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/cny25x3kx0z7c93dsnd9vsxasjgln76d-acl-2.2.52/libexec/libacl.so points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/cny25x3kx0z7c93dsnd9vsxasjgln76d-acl-2.2.52/lib/libacl.a points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/cny25x3kx0z7c93dsnd9vsxasjgln76d-acl-2.2.52/lib/libacl.la points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/0jx4nqk33vd8xsgfkfay9vx4zv9pacd0-libffi-3.2.1/include points to nothing in the store!
WARNING Symlink guix-sambamba-debug-0.6.5-x86_64/gnu/store/0mkxmwcykgz7dknap50wn4nfhh0kl8j4-tzdata-2015g/share/zoneinfo/posix is not valid
Resolving references...
Copy files...
DEBUG Skipping /home/563/omh563/debug/gnu/store
./installer/bin/gnu-install-bin:231:in `mkdir': File exists @ dir_s_mkdir - /home/563/omh563/debug/glibc-2.23-m9vxvhdj691bq1f85lpf (Errno::EEXIST)
        from ./installer/bin/gnu-install-bin:231:in `block in <main>'
        from ./installer/bin/gnu-install-bin:208:in `each'
        from ./installer/bin/gnu-install-bin:208:in `<main>'
Done

That is what I thought. You may need to do

rm -rf ~/debug 

beforehand. Try using a different prefix to test. I'll add a --force switch to the installer and test again.

@pjotrp, the directory doesn't exist beforehand. If it does, the installer complains right away:

omh563@raijin4:~/install$ ./install.sh -v -d ~/debug           
gnu-install-bin 0.0.1-pre1 Copyright (C) 2017 Pjotr Prins <pjotr.prins@thebird.nl>

DEBUG Message: {:strategy=>:fixed, :show_help=>false, :verbose=>true, :debug=>true, :guix_relocate=>"./installer/bin/guix-relocate", :patchelf=>"./installer/bin/patchelf"}
DEBUG Exec {:strategy=>:fixed, :show_help=>false, :verbose=>true, :debug=>true, :guix_relocate=>"./installer/bin/guix-relocate", :patchelf=>"./installer/bin/patchelf"}
["/home/563/omh563/debug"]
DEBUG Installer: {:strategy=>:fixed, :show_help=>false, :verbose=>true, :debug=>true, :guix_relocate=>"./installer/bin/guix-relocate", :patchelf=>"./installer/bin/patchelf"}
Got target dir /home/563/omh563/debug
Expand target dir to /home/563/omh563/debug and create
./installer/bin/gnu-install-bin:104:in `mkdir': File exists @ dir_s_mkdir - /home/563/omh563/debug (Errno::EEXIST)
	from ./installer/bin/gnu-install-bin:104:in `<main>'
Done

If it doesn't, it creates the directory and lots of subdirs and starts setting things up, but fails at the glibc dir.

Thanks for the detailed report. I'll look into it.

It is a strange error because that directory is only created once. Somehow the file system reports it (still) exists. To nail it down I updated the tarball - md5sum 52def24ae8371a3b286252fddd7cf2f5, please run with

./install.sh TARGET -v -d --force

you can send the log to pjotr.public01@thebird.nl

After a bit of offline back and forth we ended up with a new trace:

Thread 1 (process 10344):
#0  0x00000000006bb7c4 in rt.sections_elf_shared.finiTLSRanges() ()
No symbol table info available.
#1  0x00007fffec0008f0 in ?? ()
No symbol table info available.
#2  0x00000000006bb42c in rt.tlsgc.destroy() ()
No symbol table info available.
#3  0x00007ffff7ef1500 in ?? ()
No symbol table info available.
#4  0x00000000006aecf6 in core.thread.Thread.__dtor() ()
No symbol table info available.
#5  0x00007ffff7ef1500 in ?? ()
No symbol table info available.
#6  0x00000000006bed9e in rt_finalize2 ()
No symbol table info available.
#7  0x0000000000000050 in ?? () at /tmp/guix-build-sambamba-0.6.6-pre1-91096e7.drv-0/source/BioD/bio/core/sequence.d:42
No locals.
#8  0x0000000000000050 in ?? () at /tmp/guix-build-sambamba-0.6.6-pre1-91096e7.drv-0/source/BioD/bio/core/sequence.d:42
No locals.
#9  0x00007ffff7ef1500 in ?? ()
No symbol table info available.
#10 0x0000000000000010 in ?? () at /tmp/guix-build-sambamba-0.6.6-pre1-91096e7.drv-0/source/BioD/bio/core/sequence.d:42
No locals.
#11 0x00000000006c971a in gc.gc.Gcx.sweep() ()
No symbol table info available.
#12 0x0000000000032f7c in ?? ()
No locals.
#13 0x00000000009d70d0 in ?? ()
No symbol table info available.
#14 0x0000000000000018 in ?? () at /tmp/guix-build-sambamba-0.6.6-pre1-91096e7.drv-0/source/BioD/bio/core/sequence.d:42
No locals.
#15 0x0000000000000000 in ?? ()
No symbol table info available.

Looks like the upstream fix related to dlang/druntime@b22d813 in latest Druntime did not fix the issue. I am going to patch out the instruction that throws the exception and we retry.

New version appears to fix the issue:

 http://biogems.info/contrib/genenetwork/z9imqq7aybingq831ij4wpd3j38xxzaf-sambamba-0.6.6-pre2-91096e7-debug-x86_64.tar.bz2
 md5sum 3b93a34e96ab38d6fcbd6706d419cbc9

@ohofmann what is the status?

I added a new binary install of sambamba 0.6.6-pre3 with debug information on https://github.com/pjotrp/sambamba#troubleshooting. The issue around read sorting in 'sambamba depth' should be resolved by 48ac7aa.

Please test this version on your HPC. When it works we'll make a proper release which will run faster.

Fixed in 0.6.6-pre3

Note that you need the patched ldc compiler to have this fix.

https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/

led us to suspect the problem is an Intel bug:

On Mon, Jan 22, 2018 at 11:39:39AM +1100, Oliver Hofmann wrote:

Close enough - Intel Xeon E5-2690 v4 on the nodes that crashed.

That processor is listed as having this problem. I think we have found
it! It is called “BDF76 An Intel® Hyper-Threading Technology Enabled
Processor May Exhibit Internal Parity Errors or Unpredictable System
Behavior”. The symptoms described for this issue are very broad
(“unpredictable system behavior may occur”), but what we were
observing seemed to match the description of this issue better than
any other.

A microcode update is available. Switching off hyperthreading
in the BIOS should also fix it. There is no LLVM fix that I can find (Intel
says there is no possible fix available).

See also

https://www.digitaltrends.com/computing/intel-hyperthreading-bug-kaby-skylake/
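
To check whether a given node is a candidate for this erratum, the CPU model, microcode revision and Hyper-Threading state can be read from /proc/cpuinfo. This is a rough sketch that assumes an x86 Linux node where the fields `model name`, `microcode`, `siblings` and `cpu cores` are present; `cpu_field` and `hyperthreading_enabled` are hypothetical helper names. It does not tell you whether the installed microcode contains the fix; that still has to be compared against Intel's documentation.

```shell
#!/bin/sh
# cpu_field NAME [FILE]   (hypothetical helper)
# Prints the value of the first "NAME : value" line in a
# /proc/cpuinfo-style FILE (default: /proc/cpuinfo).
cpu_field() {
    awk -v name="$1" '
        {
            pos = index($0, ":")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim the field name
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim the value
            if (key == name) { print val; exit }
        }
    ' "${2:-/proc/cpuinfo}"
}

# hyperthreading_enabled [FILE]   (hypothetical helper)
# Succeeds if the first CPU entry lists more logical siblings than
# physical cores, which indicates that Hyper-Threading is active.
hyperthreading_enabled() {
    siblings=$(cpu_field "siblings" "${1:-/proc/cpuinfo}")
    cores=$(cpu_field "cpu cores" "${1:-/proc/cpuinfo}")
    [ -n "$siblings" ] && [ -n "$cores" ] && [ "$siblings" -gt "$cores" ]
}

# Example:
# cpu_field "model name"; cpu_field "microcode"
# hyperthreading_enabled && echo "Hyper-Threading appears to be on"
```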

We had two bugs. One is fixed in ldc (dlang/druntime#1655). The other is an unfixable Intel Xeon bug; see #335 for more information.