Stochastic (?) SSU Segfault
thermokarst opened this issue · 6 comments
This issue is an attempt to document this elusive segfault.
Environment:
- AWS EC2 AMI ami-45224425 in us-west-2
- Conda environment for QIIME 2 2019.1
- unifrac 0.9.3 py36h5a278d6_0 bioconda
Steps to recreate:
- Create environment as above (sorry, that is a bit of a cop-out)
- ulimit -c unlimited
- Using the test data in debug.zip, run:

for run in {1..100}; do ssu -i table.biom -t tree.nwk -m unweighted -o foo -n 8 & done

(this seems to segfault for me at least once each time, YMMV)

gdb ssu core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ssu...done.
[New LWP 31561]
[New LWP 31566]
[New LWP 31562]
[New LWP 31560]
[New LWP 31353]
[New LWP 31565]
[New LWP 31563]
[New LWP 31564]
[New LWP 31559]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `ssu -i table.biom -t tree.nwk -m unweighted -o foo -n 8'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00005598ea7e542b in su::unifrac(su::biom&, su::BPTree&, su::Method, std::vector<double*, std::allocator<double*> >&, std::vector<double*, std::allocator<double*> >&, su::task_parameters const*) ()
[Current thread is 1 (Thread 0x7fbb90672700 (LWP 31561))]
(gdb) bt full
#0 0x00005598ea7e542b in su::unifrac(su::biom&, su::BPTree&, su::Method, std::vector<double*, std::allocator<double*> >&, std::vector<double*, std::allocator<double*> >&, su::task_parameters const*) ()
No symbol table info available.
#1 0x00007fbb9508b447 in std::execute_native_thread_routine (__p=0x5598ebb53330)
at /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534627447954/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:83
No locals.
#2 0x00007fbb94cfc6ba in start_thread (arg=0x7fbb90672700) at pthread_create.c:333
__res = <optimized out>
pd = 0x7fbb90672700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140443558291200, 3707310354445776027, 0, 140727154066159, 140443558291904,
140727154066784, -3745829484106992485, -3745821988088433509}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#3 0x00007fbb9452541d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
# cpuinfo
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
stepping : 1
microcode : 0xb000033
cpu MHz : 2300.042
cache size : 46080 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs :
bogomips : 4600.08
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
I tried to recreate this in a Docker container running locally and was unable to.
Using the test data from above, I can see the following:
valgrind --leak-check=yes ssu -i table.biom -t tree.nwk -m unweighted -o foo -n 4
==13903== Memcheck, a memory error detector
==13903== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==13903== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==13903== Command: ssu -i table.biom -t tree.nwk -m unweighted -o foo -n 4
==13903==
==13903== Thread 3:
==13903== Invalid read of size 1
==13903== at 0x12942B: su::unifrac(su::biom&, su::BPTree&, su::Method, std::vector<double*, std::allocator<double*> >&, std::vector<double*, std::allocator<double*> >&, su::task_parameters const*) (in /home/evan/.conda/envs/q2-dev/bin/ssu)
==13903== by 0x4954407: execute_native_thread_routine (thread.cc:80)
==13903== by 0x4864FA9: start_thread (in /lib64/libpthread-2.29.so)
==13903== by 0x4C3671E: clone (in /lib64/libc-2.29.so)
==13903== Address 0x1 is not stack'd, malloc'd or (recently) free'd
==13903==
==13903==
==13903== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==13903== Access not within mapped region at address 0x1
==13903== at 0x12942B: su::unifrac(su::biom&, su::BPTree&, su::Method, std::vector<double*, std::allocator<double*> >&, std::vector<double*, std::allocator<double*> >&, su::task_parameters const*) (in /home/evan/.conda/envs/q2-dev/bin/ssu)
==13903== by 0x4954407: execute_native_thread_routine (thread.cc:80)
==13903== by 0x4864FA9: start_thread (in /lib64/libpthread-2.29.so)
==13903== by 0x4C3671E: clone (in /lib64/libc-2.29.so)
==13903== If you believe this happened as a result of a stack
==13903== overflow in your program's main thread (unlikely but
==13903== possible), you can try to increase the size of the
==13903== main thread stack using the --main-stacksize= flag.
==13903== The main thread stack size used in this run was 8388608.
==13903==
==13903== HEAP SUMMARY:
==13903== in use at exit: 52,197,210 bytes in 9,572 blocks
==13903== total heap usage: 468,613 allocs, 459,041 frees, 158,158,387 bytes allocated
==13903==
==13903== Thread 1:
==13903== 1,152 bytes in 4 blocks are possibly lost in loss record 3,735 of 3,848
==13903== at 0x4838B65: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==13903== by 0x4011E51: allocate_dtv (in /lib64/ld-2.29.so)
==13903== by 0x401279D: _dl_allocate_tls (in /lib64/ld-2.29.so)
==13903== by 0x4865C1E: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.29.so)
==13903== by 0x495458A: __gthread_create (gthr-default.h:675)
==13903== by 0x495458A: std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (thread.cc:135)
==13903== by 0x128287: su::process_stripes(su::biom&, su::BPTree&, su::Method, bool, std::vector<double*, std::allocator<double*> >&, std::vector<double*, std::allocator<double*> >&, std::vector<std::thread, std::allocator<std::thread> >&, std::vector<su::task_parameters, std::allocator<su::task_parameters> >&) (in /home/evan/.conda/envs/q2-dev/bin/ssu)
==13903== by 0x12F1D6: one_off (in /home/evan/.conda/envs/q2-dev/bin/ssu)
==13903== by 0x11CC09: mode_one_off(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, double, bool, unsigned int) (in /home/evan/.conda/envs/q2-dev/bin/ssu)
==13903== by 0x118E79: main (in /home/evan/.conda/envs/q2-dev/bin/ssu)
==13903==
==13903== LEAK SUMMARY:
==13903== definitely lost: 0 bytes in 0 blocks
==13903== indirectly lost: 0 bytes in 0 blocks
==13903== possibly lost: 1,152 bytes in 4 blocks
==13903== still reachable: 52,196,058 bytes in 9,568 blocks
==13903== suppressed: 0 bytes in 0 blocks
==13903== Reachable blocks (those to which a pointer was found) are not shown.
==13903== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==13903==
==13903== For lists of detected and suppressed errors, rerun with: -s
==13903== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)
> Address 0x1 is not stack'd, malloc'd or (recently) free'd

This is just baffling to me... if it were 0x0 everything would be "normal" (a plain null-pointer dereference). I just don't see how we would end up with uninitialized values in the threaded su::unifrac. It looks like there are debug symbols here, but maybe valgrind is just entirely off-base.
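For what it's worth, a non-null garbage address like 0x1 is consistent with dereferencing an indeterminate (default-initialized) pointer rather than a null one. A minimal sketch of the distinction; the struct and member names here are purely illustrative, not ssu's actual code:

```cpp
#include <cstddef>

// Hypothetical stand-in for a struct holding a status pointer.
struct params {
    bool *status;  // pointer member, no in-class initializer
};

// `new params` default-initializes: `status` holds indeterminate garbage,
// which need not be nullptr -- dereferencing it can fault at an address
// like 0x1 rather than 0x0.
// `new params()` value-initializes: `status` is guaranteed to be nullptr.
inline params *make_params_safe() {
    return new params();  // note the parentheses: value-initialization
}
```

A read through an indeterminate pointer is exactly what valgrind reports as "Invalid read" at a small non-null address.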
So this bug is coming from this line (line 512 in 55e52af). It is hit because any time thread 1 (taskp->tid == 1) starts running before thread 0 (taskp->tid == 0), report_status is used uninitialized (lines 322 to 328 in 55e52af), which explains why the failure is relatively stochastic.

@wasade maybe you can comment on whether this report_status functionality is even needed/used?
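For illustration, here is a minimal sketch of the ordering hazard and the usual fix: initialize the shared state in the parent before any worker thread is spawned. All names here (worker, run_workers, the simplified task_parameters and report_status) are hypothetical stand-ins, not ssu's actual code:

```cpp
#include <thread>
#include <vector>

// Simplified stand-in for su::task_parameters.
struct task_parameters {
    unsigned int tid;
};

// Shared state that, in the buggy pattern, thread 0 would populate *after*
// the workers had already started -- letting thread 1 read it uninitialized.
static std::vector<int> report_status;

static int worker(const task_parameters *taskp) {
    // Safe only because report_status is fully set up before any thread runs.
    return report_status[taskp->tid];
}

// The fix sketched: the parent initializes shared state, then spawns workers.
static std::vector<int> run_workers(unsigned int nthreads) {
    report_status.assign(nthreads, 0);  // initialize before spawning anything

    std::vector<task_parameters> params(nthreads);
    std::vector<std::thread> threads;
    std::vector<int> results(nthreads, -1);
    for (unsigned int i = 0; i < nthreads; ++i) {
        params[i].tid = i;
        threads.emplace_back([&, i] { results[i] = worker(&params[i]); });
    }
    for (auto &t : threads) t.join();
    return results;
}
```

If instead the tid==0 worker performed the assignment after the threads were launched, a thread 1 that happened to be scheduled first would read the shared state before it existed, which matches the stochastic failures seen above.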
You are amazing @gwarmstrong!