COMBINE-lab/pufferfish

make install does not work, pufferfish binary segfaults

mathog opened this issue · 20 comments

Pufferfish's "make install" does not install everything. This is on CentOS 8.

  package=pufferfish #from current_version.txt in download
  pversion=1.0.0
  TOPDIR=/usr/common/modules/el8/x86_64/software/${package}/${pversion}-CentOS-vanilla
  THELUA=/usr/common/modules/el8/x86_64/modules/all/${package}/${pversion}-CentOS-vanilla.lua
  #as modules on 2020/06/09 for CentOS 8
  cd /usr/common/src
  git clone https://github.com/COMBINE-lab/pufferfish.git
  mv ${package} ${package}-${pversion}
  cd  ${package}-${pversion}
  #Findjemalloc.cmake is broken, use the one from Salmon
  cd cmake/Modules
  mv FindJemalloc.cmake Hide_FindJemalloc.cmake
  cp  ../../../salmon-1.2.1/cmake/Modules/FindJemalloc.cmake .
  cd ../..
  mkdir build
  cd build
  module load htslib #apparently ignored, it built its own anyway
  cmake -DCMAKE_INSTALL_PREFIX=$TOPDIR .. | tee cmake_2020_06_09.log
  make 2>&1 | tee build_2020_06_09.log
  make install 2>&1 | tee install_2020_06_09.log
  #does not install everything
  mkdir -p $TOPDIR/bin
  cd src
  mv bcalm_pufferize cedar edgedensity edgedensity2 \
    filtersam krakmap pufferfish $TOPDIR/bin
  mv lib*.a $TOPDIR/lib64

After that these are the contents of $TOPDIR:

$TOPDIR/lib
$TOPDIR/lib/libntcard.a
$TOPDIR/lib/ntcard
$TOPDIR/lib/ntcard/ntcard-targets.cmake
$TOPDIR/lib/ntcard/ntcard-targets-release.cmake
$TOPDIR/lib/libgraphdump.a
$TOPDIR/lib/graphdump
$TOPDIR/lib/graphdump/graphdump-targets.cmake
$TOPDIR/lib/graphdump/graphdump-targets-release.cmake
$TOPDIR/lib/libtwopaco.a
$TOPDIR/lib/twopaco
$TOPDIR/lib/twopaco/twopaco-targets.cmake
$TOPDIR/lib/twopaco/twopaco-targets-release.cmake
$TOPDIR/cmake
$TOPDIR/cmake/Async++Config.cmake
$TOPDIR/cmake/Async++.cmake
$TOPDIR/cmake/Async++-release.cmake
$TOPDIR/lib64
$TOPDIR/lib64/libasync++.a
$TOPDIR/lib64/libksw2pp.a
$TOPDIR/lib64/libpuffer.a
$TOPDIR/include
$TOPDIR/include/async++.h
$TOPDIR/include/async++
$TOPDIR/include/async++/aligned_alloc.h
$TOPDIR/include/async++/cancel.h
$TOPDIR/include/async++/continuation_vector.h
$TOPDIR/include/async++/parallel_for.h
$TOPDIR/include/async++/parallel_invoke.h
$TOPDIR/include/async++/parallel_reduce.h
$TOPDIR/include/async++/partitioner.h
$TOPDIR/include/async++/range.h
$TOPDIR/include/async++/ref_count.h
$TOPDIR/include/async++/scheduler.h
$TOPDIR/include/async++/scheduler_fwd.h
$TOPDIR/include/async++/task.h
$TOPDIR/include/async++/task_base.h
$TOPDIR/include/async++/traits.h
$TOPDIR/include/async++/when_all_any.h
$TOPDIR/bin
$TOPDIR/bin/bcalm_pufferize
$TOPDIR/bin/cedar
$TOPDIR/bin/edgedensity
$TOPDIR/bin/edgedensity2
$TOPDIR/bin/filtersam
$TOPDIR/bin/krakmap
$TOPDIR/bin/pufferfish

It is unclear to me which, if any, of the files that "make install" initially placed under $TOPDIR really should be there. The documentation for building pufferfish only describes the steps up to "make".

The other issue: pufferfish built this way does not run.

cd src
./pufferfish
Segmentation fault (core dumped)
./pufferfish -h # or -help or --help
Segmentation fault (core dumped)
cat >/tmp/some.fasta <<'EOD'
>one
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>two
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>three
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>four
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
EOD
./pufferfish index -r /tmp/some.fasta -o /tmp
Segmentation fault (core dumped)
ldd ./pufferfish
        linux-vdso.so.1 (0x00007ffeee9c1000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f99a3236000)
        libtbbmalloc_proxy.so.2 => /lib64/libtbbmalloc_proxy.so.2 (0x00007f99a3032000)
        libtbbmalloc.so.2 => /lib64/libtbbmalloc.so.2 (0x00007f99a2e01000)
        libtbb.so.2 => /lib64/libtbb.so.2 (0x00007f99a2bc3000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f99a29bf000)
        libjemalloc.so.2 => /lib64/libjemalloc.so.2 (0x00007f99a2520000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f99a218b000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f99a1e09000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f99a1bf1000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f99a19d1000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f99a160e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f99a344d000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f99a1405000)

rob-p commented

@mathog — thanks for the detailed report. @mohsenzakeri, can you please take a look at this? The easiest way may be to use a CentOS 8 container to get the same environment and build an MWE that we can play with.

Yes @rob-p, I'll be looking into this now.

Hi @mathog,
Thanks again for providing the detailed report of the problem you were facing.

I followed all the commands you provided here to build a pufferfish binary from scratch in a CentOS 8 container created by "docker pull centos" (this creates a container with the following version: CentOS Linux release 8.1.1911).
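
Roughly, the environment was set up like this (a sketch; the package list below shows typical build prerequisites rather than an exact record of what was installed in the container):

  # rough sketch of the container environment; package names are the usual prerequisites
  docker pull centos                 # resolved to CentOS Linux release 8.1.1911 at the time
  docker run -it centos /bin/bash
  # then, inside the container, install the build tools and common -devel packages:
  dnf install -y gcc-c++ make cmake git zlib-devel bzip2-devel xz-devel libcurl-devel
  dnf install -y epel-release && dnf install -y jemalloc-devel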

It seems that, at the end of the process, the binary located at "$TOPDIR/bin/pufferfish" works fine. It builds the index on the toy example you provided in this post as well. Are you running the binary at this location too?

It does the same thing no matter where I run it. Interestingly, it runs under valgrind, but valgrind also reports some errors, which are presumably what cause the segfault when the binary runs on its own. I rebuilt it like this (accidentally overwriting yesterday's log files) to try to get some line numbers in the valgrind and gdb output:

  mkdir build
  cd build
  export CXXFLAGS="-g -O0"
  export CFLAGS="-g -O0"
  cmake \
    -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
    -DCMAKE_INSTALL_PREFIX=$TOPDIR \
    .. | tee cmake_2020_06_09.log
  make 2>&1 | tee build_2020_06_09.log
  cd ../bin
  valgrind  --track-origins=yes ./pufferfish index -r /tmp/some.fasta -o /tmp >/tmp/vg_puff.log 2>&1

Unfortunately not too many line numbers resulted. At least the routine names are visible. See the attached file.
vg_puff.log

I'm using a fairly old machine with only 8GB of RAM for this (it was a spare compute node I kept for parts but took home to work on during the epidemic). The processor is "Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz". I suppose something about the pipeline might be causing the difference between a crash on this machine and no crash on a newer one. There are no "-march" flags, so it isn't compiling for something incompatible, and I verified that all the -msse* options called for are supported by that CPU.
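
(For the record, that check was along these lines; the exact commands here are an illustrative sketch rather than a transcript.)

  # sketch: list the SIMD-related flags this CPU reports, then the -m* flags requested in the build log
  grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse|ssse|popcnt|avx)' | sort -u
  grep -oE -- '-m(sse|ssse|popcnt|avx)[^ ]*' build_2020_06_09.log | sort -u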

There are some lines like:

g++ -DHAVE_CONFIG_H -I. -I.. -I../ -I../htslib -Wno-sign-compare -g -Wno-unknown-pragmas '-std=c++14' -MT libseqlib_a-BamRecord.o -MD -MP -MF .deps/libseqlib_a-BamRecord.Tpo -c -o libseqlib_a-BamRecord.o `test -f 'BamRecord.cpp' || echo './'`BamRecord.cpp
which have an extra set of single quotes around the -std= clause. Probably that is harmless, but I do not recall seeing that before.

In any case, I suspect that those valgrind-detected errors are the problem, and that they only cause segfaults on certain machines due to CPU or other differences. This is probably also why salmon blows up on this machine, as the valgrind messages for it cover many of the same sections of code (and, for some reason, include more line numbers):

valgrind --track-origins=yes /usr/common/src/salmon-1.2.1/build/src/salmon index -t transcripts.fasta -i sample_salmon_fmd_index --type puff >/tmp/vg_salmon.log 2>&1
vg_salmon.log

rob-p commented

Hi @mathog,

These logs are very useful; thank you! Two quick questions (we will look into the log). (1) Do the pre-compiled versions of salmon run on your machine? Specifically, either this binary or the bioconda binary. (2) How were you able to run valgrind successfully? There is a specific (stated) incompatibility between jemalloc and valgrind (see e.g. here). This is the reason we can't reliably use valgrind when linking with jemalloc (though we do debug using valgrind by explicitly changing the CMake configuration to not link jemalloc).

Thanks!
Rob

The precompiled salmon binary does work. I believe the problem with valgrind and jemalloc concerns finding leaks when the program exits, but that isn't what it showed here; rather, it showed an access violation or use of an uninitialized variable. The precompiled binary has the same (or at least similar) valgrind messages. Whether it explodes or not would depend on what happens to be at the improperly accessed locations, and that need not be the same for the two binaries.

valgrind --track-origins=yes /usr/common/src/salmon-latest_linux_x86_64/bin/salmon index -t transcripts.fasta -i sample_salmon_fmd_index --type puff >/tmp/vg_salmon_precompiled.log 2>&1

vg_salmon_precompiled.log

rob-p commented

I believe the problem with valgrind and jemalloc concerns finding leaks when the program exits, but that wasn't what it showed here, rather an access violation or use of an uninitialized variable.

I agree that the logs here show access to an uninitialized variable. That may, indeed, be happening (and we will look into this ASAP). I also agree this is an issue that would depend on program layout, and on what is in the relevant locations in memory when the program runs. However, my understanding from the relevant issues with jemalloc and valgrind is that they are simply incompatible in general (not just for leak checking). Specifically, since both jemalloc and valgrind replace the default system allocator calls, you cannot reliably use valgrind on an executable compiled and linked against jemalloc and expect correct results. We've run into this when doing memory validation ourselves, where we've had to compile without jemalloc just to get valgrind to not immediately throw an error and kill the process.

Hi again @mathog,

Thanks again for the detailed report of the problem you are experiencing.
Could you please provide us with the specific fasta file you are using to create these logs too? That way we can more closely track this issue in pufferfish.

Thanks,
Mohsen

transcripts.fasta is from the sample_data.tgz in the salmon distribution. Running salmon the same way but with the "some.fasta" from the first post also produces plenty of valgrind errors, though not exactly the same ones. So I don't think the sequence is the problem.

Yes, it is possible that the valgrind messages are just noise due to interactions with jemalloc. But since the binaries built here do segfault when run normally or in gdb, there must be something wrong with them which is in some way memory related.

It looked like the source code for salmon only used malloc, realloc, and calloc, and not any of the other jemalloc-specific functions. So I relinked it without the jemalloc library (well, more or less: the usual link command, but minus "/usr/lib64/libjemalloc.so"):

cd /usr/common/src/salmon-1.2.1/build/src
/usr/bin/c++  -g -O0 -O3 -DNDEBUG -flto -fno-fat-lto-objects   CMakeFiles/salmon.dir/EMUtils.cpp.o CMakeFiles/salmon.dir/CollapsedEMOptimizer.cpp.o CMakeFiles/salmon.dir/CollapsedCellOptimizer.cpp.o CMakeFiles/salmon.dir/CollapsedGibbsSampler.cpp.o CMakeFiles/salmon.dir/Salmon.cpp.o CMakeFiles/salmon.dir/BuildSalmonIndex.cpp.o CMakeFiles/salmon.dir/Graph.cpp.o CMakeFiles/salmon.dir/DedupUMI.cpp.o CMakeFiles/salmon.dir/Alevin.cpp.o CMakeFiles/salmon.dir/AlevinHash.cpp.o CMakeFiles/salmon.dir/SalmonAlevin.cpp.o CMakeFiles/salmon.dir/WhiteList.cpp.o CMakeFiles/salmon.dir/SalmonQuantify.cpp.o CMakeFiles/salmon.dir/FragmentLengthDistribution.cpp.o CMakeFiles/salmon.dir/FragmentStartPositionDistribution.cpp.o CMakeFiles/salmon.dir/GZipWriter.cpp.o CMakeFiles/salmon.dir/SalmonQuantMerge.cpp.o CMakeFiles/salmon.dir/ProgramOptionsGenerator.cpp.o CMakeFiles/salmon.dir/FASTAParser.cpp.o CMakeFiles/salmon.dir/AlignmentModel.cpp.o CMakeFiles/salmon.dir/SalmonQuantifyAlignments.cpp.o CMakeFiles/salmon.dir/BAMUtils.cpp.o  -o salmon   -L/usr/common/src/salmon-1.2.1/lib  -L/usr/common/src/salmon-1.2.1/external/install/lib  -L/usr/lib64/boost169  -Wl,-rpath,"\$ORIGIN/../lib:\$ORIGIN/../../lib:\$ORIGIN/:\$ORIGIN/../../external/install/lib" ../external/pufferfish/src/libpuffer.a libsalmon_core.a ../external/pufferfish/external/twopaco/graphconstructor/libtwopaco.a ../external/pufferfish/external/twopaco/graphdump/libgraphdump.a ../external/pufferfish/external/ntcard/libntcard.a -lgff -lboost_iostreams -lboost_filesystem -lboost_system -lboost_timer -lboost_chrono -lboost_program_options /usr/common/modules/el8/x86_64/software/io_lib/1.14.9-CentOS-vanilla/lib/libstaden-read.a /usr/lib64/libcurl.so /usr/lib64/libz.so -lm /usr/lib64/liblzma.so /usr/lib64/libbz2.so /usr/common/modules/el8/x86_64/software/libtbb/2020.5-CentOS-vanilla/lib/libtbbmalloc_proxy.so /usr/common/modules/el8/x86_64/software/libtbb/2020.5-CentOS-vanilla/lib/libtbbmalloc.so /usr/common/modules/el8/x86_64/software/libtbb/2020.5-CentOS-vanilla/lib/libtbb.so -lgomp -lrt ../external/pufferfish/src/libksw2pp.a libalevin_core.a -ldl -pthread 
ldd salmon | grep libjemalloc
#not linked to libjemalloc.so
cp salmon /tmp

and then ran it, this time without a segfault:


cd /tmp/salmon/data
../salmon  index -t transcripts.fasta -i sample_salmon_fmd_index --type puff 

It worked. That probably has nothing to do with jemalloc being absent; access violations that work in one place and not another will often flip between working and failing across rebuilds or relinks. So next:

valgrind --track-origins=yes ../salmon index -t transcripts.fasta -i sample_salmon_fmd_index --type puff >/tmp/vg_salmon_nojemalloc.log 2>&1

vg_salmon_nojemalloc.log

Valgrind flagged the same lines in that log as in this one (with jemalloc):

vg_salmon.log

rob-p commented

Hi @mathog,

Thanks again for all of the detailed information. This has helped us track down the source of these uninitialized reads. We have eliminated them, and valgrind now runs cleanly (with 0 reported errors) for pufferfish index. Interestingly, seeing where the uninitialized reads came from, I'm surprised that they would cause a segfault (I understand that this might happen, because reading from uninitialized memory is undefined behavior in C++, so all bets are off). However, none of these were out-of-bounds reads, and no logical decision was made on the basis of the uninitialized values.

In one case, part of a byte at the end of a compact_vector that was uninitialized could be written out (serialized). All of the values in the vector were valid and written, but because it stores elements of size < 1 byte, it is possible that the remaining part of the last byte was allocated and not written prior to being read. This was the cause of the 2 valgrind errors early in the trace (in the serialize function). The valgrind errors at the end of the trace were due to a write operation on a thread-safe compact_vector, which uses an atomic compare-and-swap to write the values in the vector. Here, every entry in the vector is written to, but the compare-and-swap must obviously read the existing value to compare it and do the swap if it finds what it is looking for. So, you cannot technically "write" to an uninitialized thread-safe compact_vector using the built-in accessors or iterators without reading from it. Anyway, both of these errors were avoided by ensuring the memory is zero-initialized after allocation.

When you have a chance, could you please pull from the develop branch and let us know if you are able to build an executable that works properly on your machine, or at least if you are able to get farther than indexing? Note: If you test salmon, you will have to pull from the develop branch there too to get the develop branch of pufferfish, and you also have to clear the CMakeCache.txt to force salmon to re-download the pufferfish source it uses internally.
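
Something along these lines should be all that is needed (a sketch; the directory names are whatever you used for your existing clones):

  # sketch only; adjust paths to your existing pufferfish/salmon clones
  cd /usr/common/src/pufferfish-1.0.0
  git fetch origin && git checkout develop && git pull
  cd build && cmake -DCMAKE_INSTALL_PREFIX=$TOPDIR .. && make
  # for salmon: check out its develop branch, then clear the cache so it
  # re-downloads the develop pufferfish source it builds internally
  # cd /usr/common/src/salmon-1.2.1/build && rm -f CMakeCache.txt && cmake .. && make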

Rebuilt from the develop branch. Tested with:

/usr/common/src/pufferfish-1.0.0d/build/src/pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar

Segfaults. Relinked pufferfish without jemalloc.so (removed the two jemalloc.so entries from the linker command line) to make pufferfish_nojemalloc. Tested that:

/usr/common/src/pufferfish-1.0.0d/build/src/pufferfish_nojemalloc index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar

Ran normally. Then ran "valgrind --track-origins=yes" on both versions. BOTH ran to completion and neither reported any errors. (Logs attached.)
vg_pufferfish_nojemalloc.log
vg_pufferfish_jemalloc.log

Finally, I ran the jemalloc version in gdb:

gdb --args /usr/common/src/pufferfish-1.0.0d/build/src/pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/common/src/pufferfish-1.0.0d/build/src/pufferfish...done.
(gdb) r
Starting program: /home/common/src/pufferfish-1.0.0d/build/src/pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-72.el8_1.1.x86_64
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6f1a2ab in je_tcache_bin_flush_small () from /lib64/libjemalloc.so.2
Missing separate debuginfos, use: yum debuginfo-install jemalloc-5.2.1-2.el8.x86_64 libgcc-8.3.1-4.5.el8.x86_64 libstdc++-8.3.1-4.5.el8.x86_64 tbb-2018.2-9.el8.x86_64 zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0 0x00007ffff6f1a2ab in je_tcache_bin_flush_small () from /lib64/libjemalloc.so.2
#1 0x00007ffff6eb972e in je_sdallocx_default () from /lib64/libjemalloc.so.2
#2 0x000000000042b19d in std::_Function_base::_Base_managerclipp::parameter::predicate_adapter::_M_destroy (
__victim=...) at /usr/include/c++/8/bits/std_function.h:257
#3 std::_Function_base::_Base_managerclipp::parameter::predicate_adapter::_M_manager (__dest=..., __source=...,
__op=__op@entry=std::__destroy_functor) at /usr/include/c++/8/bits/std_function.h:212
#4 0x000000000042e662 in std::_Function_base::~_Function_base (this=0x7ffff5d2edc0, __in_chrg=)
at /usr/include/c++/8/bits/std_function.h:257
#5 std::function<clipp::subrange (std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)>::~function() (this=0x7ffff5d2edc0, __in_chrg=) at /usr/include/c++/8/bits/std_function.h:370
#6 clipp::parameter::~parameter (this=0x7ffff5d2ed08, __in_chrg=)
at /usr/common/src/pufferfish-1.0.0d/include/clipp.h:1882
#7 clipp::group::child_t<clipp::parameter, clipp::group>::destroy_content (this=0x7ffff5d2ed08)
at /usr/common/src/pufferfish-1.0.0d/include/clipp.h:2750
#8 clipp::group::child_t<clipp::parameter, clipp::group>::~child_t (this=0x7ffff5d2ed08, __in_chrg=)
at /usr/common/src/pufferfish-1.0.0d/include/clipp.h:2642
#9 std::_Destroy<clipp::group::child_t<clipp::parameter, clipp::group> > (__pointer=0x7ffff5d2ed08)
at /usr/include/c++/8/bits/stl_construct.h:98
#10 std::_Destroy_aux::__destroy<clipp::group::child_t<clipp::parameter, clipp::group>> (
__last=, __first=0x7ffff5d2ed08) at /usr/include/c++/8/bits/stl_construct.h:108
#11 std::_Destroy<clipp::group::child_t<clipp::parameter, clipp::group>
> (__last=,
__first=) at /usr/include/c++/8/bits/stl_construct.h:137
#12 std::_Destroy<clipp::group::child_t<clipp::parameter, clipp::group>, clipp::group::child_t<clipp::parameter, clip--Type for more, q to quit, c to continue without paging--
p::group> > (__last=0x7ffff5d2ee10, __first=) at /usr/include/c++/8/bits/stl_construct.h:206
#13 std::vector<clipp::group::child_t<clipp::parameter, clipp::group>, std::allocator<clipp::group::child_t<clipp::parameter, clipp::group> > >::~vector (this=0x7ffff5d61730, __in_chrg=)
at /usr/include/c++/8/bits/stl_vector.h:567
#14 clipp::group::~group (this=0x7ffff5d61708, __in_chrg=)
at /usr/common/src/pufferfish-1.0.0d/include/clipp.h:2576
#15 0x000000000042e900 in clipp::group::child_t<clipp::parameter, clipp::group>::destroy_content (
this=0x7ffff5d61708) at /usr/include/c++/8/ext/new_allocator.h:86
#16 clipp::group::child_t<clipp::parameter, clipp::group>::~child_t (this=0x7ffff5d61708, __in_chrg=)
at /usr/common/src/pufferfish-1.0.0d/include/clipp.h:2642
#17 std::_Destroy<clipp::group::child_t<clipp::parameter, clipp::group> > (__pointer=0x7ffff5d61708)
at /usr/include/c++/8/bits/stl_construct.h:98
#18 std::_Destroy_aux::__destroy<clipp::group::child_t<clipp::parameter, clipp::group>
> (
__last=, __first=0x7ffff5d61708) at /usr/include/c++/8/bits/stl_construct.h:108
#19 std::_Destroy<clipp::group::child_t<clipp::parameter, clipp::group>> (__last=,
__first=) at /usr/include/c++/8/bits/stl_construct.h:137
#20 std::_Destroy<clipp::group::child_t<clipp::parameter, clipp::group>
, clipp::group::child_t<clipp::parameter, clipp::group> > (__last=0x7ffff5d62050, __first=) at /usr/include/c++/8/bits/stl_construct.h:206
#21 std::vector<clipp::group::child_t<clipp::parameter, clipp::group>, std::allocator<clipp::group::child_t<clipp::parameter, clipp::group> > >::~vector (this=0x7fffffff9f68, __in_chrg=)
at /usr/include/c++/8/bits/stl_vector.h:567
#22 clipp::group::~group (this=0x7fffffff9f40, __in_chrg=)
at /usr/common/src/pufferfish-1.0.0d/include/clipp.h:2576
--Type for more, q to quit, c to continue without paging--
#23 0x000000000041b075 in main (argc=6, argv=0x7fffffffd788)
at /usr/common/src/pufferfish-1.0.0d/include/clipp.h:1544

I am trying to obtain the src.rpm for jemalloc to rebuild it. There seems to be something wrong with it (on my test system anyway).

rob-p commented

Thank you for the detailed new report @mathog! So, this is behavior we have seen before. On our builds it's sort of rare, but every once in a while we will get a jemalloc segfault from just that function. Every time we attempt to link without jemalloc to use valgrind (or the address and memory sanitizers), we get a clean profile. I am often reluctant to point a finger at a widely-used library like jemalloc, but I have been unable to reproduce such issues under the system allocator, or valgrind, or even other multithreaded allocators like tcmalloc.

Clearly something strange is going on with this jemalloc. Today I rebuilt it and reinstalled it:

#as root
dnf download --source jemalloc
rpmbuild --rebuild jemalloc-5.2.1-2.el8.src.rpm 2>&1 | tee jemalloc.log
cd ~/rpmbuild/RPMS/x86_64
rpm -i --force jemalloc-5.2.1-2.el8.x86_64.rpm jemalloc-devel-5.2.1-2.el8.x86_64.rpm python3-tbb-2018.2-9.el8.x86_64.rpm tbb-2018.2-9.el8.x86_64.rpm tbb-devel-2018.2-9.el8.x86_64.rpm

Then reran the test case:

#as modules (normal user, owner of the binary)
 /usr/common/src/pufferfish-1.0.0d/build/src/pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar
#segfaults, gdb rerun has same backtrace as before

But the rebuild also made the debug RPMS, so

#as root
rpm -i jemalloc-debuginfo-5.2.1-2.el8.x86_64.rpm jemalloc-debugsource-5.2.1-2.el8.x86_64.rpm

# as modules, rerun gdb

$ /usr/common/src/pufferfish-1.0.0d/build/src/pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar
Segmentation fault (core dumped)
[modules@poweredge tmp]$ gdb --args /usr/common/src/pufferfish-1.0.0d/build/src/pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/common/src/pufferfish-1.0.0d/build/src/pufferfish...done.
(gdb) r
Starting program: /home/common/src/pufferfish-1.0.0d/build/src/pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-72.el8_1.1.x86_64
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

[5]+  Stopped                 gdb --args /usr/common/src/pufferfish-1.0.0d/build/src/pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar

That's right, adding the debug information resulted in gdb itself vaporizing during the run. That is not the sign of healthy software!

Let's revisit the CentOS container example from above. Where did the jemalloc used in that case come from? It isn't a standard part of the operating system; mine came from the EPEL 8 repository. Perhaps the one in the container was a different version of jemalloc? This is what mine looks like (after installing the rebuild):

rpm -qi jemalloc
Name        : jemalloc
Version     : 5.2.1
Release     : 2.el8
Architecture: x86_64
Install Date: Fri 12 Jun 2020 08:55:01 AM PDT
Group       : Unspecified
Size        : 871230
License     : BSD
Signature   : (none)
Source RPM  : jemalloc-5.2.1-2.el8.src.rpm
Build Date  : Fri 12 Jun 2020 08:49:04 AM PDT
Build Host  : poweredge.cluster
Relocations : (not relocatable)
URL         : http://www.canonware.com/jemalloc/
Summary     : General-purpose scalable concurrent malloc implementation
Description :
General-purpose scalable concurrent malloc(3) implementation.
This distribution is the stand-alone "portable" implementation of jemalloc.

dnf -y reinstall jemalloc jemalloc-devel
rpm -qi jemalloc
Name        : jemalloc
Version     : 5.2.1
Release     : 2.el8
Architecture: x86_64
Install Date: Fri 12 Jun 2020 09:21:08 AM PDT
Group       : Unspecified
Size        : 780757
License     : BSD
Signature   : RSA/SHA256, Wed 02 Oct 2019 05:29:23 PM PDT, Key ID 21ea45ab2f86d6a1
Source RPM  : jemalloc-5.2.1-2.el8.src.rpm
Build Date  : Wed 02 Oct 2019 05:41:14 AM PDT
Build Host  : buildhw-08.phx2.fedoraproject.org
Relocations : (not relocatable)
Packager    : Fedora Project
Vendor      : Fedora Project
URL         : http://www.canonware.com/jemalloc/
Bug URL     : https://bugz.fedoraproject.org/jemalloc
Summary     : General-purpose scalable concurrent malloc implementation
Description :
General-purpose scalable concurrent malloc(3) implementation.
This distribution is the stand-alone "portable" implementation of jemalloc.
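
(As an aside, an easy way to see which repository an installed package came from, for comparison with the container, is something like the following sketch; it is not part of the transcript above.)

  # illustrative: the repo of origin shows up in the right-hand column (e.g. @epel)
  dnf list installed jemalloc jemalloc-devel
  dnf repolist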

In the interests of finally being done with this (without having to wait for some obscure jemalloc issue to be hunted down at those developers' leisure), is there some world-shattering problem that will occur when using pufferfish/salmon without jemalloc? I know these programs do not build that way by default, but it is easy enough to do so by hand, and doing that has the distinct advantage that the programs will actually run on this system!

Also, I just double-checked, and the salmon binary your lab distributes is not actually linked against jemalloc; neither are any of the libraries which ship with it.

rob-p commented

@mathog,

If built with just the system malloc, the programs will run fine (i.e., they will produce the correct output). The main reason we link against jemalloc is that both pufferfish and salmon have embarrassingly parallel phases, and so they both benefit greatly from being optimized for multithreaded execution. In that context, scalable memory allocation becomes a bottleneck, so I would expect the performance profile to change when leaving out jemalloc. We've not tested that configuration extensively, so I can't give you a good estimate of how much. Alternatively, there are other "drop-in" replacements for malloc and free that you could substitute, like tcmalloc or mimalloc (which actually looks to perform better than jemalloc under most metrics). However, these are all runtime differences, which are secondary to having the program actually run!
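
(For example, one low-effort way to try such a replacement without relinking is to preload it; the library paths below are illustrative and depend on where tcmalloc or mimalloc is installed on your system.)

  # illustrative: override malloc/free at runtime with a different drop-in allocator
  LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 pufferfish index -r transcripts.fasta -o /tmp/pufftest
  LD_PRELOAD=/usr/local/lib/libmimalloc.so pufferfish index -r transcripts.fasta -o /tmp/pufftest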

rob-p commented

@mathog,

P.S. As you can see, I reported this upstream to the jemalloc folks. One suggestion they have is to compile jemalloc with --enable-debug passed to configure. We'll try this on our end, but we're not getting the segfault right now, so that might not be helpful.
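
If you'd like to try that on your end as well, here is a sketch of such a build (the version matches the RPM you rebuilt; the prefix and preload path are just examples):

  # sketch: build a debug jemalloc alongside the packaged one
  tar xjf jemalloc-5.2.1.tar.bz2    # release tarball from the jemalloc GitHub releases page
  cd jemalloc-5.2.1
  ./configure --enable-debug --prefix=/opt/jemalloc-debug
  make -j4 && make install
  # then point the failing test at it, e.g.:
  # LD_PRELOAD=/opt/jemalloc-debug/lib/libjemalloc.so.2 pufferfish index -r /tmp/sample_data/transcripts.fasta -o /tmp/woobar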

The releases for pufferfish and salmon have not been updated since the nature of the jemalloc/libtbb problem was identified. Are there fixed versions in one of the branches on github?

rob-p commented

Yes, the fix is on develop in both. The next salmon release (1.3.0) has a few other improvements and fixes and will likely be out late this week or next week.

Pufferfish's develop branch compiled (with warnings) and the resulting binary worked. No segfault, even though it is linked to both libjemalloc and libtbbmalloc_proxy.
Test was:

pufferfish index -r transcripts.fasta -o /tmp/pufftest

This patch eliminates the compiler warnings outside of htslib:

patch.txt

It is hard to apply, though, because some of the patched files don't appear until during the build (the external files are downloaded then). So this is how I had to build it:

  package=pufferfish #from current_version.txt in download
  pversion=1.0.0d
  TOPDIR=/usr/common/modules/el8/x86_64/software/${package}/${pversion}-CentOS-vanilla
  THELUA=/usr/common/modules/el8/x86_64/modules/all/${package}/${pversion}-CentOS-vanilla.lua
  cd /usr/common/src
  git clone -b develop https://github.com/COMBINE-lab/pufferfish.git
  mv ${package} ${package}-${pversion}
  cd ${package}-${pversion}
  cp /tmp/patch.txt . #patch removes some compiler warnings
  mkdir build
  cd build
  # module load htslib #ignored, it builds its own anyway
  cmake \
    -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
    -DCMAKE_INSTALL_PREFIX=$TOPDIR \
    .. | tee cmake_2020_06_23.log
  make 2>&1 | tee build_2020_06_23.log
  cd ..
  patch -p0 <patch.txt
  cd build
  make 2>&1 | tee build_2020_06_23B.log
  make install 2>&1 | tee install_2020_06_23.log
  #some parts not installed into target
  mkdir -p $TOPDIR/bin
  cd src
  mv bcalm_pufferize cedar edgedensity edgedensity2 \
    filtersam krakmap pufferfish  $TOPDIR/bin
  mv lib*.a $TOPDIR/lib64
  chmod 644 $TOPDIR/lib64/*
  cd ../..
  chmod 755 bin/fixFasta
  mv bin/fixFasta   $TOPDIR/bin
  module_generate_from_directory.sh \
    $package \
    $pversion \
    CentOS/vanilla \
    $TOPDIR \
    "Mime and memory-efficient data structure for indexing a compacted, colored de Bruijn graph (ccdBG)." \
    "https://github.com/COMBINE-lab/pufferfish"
  module load pufferfish
  cd /tmp
  mkdir pufftest
  pufferfish index -r transcripts.fasta -o /tmp/pufftest
  #worked!

Note that "make install" did not actually install the binaries into the target, that had to be done later.

There were many compiler warnings in htslib (I fixed one and then decided to stop). When htslib 1.10 is built on this system (for samtools) it compiles cleanly; the htslib bundled with pufferfish seems to be older. Perhaps at some point it can be updated? That would remove the remaining compiler warnings.

Here are the log files from the build:
cmake_2020_06_23.log
build_2020_06_23.log
build_2020_06_23B.log

rob-p commented

Thanks for the patch and the detailed logs. The version of HTSlib is due to the recursive dependency on SeqLib. The "all" target actually builds a lot more than just pufferfish. If you issue "make pufferfish", it will just build the pufferfish executable, which can do the indexing and mapping. The other executables consist of some auxiliary processing utilities and code for doing some taxonomic abundance estimation (this is where the libSetCover code comes in).
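
For example (a sketch, reusing the build directory from your steps above; the log name is just illustrative):

  cd /usr/common/src/pufferfish-1.0.0d/build
  make pufferfish 2>&1 | tee build_pufferfish_only.log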

Regarding applying the patch, there is an option in CMake to perform a "patch" step on a downloaded and decompressed file. Perhaps it would fit in there.