odelaneau/shapeit4

shapeit fails with error message - Segmentation fault (core dumped)

Opened this issue · 26 comments

I'm trying to phase a vcf with 983 exome samples.

The command I ran -

/mnt/exome/Softwares/shapeit4/bin/shapeit4.2 --input MOD_hg19.vcf.gz --map /mnt/exome/Softwares/shapeit4/maps/chr"$chr".b37.gmap.gz --region "$chr" --output chr"$chr".phased.vcf.gz

I am looping over all chromosomes, it seems like the error occurs for all chromosomes.

The error -

SHAPEIT
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : olivier.delaneau@gmail.com
  * Version       : 4.2.0
  * Run date      : 19/01/2021 - 17:25:51

Files:
  * Input VCF     : [MOD_hg19.vcf.gz]
  * Genetic Map   : [/mnt/exome/Softwares/shapeit4/maps/chr1.b37.gmap.gz]
  * Output VCF    : [chr1.phased.vcf.gz]

Parameters:
  * Seed    : 15052011
  * Threads : 10 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : Depth of PBWT neighbours to condition on: 4
  * PBWT    : Store indexes at variants [MAC>=2 / MDR<=0.5 / Dist=0.02 cM]
  * HMM     : K is variable / min W is 2.50cM / Ne is 15000
  * HMM     : Recombination rates given by genetic map
  * HMM     : AVX2 optimization active

Initialization:
  * VCF/BCF scanning [N=983 / L=175794 / Reg=1] (44.73s)
  * VCF/BCF parsing [Hom=53.5% / Het=3.2% / Mis=43.3%] (48.68s)
  * GMAP parsing [n=256895] (0.26s)
  * cM interpolation [s=16794 / i=159000] (0.03s)
  * Region length [249218770 bp / 286.3 cM]
  * HMM parameters [Ne=15000 / Error=0.0001 / #rare=45428]
  * PBWT indexing [l=3228] (0.01s)
  * HAP update (1.17s)
  * H2V transpose (0.23s)
  * PBWT phase sweep (12.61s)
  * Build genotype graphs [seg=1869070] (0.55s)

Burn-in iteration [1/5]
  * V2H transpose (0.36s)
  * PBWT selection (1.35s)
  * C2H transpose (0.14s)
  * HMM computations [K=185.689+/-115.639 / W=5.16Mb] (225.76s)
Segmentation fault (core dumped)

What could be the issue?

Hi,

My bad. I forgot to disable a mutex when using --thread 1 in the pre-release of v4.2.0. Bug is now fixed.

Alternatively, you can use --thread T with T>1 to make the previous version run.

Best,

Olivier.

Hay,

In the command which I had posted, I had forgot to include that I had indeed used the argument --thread 10, as can be seen from the Threads count in parameters section.

I still get the same error. I get the same error even if I don't specify the threads. I tried different values like 4 and 16 but I still get the same error.

Okay, so the issue is more complex than what I thought.

Can u send me a small dataset reproducing the error on my email?

A particularity I saw in your data is the super high missing data rate ~43.3%. Would be good to know why you've got this.

A particularity I saw in your data is the super high missing data rate ~43.3%. Would be good to know why you've got this.

We have ~983 samples from 5-6 different exome capture kits, ranging from NextEra to Agilent CREv2, so it could explain the high missing data rate.

I will get a subset of the data and try to share it with you.

Hay, I've shared the chr 1 file with you, do let me know if anything else is needed.

I have the same issue and I know at least another researcher experiencing the same segfault. I would be happy to test the code with my pipeline if given access to a development version of the code.

Hi,

Thanks for the data set you sent.

I tried multiple times with the last version v4.2, changing some parameters, and I cannot reproduce your segfault. The job completes each time I run it. I really do not understand what is going on ! What gcc version did you use to compile? I'll try with it. In my case, I used v7.5.0

On another note, I looked at your data and hereafter some comments I wanted to make:

  • Huge amount of missing data (>40%) concentrated only some variants with >90% missing. I strongly recommend to use a reference panel here, might greatly help.
  • Many genotypes are directly called from low depth (coverage < 4x). You should expect high error rates here.
  • At some variants, it seems to me that only het or hom_alt genotypes are called and not hom_ref. You should double check those.

Best,

Olivier.

Edit: I'll add static binaries available very soon on github for testing purposes.

Thanks for the reply, I actually installed shapeit4 from Conda. I think I might try to compile it myself and check it once.

As for the comments on the data,

  • Huge amount of missing data (>40%) concentrated only some variants with >90% missing. I strongly recommend to use a reference panel here, might greatly help. NOTE: There are samples here from multiple capture kits, ranging from smaller NextEra to more comprehensive CREV2. This might explain the imbalance in data. Classifying and creating dataset based on capture kits might be an option but with the large cohort in hand that might increase load during analysis. Might also present problems when comparing variants between samples sequenced with different capture kits.
  • Many genotypes are directly called from low depth (coverage < 4x). You should expect high error rates here. NOTE: We are using padded bed files to ensure no variants are missed due to breadth of coverage, this could be a side-effect from that. I might apply a cut-off of 10x or 15x, let's see
  • At some variants, it seems to me that only het or hom_alt genotypes are called and not hom_ref. You should double check those. NOTE: They could be population specific variants.

I have the same problem with shapeit4.2 compiled with gcc version 10.2 using Ubuntu 20.10 and Ubuntu 21.04

You can reproduce the segfault with the HapMap data at this temporary URL:

https://www.dropbox.com/sh/35jja570f1n3bui/AACwSJGeZg-DDxvgFQdbGr9Ua?dl=0

The link contains also a dockerfile to recreate an environment that reproduces the error

$ shapeit4.2 --input input.bcf --reference ref.bcf --map genetic_map.txt --region chr6 --output output.bcf

SHAPEIT
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : olivier.delaneau@gmail.com
  * Version       : 4.2.0
  * Run date      : 02/03/2021 - 10:47:30

Files:
  * Input VCF     : [input.bcf]
  * Reference VCF : [ref.bcf]
  * Genetic Map   : [genetic_map.txt]
  * Output VCF    : [output.bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 1 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : Depth of PBWT neighbours to condition on: 4
  * PBWT    : Store indexes at variants [MAC>=2 / MDR<=0.5 / Dist=0.02 cM]
  * HMM     : K is variable / min W is 2.50cM / Ne is 15000
  * HMM     : Recombination rates given by genetic map
  * HMM     : AVX2 optimization active

Initialization:
  * VCF/BCF scanning [Nm=40 / Nr=3202 / L=8806 / Reg=chr6] (28.80s)
  * VCF/BCF parsing [Hom=69.4% / Het=30.5% / Mis=0.1%] (30.46s)
  * GMAP parsing [n=224546] (0.18s)
  * cM interpolation [s=7575 / i=1231] (0.01s)
  * Region length [44334760 bp / 70.2 cM]
  * HMM parameters [Ne=15000 / Error=0.0001 / #rare=45]
  * PBWT indexing [l=1952] (0.00s)
  * HAP update (0.00s)
  * H2V transpose (0.04s)
  * PBWT phase sweep (0.60s)
  * Build genotype graphs [seg=35791] (0.01s)

Burn-in iteration [1/5]
  * V2H transpose (0.00s)
  * PBWT selection (0.26s)
  * C2H transpose (0.00s)
  * HMM computations [K=112.776+/-97.535 / W=3.34Mb] (1.00s)
munmap_chunk(): invalid pointer
Aborted (core dumped)

Running with valgrind shows the following:

$ valgrind shapeit4.2 --thread 2 --input input.bcf --reference ref.bcf --map genetic_map.txt --region chr6 --output output.bcf

==1639510== Memcheck, a memory error detector
==1639510== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1639510== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==1639510== Command: shapeit4.2 --thread 2 --input input.bcf --reference ref.bcf --map genetic_map.txt --region chr6 --output output.bcf
==1639510== 

SHAPEIT
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : olivier.delaneau@gmail.com
  * Version       : 4.2.0
  * Run date      : 02/03/2021 - 10:43:49

Files:
  * Input VCF     : [input.bcf]
  * Reference VCF : [ref.bcf]
  * Genetic Map   : [genetic_map.txt]
  * Output VCF    : [output.bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 2 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : Depth of PBWT neighbours to condition on: 4
  * PBWT    : Store indexes at variants [MAC>=2 / MDR<=0.5 / Dist=0.02 cM]
  * HMM     : K is variable / min W is 2.50cM / Ne is 15000
  * HMM     : Recombination rates given by genetic map
  * HMM     : AVX2 optimization active

Initialization:
  * VCF/BCF scanning [Nm=40 / Nr=3202 / L=8806 / Reg=chr6] (705.77s)
  * VCF/BCF parsing [Hom=69.4% / Het=30.5% / Mis=0.1%] (875.84s)
  * GMAP parsing [n=224546] (5.77s)
  * cM interpolation [s=7575 / i=1231] (0.16s)
  * Region length [44334760 bp / 70.2 cM]
  * HMM parameters [Ne=15000 / Error=0.0001 / #rare=45]
  * PBWT indexing [l=1952] (0.02s)
  * HAP update (0.04s)
  * H2V transpose (0.57s)
  * PBWT phase sweep (19.03s)
  * Build genotype graphs [seg=35791] (0.25s)

Burn-in iteration [1/5]
  * V2H transpose (0.01s)
  * PBWT selection (11.50s)
  * C2H transpose (0.02s)
==1639510== Thread 3:[0%]
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x127A79: haplotype_segment_single::forward() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x1868E2: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x128CF3: haplotype_segment_single::forward() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x1868E2: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Use of uninitialised value of size 8
==1639510==    at 0x127DC0: haplotype_segment_single::forward() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x1868E2: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Use of uninitialised value of size 8
==1639510==    at 0x128027: haplotype_segment_single::forward() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x1868E2: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x12AA53: haplotype_segment_single::backward(std::vector<double, std::allocator<double> >&, std::vector<float, std::allocator<float> >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x186900: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Use of uninitialised value of size 8
==1639510==    at 0x129EC0: haplotype_segment_single::backward(std::vector<double, std::allocator<double> >&, std::vector<float, std::allocator<float> >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x186900: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x12AE0E: haplotype_segment_single::backward(std::vector<double, std::allocator<double> >&, std::vector<float, std::allocator<float> >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x186900: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x12AFEB: haplotype_segment_single::backward(std::vector<double, std::allocator<double> >&, std::vector<float, std::allocator<float> >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x186900: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Use of uninitialised value of size 8
==1639510==    at 0x12A197: haplotype_segment_single::backward(std::vector<double, std::allocator<double> >&, std::vector<float, std::allocator<float> >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x186900: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x129B2F: haplotype_segment_single::backward(std::vector<double, std::allocator<double> >&, std::vector<float, std::allocator<float> >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x186900: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Use of uninitialised value of size 8
==1639510==    at 0x12A273: haplotype_segment_single::backward(std::vector<double, std::allocator<double> >&, std::vector<float, std::allocator<float> >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x186900: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
==1639510== Thread 2:[2%]
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x12882C: haplotype_segment_single::forward() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x1868E2: phaser::phaseWindow(int, int) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18730A: phaseWindow_callback(void*) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x49D858F: start_thread (pthread_create.c:463)
==1639510==    by 0x4E57222: clone (clone.S:95)
==1639510== 
  * HMM computations [K=110.431+/-102.296 / W=3.26Mb] (157.69s)
==1639510== Thread 1:
==1639510== Use of uninitialised value of size 8
==1639510==    at 0x126FBF: std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::~basic_stringbuf() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510== 
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x126FCD: std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::~basic_stringbuf() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510== 
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x483DF75: operator delete(void*) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1639510==    by 0x126FD3: std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::~basic_stringbuf() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510== 
==1639510== Invalid free() / delete / delete[] / realloc()
==1639510==    at 0x483DFBF: operator delete(void*) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1639510==    by 0x126FD3: std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::~basic_stringbuf() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510==  Address 0x138e4a is in the Text segment of /home/genovese/bin/shapeit4.2
==1639510==    at 0x138E4A: haplotype_set::transposePBWTarrays() (in /home/genovese/bin/shapeit4.2)
==1639510== 
==1639510== Use of uninitialised value of size 8
==1639510==    at 0x126FDF: std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::~basic_stringbuf() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510== 
==1639510== Use of uninitialised value of size 8
==1639510==    at 0x4AADB7C: std::locale::~locale() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510== 
==1639510== Conditional jump or move depends on uninitialised value(s)
==1639510==    at 0x4AADB82: std::locale::~locale() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510== 
==1639510== Use of uninitialised value of size 8
==1639510==    at 0x4AADBA5: std::locale::~locale() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510== 
==1639510== Invalid read of size 4
==1639510==    at 0x4AADBA5: std::locale::~locale() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1639510== 
==1639510== 
==1639510== Process terminating with default action of signal 11 (SIGSEGV)
==1639510==  Access not within mapped region at address 0x0
==1639510==    at 0x4AADBA5: std::locale::~locale() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==1639510==    by 0x11BA15: haplotype_set::mergeIBD2constraints() [clone .cold] (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x18A530: phaser::phase() (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x174E18: phaser::phase(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (in /home/genovese/bin/shapeit4.2)
==1639510==    by 0x123964: main (in /home/genovese/bin/shapeit4.2)
==1639510==  If you believe this happened as a result of a stack
==1639510==  overflow in your program's main thread (unlikely but
==1639510==  possible), you can try to increase the size of the
==1639510==  main thread stack using the --main-stacksize= flag.
==1639510==  The main thread stack size used in this run was 8388608.
==1639510== 
==1639510== HEAP SUMMARY:
==1639510==     in use at exit: 21,615,416 bytes in 18,154 blocks
==1639510==   total heap usage: 3,074,393 allocs, 3,056,240 frees, 6,894,923,963 bytes allocated
==1639510== 
==1639510== LEAK SUMMARY:
==1639510==    definitely lost: 0 bytes in 0 blocks
==1639510==    indirectly lost: 0 bytes in 0 blocks
==1639510==      possibly lost: 0 bytes in 0 blocks
==1639510==    still reachable: 21,615,416 bytes in 18,154 blocks
==1639510==         suppressed: 0 bytes in 0 blocks
==1639510== Rerun with --leak-check=full to see details of leaked memory
==1639510== 
==1639510== Use --track-origins=yes to see where uninitialised values come from
==1639510== For lists of detected and suppressed errors, rerun with: -s
==1639510== ERROR SUMMARY: 1137193 errors from 21 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

@odelaneau sorry for my previous reply, I didn't install it using Conda, I compiled it in RHEL with gcc verison 8.3.1 20190311. So I don't think that's the cause of my error.

Pretty strange that I'm encountering the error for my dataset, I have tried at least 3-4 times but get same error.

The GCC optimizer has become far more aggressive in recent releases (well, maybe roughly at GCC 7 or so), hence my pull request (#48). It can seem like a nuisance that a bool method exits without returning, but the optimizer assumes that this means that the codepath will never be used. I cannot claim of course that this is the only source of errors, but I can tell that I got crashes and memory corruption when the return type was bool with recent GCCs and proper operation with return type void.

@cnettel your fix solved the issue for me. Thank you! The test case I presented now works and I was able to run multiple SHAPEIT4.2.0 tasks again without experiencing any segmentation faults over a large DNA microarray cohort. And the new version did cut phasing costs almost by a half. In case anybody is interested, I have used the following Dockerfile to run SHAPEIT4.2.0:

FROM ubuntu:21.04
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get -qqy update --fix-missing && \
    apt-get -qqy install --no-install-recommends \
                 wget \
                 g++ \
                 make \
                 libboost-iostreams-dev \
                 libboost-program-options-dev \
                 libhts-dev \
                 libbz2-dev \
                 libboost-iostreams1.74.0 \
                 libboost-program-options1.74.0 \
                 bcftools && \
    wget --no-check-certificate https://github.com/odelaneau/shapeit4/archive/v4.2.0.tar.gz && \
    tar xzf v4.2.0.tar.gz && \
    cd shapeit4-4.2.0 && \
    sed -i 's/^HTSLIB_INC=\$(HOME)\/Tools\/htslib-1.9$/HTSLIB_INC=-Ihtslib/' makefile && \
    sed -i 's/^HTSLIB_LIB=\$(HOME)\/Tools\/htslib-1.9\/libhts.a$/HTSLIB_LIB=-lhts/' makefile && \
    sed -i 's/^BOOST_LIB_IO=\/usr\/lib\/x86_64-linux-gnu\/libboost_iostreams.a$/BOOST_LIB_IO=-lboost_iostreams/' makefile && \
    sed -i 's/^BOOST_LIB_PO=\/usr\/lib\/x86_64-linux-gnu\/libboost_program_options.a$/BOOST_LIB_PO=-lboost_program_options/' makefile && \
    sed -i 's/^\tbool merge (const IBD2track \& rhs) {$/\tvoid merge (const IBD2track \& rhs) {/' src/containers/haplotype_set.h && \
    make && \
    mv bin/shapeit4.2 /usr/bin/ && \
    cd .. && \
    apt-get -qqy purge --auto-remove --option APT::AutoRemove::RecommendsImportant=false \
                 wget \
                 g++ \
                 make \
                 libboost-iostreams-dev \
                 libboost-program-options-dev \
                 libhts-dev \
                 libbz2-dev && \
    apt-get -qqy clean && \
    rm -rf v4.2.0.tar.gz \
           shapeit4-4.2.0 \
           /var/lib/apt/lists/*

Guys, thank you very much for your help, you saved me so much time!

So I compiled using g++9/g++10 and indeed got the segfault on kvn95ss's dataset.

Fixing the code given cnettel's suggestion solves the issue. Thanks!

I made a pre-release v4.2.1, I'd be grateful if some of you guys could double check that this fixes the previous issue.
If yes, I'll push it as new release.

Of note, I also removed the other htslib related warnings.

Best,

Thank you all for the help.

@odelaneau I'll compile the latest version and see if it sorts the issue. Will update in some time!

EDIT - I was able to compile with no issues, but for some reason it says it can't read the index file, even though I created one using GATK.

Initialization:
  * VCF/BCF scanning ...
ERROR: Problem opening index file for [MODGRCh38.vcf]

pbgzipping and indexing the vcf with tabix -p vcf gives me this error

Initialization:
  * VCF/BCF scanning ...
ERROR: No variants to be phased in [MOD_hg19.vcf.gz]

I have tested the v4.2.1 pre-release on a small cohort and it worked fine. It does still say 4.2.0 in src/phaser/phaser_parameters.cpp though.

@freeseek I updated the tag, thanks.

@kvn95ss
Error1: Seems that your file has not been BGZF compressed before indexing.
Error2: Argument given to shapeit4 option --region is probably incorrect. Typical mistake is --region 20 instead of --region chr20.

Hay @odelaneau

Thanks for that, I was able to sort out the errors. Now it runs on my data!

Hi everyone,
I am getting the same error ERROR: No variants to be phased in [input.vcf.gz]. Any thought about what could be causing this other than Error 1 and mentioned by @odelaneau?

Thank you.

@abedkurdi @odelaneau Any solution with this? I had a chr21 panel that was separated into two mutually exclusive panels, where the unphased panel consists of only 5 samples whereas the other set is used as a reference panel, (both panels have been bgzipped and indexed) but I still get a similar error ERROR: No variants to be phased in files. The reference panel is phased and is of file type (.vcf.gz). Although I tried using a (.bcf.gz) file type as well, but I got the same error. Both the reference panel and unphased panel have all the variant sites in common so not sure what I'm missing, any help would be appreciated.

Try to merge using bcftools merge -m none. The underlying procedure is the same than for shapeit. Just to check what is going on.

Hi @odelaneau, thank you for the quick reply. I was hoping to make use of the reference panel to see how the phasing accuracy would differ. I did run the bcftools merge -m none <ref-panel> <unphased-panel> and it worked fine. Did you want me to try and run ShapeIT on the merged panel that would have some unphased individuals and the rest phased individuals?

Hello @odelaneau,
I compiled shapeit4-4.2.2 with g++ (Ubuntu 11.2.0-19ubuntu1) 11.2.0. I want to shape some .bcf files but I also recieve as the thread opener the following output:

SHAPEIT

  • Author : Olivier DELANEAU, University of Lausanne
  • Contact : olivier.delaneau@gmail.com
  • Version : 4.2.2
  • Run date : 10/11/2022 - 14:39:39

Files:

  • Input VCF : [/home/oem/Documents/GitHub/Genomics_prac_guide/trio/name/name_trio_parents_1kgeu_chrom1.bcf]
  • Genetic Map : [/home/oem/Documents/GitHub/Genomics_prac_guide/map/genetic_maps.b38/chr1.b38.gmap.gz]
  • Output VCF : [/home/oem/Documents/GitHub/Genomics_prac_guide/trio/name/name_trio_chr1_phased.bcf]

Parameters:

  • Seed : 15052011
  • Threads : 4 threads
  • MCMC : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  • PBWT : Depth of PBWT neighbours to condition on: 4
  • PBWT : Store indexes at variants [MAC>=2 / MDR<=0.5 / Dist=0.02 cM]
  • HMM : K is variable / min W is 2.50cM / Ne is 15000
  • HMM : Recombination rates given by genetic map
  • HMM : AVX2 optimization inactive / Activating AVX2 substantially improves performance

Initialization:

  • VCF/BCF scanning ...
    WARNING: Population based phasing for less than 100 individuals is not recommended, use a reference panel to remove this warning!
  • VCF/BCF scanning [N=66 / L=403558 / Reg=1] (0.23s)
  • VCF/BCF parsing [Hom=2.7% / Het=2.1% / Mis=95.3%] (0.90s)
  • GMAP parsing [n=255802] (0.18s)
  • cM interpolation [s=138145 / i=265413] (0.07s)
  • Region length [248884057 bp / 286.3 cM]
  • HMM parameters [Ne=15000 / Error=0.0001 / #rare=69431]
    Segmentation fault (core dumped)

My system info is the following:
Linux version 5.15.0-50-generic (buildd@lcy02-amd64-086) (gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022

Is this still an open problem of shapeit4 or did I do something wrong?

Hi,

[Hom=2.7% / Het=2.1% / Mis=95.3%]

It crashes because you have 95.3% missing data!

Best,

O

Thanks for your quick reply! I will see what I can do to fix this issue. Just merged some lifted 23andme files with some CEU 1kg files. But I have no idea why so many missings appear.

Thank you all for the help.

@odelaneau I'll compile the latest version and see if it sorts the issue. Will update in some time!

EDIT - I was able to compile with no issues, but for some reason it says it can't read the index file, even though I created one using GATK.

Initialization:
  * VCF/BCF scanning ...
ERROR: Problem opening index file for [MODGRCh38.vcf]

pbgzipping and indexing the vcf with tabix -p vcf gives me this error

Initialization:
  * VCF/BCF scanning ...
ERROR: No variants to be phased in [MOD_hg19.vcf.gz]

Hi everyone, I am also having the same problem in shapit4, I have the ARS-UCD1.2 reference panel, I did manage to convert it to vcf and index it, but I'm still getting the "no variants to be phased" error. any help, please?

Hi @Nozi-A, I have a similar issue as you. I am using 1000G as a reference panel and the unphased.vcf files were called by GATK HaplotypeCaller. I try to merge my unphased.vcf file with 1000G panel by bcftools but still meet the error:

  • VCF/BCF scanning done (0.04s)
    • Variants [#sites=0 / region=22]

ERROR: No variants to be phased!

Hi @odelaneau, could you offer some suggestions to solve this problem?