core dump error on import
peterdfields opened this issue · 20 comments
Hi,
I'm trying to import a bcf file that was generated by first converting a GATK vcf to bcf with bcftools. I'm getting the following error:
Program: tomahawk-beta-0.7.1 (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.4.0; htslib 1.9
Contact: Marcus D. R. Klarqvist <mk819@cam.ac.uk>
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2019-05-20 16:53:15,426][LOG] Calling import...
[2019-05-20 16:53:15,426][LOG][READER] Opening snp.bcf...
[2019-05-20 16:53:15,433][LOG][VCF] Constructing lookup table for 608 contigs...
[2019-05-20 16:53:15,434][LOG][VCF] Samples: 56...
[2019-05-20 16:53:15,434][LOG][WRITER] Opening snp.twk...
00000000
00001010
tomahawk: lib/core.cpp:117: void tomahawk::twk1_t::calculateHardyWeinberg(): Assertion `ref == 0 || ref == 1 || ref == 4 || ref == 5' failed.
Aborted (core dumped)
The SNPs seem to meet the expectations of the program. I'm not entirely sure what's going wrong here. Please let me know if additional info would be useful.
Thanks for reporting this, @peterdfields. The problem appears to be an assertion I've placed in the computation of Hardy-Weinberg equilibrium. For some reason the offending line has an allele whose encoding is not one of the expected biallelic values (0, 1, 4, or 5 in my internal format). This should not happen, as non-biallelic and non-diploid sites should be filtered out. There must be some edge case I have not covered.
Could you find the offending line and report it to me? By email if it is not public.
Hi @mklarqvist. Given that an allele remains non-biallelic, I have to assume it has somehow made it past GATK SelectVariants and vcftools filtering for biallelic SNPs. Is there a way to force tomahawk to output the line that has the error? I tried a version of the program built with make DEBUG=true, but that doesn't change the import stdout info.
@peterdfields I'll update the error message to reflect the offending variant line number and offending allele encoding. This is something I should've done in the first place.
@mklarqvist would there be an alternative method to localize the problem line?
@peterdfields A crude way would be like a manual binary search:
- Input first half of file and check (easiest way is to pipe the data in from bcftools | head -n | tomahawk import)
- Input second half and check
- Keep splitting the half that fails. You should be able to deduce pretty quickly what the offending line is
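The splitting procedure above can be sketched in Python. This is only an illustration, not part of tomahawk: `find_failing_record` and the `fails` predicate are hypothetical names, and in practice `fails` would have to write the header plus the chosen records to a file, run the import on it, and report whether it aborted. The sketch assumes the crash is triggered by a single record on its own, so a subset fails exactly when it contains a bad record.

```python
from typing import Callable, List, Optional


def find_failing_record(
    records: List[str],
    fails: Callable[[List[str]], bool],
) -> Optional[int]:
    """Return the index of the first record that makes `fails` return True.

    Assumes `fails(subset)` is True exactly when the subset contains a
    bad record. Returns None if the full input does not fail at all.
    """
    if not records or not fails(records):
        return None
    lo, hi = 0, len(records)  # invariant: a bad record lies in records[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(records[lo:mid]):
            hi = mid  # the first half fails, keep splitting it
        else:
            lo = mid  # the first half is clean, so the second half fails
    return lo


if __name__ == "__main__":
    # Toy stand-in for "run the import and see whether it aborts".
    demo = ["rec%d" % i for i in range(10)]
    print(find_failing_record(demo, lambda subset: "rec7" in subset))
```

Each round halves the candidate range, so a file with a million records needs roughly twenty import runs.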
I'm digging through the code to find the problem.
@mklarqvist Okay, I followed your advice about doing the manual binary search. The line from the vcf that is causing the error is as follows:
000011F|quiver 1151 . A T 1216.54 . . GT:AD:DP:GQ:PL 1/1:0,1:1:3:40,3,0 1/1:0,3:3:9:109,9,0 1/1:0,3:3:9:118,9,0 1/1:0,5:5:15:155,15,0 1/1:0,2:2:6:68,6,0 1/1:0,2:2:6:69,6,0 1/1:0,4:4:12:158,12,0 1/1:0,2:2:6:86,6,0 1/1:0,3:3:9:124,9,0 1/1:0,3:3:9:116,9,0 1/1:0,1:1:3:43,3,0 0/0:4,0:4:9:0,9,135 ./.:0,0:0:.:0,0,0 1/1:0,3:3:9:125,9,0 1/1:0,5:5:15:202,15,0
@peterdfields Thanks for helping me get to the bottom of this. Very helpful! I am investigating this.
@mklarqvist no worries! I'm looking forward to exploring tomahawk.
Hi @mklarqvist. Any news about this issue? Thank you again for your help.
Hello @peterdfields. Sorry for the delay in resolving this. I returned today from a trip abroad. Will pick up where I left off. Thanks for your patience!
Hi @mklarqvist. Okay, great. Thank you again for your assistance!
Hi @mklarqvist. Any luck on tracking down this issue?
Hey @mklarqvist,
I got the same issue. I think the problem is related to missing data, or at least to './.' in the GT field. Replacing missing data with random genotypes or removing loci with any missing data solves the import problem in my case.
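For anyone who wants to try the "remove loci with any missing data" workaround, here is a minimal Python sketch. It is a workaround only and does not fix the assertion in tomahawk. It assumes plain, tab-delimited VCF text (a BCF would first need converting, for example with bcftools view), and the function names are made up for illustration.

```python
from typing import Iterable, Iterator


def has_missing_genotype(vcf_line: str) -> bool:
    """Return True if any sample's GT contains a missing allele ('.')."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False  # no FORMAT/sample columns on this line
    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return False  # no genotype field to inspect
    gt_index = fmt.index("GT")
    for sample in fields[9:]:
        parts = sample.split(":")
        # Treat a sample with a dropped GT field as missing.
        gt = parts[gt_index] if gt_index < len(parts) else "."
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            return True
    return False


def drop_sites_with_missing(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and only those records with no missing GT."""
    for line in lines:
        if line.startswith("#") or not has_missing_genotype(line):
            yield line
```

Because this drops every site with even one missing call, it can remove a large share of a dataset with many samples, so it is best used as a quick check that './.' really is the trigger.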
I have the same problem too. Yes, removing sites with ANY missing data will resolve the situation, but this is not really a practical approach for my dataset.
Thanks
Same problem here... Is there a different way we can encode missing data so that it can be captured?
Same problem here ... with command line:
tomahawk import -i allsamples_All.bcf -o snp m 0.2 h 0.01
Program: tomahawk-beta-0.7.1 (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.4.4; htslib 1.9
Contact: Marcus D. R. Klarqvist <mk819@cam.ac.uk>
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2019-10-30 10:07:36,137][LOG] Calling import...
[2019-10-30 10:07:36,138][LOG][READER] Opening allsamples_All.bcf...
[2019-10-30 10:07:36,139][LOG][VCF] Constructing lookup table for 43 contigs...
[2019-10-30 10:07:36,139][LOG][VCF] Samples: 573...
[2019-10-30 10:07:36,139][LOG][WRITER] Opening snp.twk...
00000000
00000000
00000000
00000000
00001010
tomahawk: lib/core.cpp:117: void tomahawk::twk1_t::calculateHardyWeinberg(): Assertion `ref == 0 || ref == 1 || ref == 4 || ref == 5' failed.
Aborted (core dumped)
And, besides that,
tomahawk import -i allsamples_All.bcf -o snp -m 0.2 -h 0.01
tomahawk: invalid option -- 'm'
[2019-10-30 10:07:30,955][ERROR] Unrecognized option: ?
And the examples I found all use the dash, as in -m 0.xx -h 0.001 etc.
Isn't the dash needed for the options?
Hi. Is there an update for this problem? I have the same problem as well. Thanks
I am getting the same -m error on Ubuntu 20.
$ tomahawk import -i snp-thin.bcf -o snp -m 0.2 -h 0.001
tomahawk: invalid option -- 'm'
[2021-08-26 13:26:31,717][ERROR] Unrecognized option: ?
I think I figured out that for 'import' the -m has changed to -n and -h is now -H. But I am getting the same core dump.
$ tomahawk import -i snp-thin.bcf -o snp -n 0.2 -H 0.001
Program: tomahawk-beta-0.7.1 (Tools for computing, querying and storing LD data)
Libraries: tomahawk-0.7.0; ZSTD-1.4.4; htslib 1.9
Contact: Marcus D. R. Klarqvist <mk819@cam.ac.uk>
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2021-08-26 13:43:19,573][LOG] Calling import...
[2021-08-26 13:43:19,574][LOG][READER] Opening snp-thin.bcf...
[2021-08-26 13:43:19,598][LOG][VCF] Constructing lookup table for 2,100 contigs...
[2021-08-26 13:43:19,598][LOG][VCF] Samples: 96...
[2021-08-26 13:43:19,598][LOG][WRITER] Opening snp.twk...
00000000
00000000
00000000
00000000
00000000
00000001
00000000
00000001
00001010
tomahawk: lib/core.cpp:117: void tomahawk::twk1_t::calculateHardyWeinberg(): Assertion `ref == 0 || ref == 1 || ref == 4 || ref == 5' failed.
Aborted (core dumped)
I also get this
tomahawk: lib/core.cpp:117: void tomahawk::twk1_t::calculateHardyWeinberg(): Assertion `ref == 0 || ref == 1 || ref == 4 || ref == 5' failed.
Aborted (core dumped)
error. I don't know what is not working.