mklarqvist/tomahawk

tomahawk import does not generate a twk file

biojiangke opened this issue · 3 comments

I've been trying tomahawk with diffferent VCF data sets. For one set, it worked very well and the results looked great. When I tried another VCF, for the same region, but with more samples, everything ran fine at the "tomahawk" import step but it did not generate a "twk" file. It did not throw an error message either.

The command run was (also tried with default options):

tomahawk import -I locus.bcf -o locus_snp -m 0.2 -h 0.001

The output for the good set has these messages:

Program: tomahawk beta-0.6.1
Contact: Marcus D. R. Klarqvist <mk819@cam.ac.uk>
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2018-11-08 10:31:01,989][LOG] Calling import...
[2018-11-08 10:31:01,997][LOG][VCF] Constructing lookup table for 603 contigs...
[2018-11-08 10:31:01,998][LOG][RLE] Samples: 86 > 15... Skip
[2018-11-08 10:31:01,998][LOG][RLE] Samples: 86 < 4095...
[2018-11-08 10:31:01,998][LOG][RLE] Using 16-bit width...
[2018-11-08 10:31:01,998][LOG][WRITER] Opening: locus_snp.twk...
[2018-11-08 10:31:02,129][LOG][WRITER] Wrote: 2,576 variants to 6 blocks...

The messages for the bigger data set are:

Program: tomahawk beta-0.6.1
Contact: Marcus D. R. Klarqvist <mk819@cam.ac.uk>
Documentation: https://github.com/mklarqvist/tomahawk
License: MIT
----------
[2018-11-08 10:32:18,162][LOG] Calling import...
[2018-11-08 10:32:18,164][LOG][VCF] Constructing lookup table for 603 contigs...

Apparently, tomahawk didn't import any variants.

The VCF files came from GATK pipeline. Both VCFs are sliced at the same locus of a 2Mbp region, the working one with 86 samples, and not working one with 353 samples.

Could you take a look at this?

Hello @biojiangke,
Can you check and make sure that the variants are diploid, biallelic, and SNVs. These are the only variants kept by tomahawk. Sites with mixed ploidy are also filtered out. Lastly, make sure they are not filtered out because of how you parameterized the missingness threshold.
If you should still have some variants could you send me your VCF please?

Hi,

Thanks for the quick response. I tried the following:

Filter out non-biallelic variants:

vcftools --gzvcf locus.vcf.gz --min-alleles 2 --max-alleles 2 --recode --stdout | bgzip -c > locus.biallelic.vcf.gz

Filter out indels:

vcftools --gzvcf locus.biallelic.vcf.gz --remove-indels --recode --recode-INFO-all --out SNPs_only --stdout | bgzip -c > locus.biallelic.snp.vcf.gz

I can confirm the filters as I counted the number of variants before and after filtering and spot checked a few regions. The resulted VCF still does't import for tomahawk, even trying missingness filter from 0.1 to 0.9. Plus, the working VCF have non-biallelic and indels, and tomahawk ignored them and imported the "good" variants. So I don't think those are causing the problem.

I'm bond by regulations and would not be able to share the VCF. But I'll do more tests and report later if I find anything.

This should no longer be happening as of release 0.7.0 as we now rely on htslib to do the parsing.