lh3/fermikit

Sorting; low sensitivity

KrzysztofMKozak opened this issue · 4 comments

Dear Heng,
Looks like a great tool and I tried to use it for a subset of the Anopheles 1000 Genomes data (https://www.malariagen.net/apps/ag1000g/phase1-AR3/index.html).
The first problem I encountered is at the sorting stage: once the first sam was made, fermi proceeded to sort it and generated ~150k tiny sam files - I had to kill it eventually.
The second issue: I ran clean up manually and genotyped with GATK, as GATK had been originally used to genotype after BWA-mapping the same reads to reference. When we compare the BWA and Fermikit genotypes, the Fermi pipeline results in ~18x fewer variants. What would be the key parameters that could help with this, please?
Best,
Chris

lh3 commented

Please use the command line in README for variant calling:

fermi.kit/run-calling -t16 bwa-indexed-ref.fa prefix.mag.gz | sh

GATK won't work well.

lh3 commented

Also to generate multi-sample vcf, first mapping and sort with fermi.kit/bwa mem -x intractg ref.fa unitigs.mag.gz && fermi.kit/htsbox samsort and then do calling with:

fermi.kit/htsbox pileup -cuf ref.fa pre1.srt.bam pre2.srt.bam > out.raw.vcf
fermi.kit/k8 fermi.kit/hapdip.js vcfsum -f out.raw.vcf > out.flt.vcf

See also README.

Thank you! Just to clarify, GATK was used in order to compare the results of mapping raw reads and mapping unitigs, without having the extra confound of using different variant callers.

lh3 commented

Unitigs have very different characteristics from short reads. Most existing tools won't play well with unitigs.