lancet missed a large region with many variants
gilhornung opened this issue · 2 comments
Hi Giuseppe,
I used lancet to perform variant calling on a bacterial sample (evolved strain vs. ancestor). It missed a very large region where there are many obvious variants. You can see an example in the IGV image below. The MD tags exist, and there are no XZ tags.
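For completeness, this is a quick way to verify the tags (a rough sketch only, assuming pysam is installed):

import pysam

# Count how many reads carry MD and XZ tags in the tumor BAM
bam = pysam.AlignmentFile("A25.3813724-3814994.bam", "rb")
md = xz = total = 0
for read in bam:
    total += 1
    if read.has_tag("MD"):
        md += 1
    if read.has_tag("XZ"):
        xz += 1
print(f"MD tags: {md}/{total}, XZ tags: {xz}/{total}")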
Here are the commands I used:
lancet -t A25.3813724-3814994.bam -n A1.3813724-3814994.bam -r Bacillus_subtilis_subsp_subtilis_str_168.GCA_000009045.1.30.dna.toplevel.fa -p Chromosome -X 4 > A25.vcf
grep 3815045 A25.vcf
grep 3814359 A25.vcf
You can find all the relevant files in https://owncloud.incpm.weizmann.ac.il/index.php/s/VqXVDfeahNT4hBr
And this is an IGV image of the variant at 3815045 (the bottom is the "tumor" and the top is the "normal")
Hi Gil,
thank you again for reporting the issue and for providing all the necessary files to reproduce your problem. Much appreciated!
There are a couple of reasons why Lancet is having difficulties with this data:
- the main problem is that the evolved strain has a very high rate of polymorphism compared to the ancestor (see the IGV screenshot below for the region that you shared). Lancet was not designed to work with such a high polymorphic rate: with multiple SNPs in close proximity to each other, Lancet may have problems anchoring the assemblies back to the reference, because it cannot find a k-mer in the flanking region of the mutation that perfectly matches the reference (see the small sketch after this list).
- second, your reads have a higher sequencing error rate than what I usually work with, and, combined with the high coverage in your data (>500x), this produces too many candidate events (bubbles) in the graph, which exponentially increases the complexity of analyzing the region (you may also have noticed higher memory requirements due to the same issue).
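To make the first point concrete, here is a toy sketch of the anchoring step (illustration only, not Lancet's actual code): to place an assembled sequence back on the genome, we need a k-mer in the flank of the mutation that matches the reference exactly, and a dense cluster of SNPs can leave no such k-mer.

def find_anchor(contig, reference, k=13):
    # Return the first position in contig whose k-mer occurs exactly in
    # reference, or -1 if no perfect-match anchor exists.
    for i in range(len(contig) - k + 1):
        if contig[i:i + k] in reference:
            return i
    return -1

# A flank carrying a mismatch every ~5 bp cannot provide a perfect 13-mer:
ref = "ACGTACGTACGTACGTACGTACGT"
flank = "ACGTAAGTACTTACGAACGTCCGT"   # several SNPs spread across the flank
print(find_anchor(flank, ref))       # -> -1, no anchor found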
In short, the default parameters of Lancet are too sensitive for your data. However, by simply increasing the minimum coverage used to remove low-coverage nodes in the graph (--cov-ratio) from 0.01 to 0.05, I was able to assemble the region and detect the somatic SNV located at 3814359:
$ lancet -v -t A25.3813724-3814994.bam \
-n A1.3813724-3814994.bam \
-r Bacillus_subtilis_subsp_subtilis_str_168.GCA_000009045.1.30.dna.toplevel.fa \
-p Chromosome:3814359-3814359 \
--cov-ratio 0.05 \
> test.vcf
[gnarzisi@pennstation02 test]$ grep 3814359 test.vcf
Chromosome 3814359 . A G 1334.28 PASS SOMATIC;FETS=1334.28;TYPE=snv;LEN=1;KMERSIZE=13;SB=11.5056 GT:AD:SR:SA:DP 0/0:561,0:269,292:0,0:561 0/1:112,242:55,57:109,133:354
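To see concretely what raising --cov-ratio does, here is a toy sketch of coverage-based pruning (illustration only, not Lancet's actual implementation): k-mers whose coverage falls below cov_ratio times the average coverage are dropped, so low-coverage error bubbles never make it into the graph.

def prune_low_coverage(kmer_cov, cov_ratio):
    # Drop k-mers whose coverage is below cov_ratio * mean coverage.
    mean_cov = sum(kmer_cov.values()) / len(kmer_cov)
    threshold = cov_ratio * mean_cov
    return {k: c for k, c in kmer_cov.items() if c >= threshold}

# At >500x, a 6x k-mer produced by sequencing errors survives at 0.01 but not at 0.05:
cov = {"ACGTACGTACGTA": 540, "CGTACGTACGTAC": 525, "CGTACGTACGAAC": 6}
print(prune_low_coverage(cov, 0.01))   # error k-mer kept, so extra bubbles remain
print(prune_low_coverage(cov, 0.05))   # error k-mer removed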
Not sure this will solve all your problems, though. Most likely you will have to tune other parameters further to achieve better performance on your data.
Sure thing. I find that lancet provides much better results than the previous variant callers I've used (VarScan2, Mutect and Mutect2), at least when I look at the IGV images. So I also have an interest in making it work better.
I think that for this specific project I will go back to VarScan2. I expect that, due to the high depth, alignment artefacts will be less critical for VarScan2, and that it will be easier than trying to calibrate lancet's parameters.
Maybe you could throw a warning when there are too many candidate events and suggest increasing --cov-ratio? This way the user would at least be aware that something may be wrong.
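Something along these lines, purely as a hypothetical sketch of the suggested behaviour (the threshold and names are made up, not existing Lancet code):

MAX_EXPECTED_EVENTS = 1000  # made-up threshold, for illustration only

def warn_if_too_many_events(n_events, window):
    # Warn when a window produces an unusually large number of candidate
    # events, and point the user at --cov-ratio.
    if n_events > MAX_EXPECTED_EVENTS:
        print(f"WARNING: {n_events} candidate events in {window}; "
              "consider increasing --cov-ratio.")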