nygenome/lancet

lancet missed a large region with many variants

gilhornung opened this issue · 2 comments

Hi Giuseppe,

I used lancet to perform variant calling on a bacterial sample (an evolved strain vs. its ancestor). It missed a very large region containing many obvious variants; you can see an example in the IGV image below. The MD tags exist, and there are no XZ tags.

Here are the commands I used:

lancet -t A25.3813724-3814994.bam -n A1.3813724-3814994.bam -r Bacillus_subtilis_subsp_subtilis_str_168.GCA_000009045.1.30.dna.toplevel.fa -p Chromosome -X 4 > A25.vcf
grep 3815045 A25.vcf
grep 3814359 A25.vcf

You can find all the relevant files in https://owncloud.incpm.weizmann.ac.il/index.php/s/VqXVDfeahNT4hBr

And this is an IGV image of the variant at 3815045 (the bottom track is the "tumor" and the top is the "normal"):

[IGV screenshot: position 3815045]

Hi Gil,

thank you again for reporting the issue and for providing all the necessary files to reproduce your problem. Much appreciated!

There are a couple of reasons why Lancet is having difficulties with this data:

  1. The main problem is that the evolved strain has a very high rate of polymorphism compared to the ancestor (see the IGV screenshot below for the region you shared). Lancet was not designed to work with such a high polymorphism rate: with multiple SNPs in close proximity to each other, Lancet may fail to anchor the assemblies back to the reference because it cannot find a k-mer in the flanking region of the mutation that perfectly matches the reference (see the sketch after this list).

  2. Second, your reads have a higher sequencing error rate than what I usually work with and, combined with the high coverage in your data (>500x), this produces too many candidate events (bubbles) in the graph, which exponentially increases the algorithmic complexity of analyzing the region (you may also have seen higher memory requirements due to the same issue).
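
To make point 1 concrete, here is a minimal sketch of the anchoring problem (hypothetical illustration code, not lancet's actual implementation). With k = 13, the KMERSIZE reported in the VCF record further down, a flank carrying a mismatch roughly every 10 bp has no k-mer that matches the reference exactly, so no anchor can be found:

// Hypothetical sketch (not lancet's code): look for a k-mer in the flank of a
// read/contig that exactly matches the reference flank; such a k-mer is the
// "anchor" needed to place the assembly back on the reference.
#include <iostream>
#include <string>

// Returns the offset of the first k-mer of `flank` that occurs exactly in
// `ref_flank`, or -1 if no exact anchor exists.
int findAnchor(const std::string& flank, const std::string& ref_flank, int k) {
    for (std::size_t i = 0; i + k <= flank.size(); ++i) {
        if (ref_flank.find(flank.substr(i, k)) != std::string::npos) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

int main() {
    const int k = 13;
    // Made-up flanks: the sample flank differs from the reference at positions
    // 5, 15 and 25, so every 13-mer spans at least one mismatch.
    std::string ref_flank    = "GATTCCAGTAACGGTTAGCAACGTTGCA";
    std::string sample_flank = "GATTCTAGTAACGGTCAGCAACGTTACA";
    std::cout << "anchor offset: " << findAnchor(sample_flank, ref_flank, k)
              << "\n";  // prints -1: no exact anchor in this flank
}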

In short, the default parameters of Lancet are too sensitive for your data. However, by simply increasing the minimum coverage ratio used to remove low-coverage nodes from the graph (--cov-ratio) from 0.01 to 0.05, I was able to assemble the region and detect the somatic SNV located at 3814359:

$ lancet -v -t A25.3813724-3814994.bam \
        -n A1.3813724-3814994.bam \
        -r Bacillus_subtilis_subsp_subtilis_str_168.GCA_000009045.1.30.dna.toplevel.fa \
        -p Chromosome:3814359-3814359 \
        --cov-ratio 0.05 \
        > test.vcf

[gnarzisi@pennstation02 test]$ grep 3814359 test.vcf
Chromosome	3814359	.	A	G	1334.28	PASS	SOMATIC;FETS=1334.28;TYPE=snv;LEN=1;KMERSIZE=13;SB=11.5056	GT:AD:SR:SA:DP	0/0:561,0:269,292:0,0:561	0/1:112,242:55,57:109,133:354
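
For intuition on what --cov-ratio controls, here is a toy sketch of this kind of coverage-based pruning (hypothetical code, not lancet's actual implementation, and assuming for illustration that the ratio is applied against the mean node coverage of the window): at 0.01 the error-induced bubbles survive, while at 0.05 they are pruned and only the well-supported node remains.

// Hypothetical sketch of coverage-ratio pruning (names and rule invented for
// illustration; see lancet's documentation for the exact definition).
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Toy k-mer coverage table for one window; values are supporting reads.
    std::unordered_map<std::string, int> node_coverage = {
        {"ACGTTGCAAGTCA", 520},  // true haplotype node (high-coverage data)
        {"ACGTTGCATGTCA", 6},    // error-induced bubble
        {"ACGATGCAAGTCA", 4},    // error-induced bubble
    };

    double mean_coverage = 0.0;
    for (const auto& entry : node_coverage) mean_coverage += entry.second;
    mean_coverage /= node_coverage.size();

    for (double cov_ratio : {0.01, 0.05}) {
        int kept = 0;
        for (const auto& entry : node_coverage) {
            if (entry.second >= cov_ratio * mean_coverage) ++kept;  // else pruned
        }
        std::cout << "cov-ratio " << cov_ratio << ": kept " << kept
                  << " of " << node_coverage.size() << " nodes\n";
    }
}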

I'm not sure this will solve all your problems, though. Most likely you will have to tune other parameters further to achieve better performance on your data.

[IGV screenshot: bacillus_subtilis region]

Sure thing. I find that lancet provides much better results than the previous variant callers I've used (VarScan2, Mutect and Mutect2), at least when I look at the IGV images. So I also have an interest in making it work better.

I think that for this specific project I will go back to VarScan2. I expect that, given the high depth, alignment artefacts will be less critical for VarScan2, and that will be easier than trying to calibrate lancet's parameters.

Maybe you could throw a warning when there are too many candidate events and suggest increasing --cov-ratio? That way the user would at least be aware that something may be wrong.
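
Something along these lines is what I have in mind (a rough hypothetical sketch, not based on lancet's internals; the function name, message and threshold are made up):

// Hypothetical sketch of the suggested warning; function name, threshold and
// message are invented, not taken from lancet's code base.
#include <iostream>
#include <string>

void warnIfTooManyBubbles(const std::string& window, int num_bubbles,
                          int max_expected_bubbles, double cov_ratio) {
    if (num_bubbles > max_expected_bubbles) {
        std::cerr << "WARNING: " << num_bubbles << " candidate bubbles in window "
                  << window << " (expected <= " << max_expected_bubbles << "). "
                  << "The region may be highly polymorphic or error-rich; "
                  << "consider increasing --cov-ratio (currently "
                  << cov_ratio << ").\n";
    }
}

int main() {
    // Example: a dense window like the one in this report.
    warnIfTooManyBubbles("Chromosome:3813724-3814994", 250, 100, 0.01);
}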