milkschen/leviosam2

About long read did not run with reconcile at setting

Closed this issue · 4 comments

Hi,

I noticed that the provided workflow for the long read did not have the reconcile module as for the short read (workflow code). I wonder whether there is a reason for not using the reconcile module when running with the long read.

As in the recommended setting for long read (as in ../configs/pacbio_all.yaml), there is a setting likes nm_threshold: 0. I wonder whether this setting means that the tool will directly realign all alignments with an NM>0 while disregarding the lifted information.

Thanks.

It's fine to use the reconcile module for long reads, as what we did for short reads. One difference between the short- and long-read pipelines is that the long read pipeline has a stricter set of deferring rules, resulting in less reads deferred. Thus, we wanted to trust the re-mapped results more and didn't include the reconcile module in the pipeline. It's possible that including reconcile can further improve the results.

Yeah, as you observed, pacbio_all.yaml re-aligns all the lifted alignments. This means that we perform localized dynamic programming to polish the results around the lifted position of the alignment using the target reference. So while the base-level alignments are re-done, in general the read is mapped to somewhere near the lifted position. In this way, the lifted information is not fully discarded (nucleotide alignments are discarded, but the mapping is not).

Meanwhile, re-aligning a read to the target reference (we used the term re-map in our manuscript) would just re-do everything, so the read can go anywhere in the reference that has a higher alignment score. In this way the lift-over info is not considered.

Got it, many thanks for your prompt reply! The workflow is clear to me now.

Currently, I am trying to apply leviosam2 to ONT long-read data. I observed that leviosam2 would generate far less committed alignments compared to deferred alignments (1:20) for ONT data. If I understand correctly, it is better to use the reconcile module to improve the deferred alignments in the case of many deferred alignments generated for ONT data. So I will use the reconcile module in my testing.

As the levioSAM2 system works for "lifting" read mappings, do you ever try to apply the tool direct to other types of annotations, such as variants and genotypes? What do you think when applying "lifting" for variants? I am really interested in how will the performance has compared between (alignment to source -> lift alignment to target -> variant calling) and (alignment to source -> variant calling at source -> lift variants to target).

Thanks for the wonderful tool built. The tool and preprint are quite impressive.

JH

These are great points.

Our long-read commit/defer/suppress decisions are developed using HiFi data, so it's reasonable that they don't work well for ONT. I think it's a great idea to include the reconcile module for ONT. I'll do more tests to evaluate if we'd like to include reconcile to all long reads in a future release.

We have a beta version for lifting over BED files (leviosam2 bed). It's faster than UCSC liftOver, but we're still checking some edge cases (sometimes the results are off by 1 bp).

We haven't implemented VCF lift-over, where reference allele differences need to be considered. CrossMap and Picard LiftoverVcf can lift over VCF files. We tested CrossMap and it had a 1-2% accuracy drop when lifting over between hg19<->hg38; its performance for chm13->hg38 was much worse, probably because of more differed REF alleles. In general, our evaluation shows that it's difficult to obtain high-accuracy results by lifting over variants. But if your application focuses more on easily liftable regions, lifting over variants can be more efficient.

Looking forward to the support for ONT data. I tested using leviosam2 for variant calling with Clair3. The results show that leviosam2's lift-over from CHM13 to GRCh38 has slightly better performance for both HiFi (tested at 50x ) and ONT (tested at 34x) data compared to using GRCh38.
I also tried Picard LiftoverVcf to lift-over variants directly, but it has much worse performance. Currently, I think lift-over alignment is needed for the deferred part which is difficult to handle with chains and variants information provided. Thanks for your suggestions.

JH