agshumate/Liftoff

How to relax filter for maximum distance between two nodes


Thanks for creating this very useful tool!

I'm working with a group of species whose genomes vary considerably in size. I'm trying to use Liftoff, and the reference genome from which I want to lift the gene annotations is the smallest one. I've noticed that for many genes some CDS annotations are missing, yet when I run minimap2 for these genes all CDS features map. The reason they're not included in the lifted annotation is that the distance between two CDS annotations is far greater in the target than in the reference (e.g. 5 kb vs. 150 bp).

I was thus wondering if there is a way to relax the 4th filter when connecting two nodes: "4) The distance from the start of u to the end of v in the target genome is no greater than 2 times that in the reference genome". I can't find any option to do so.

Thank you!

hi there,
I have just published a new release (1.3.0) with a new distance scaling factor option (-d) that allows you to relax this parameter. The default is 2, consistent with the paper. I will state a disclaimer, though: the development and testing of Liftoff has been focused on lifting genes between assemblies of similar sizes. Relaxing this parameter too much may cause genes to map incorrectly across large distances if there are many equally good alignments of exons/CDSs. I think this would be rare, though, so I am eager to hear if this helps your efforts! Please let me know if you have other questions/comments.
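
To make the effect of the scaling factor concrete, here is a minimal Python sketch of the distance rule quoted in the question. It is only an illustration and not Liftoff's actual code: the `Node` class, the `passes_distance_filter` function, and the coordinates are hypothetical, and it assumes both nodes lie on the same strand with u upstream of v.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One aligned block (hypothetical layout, half-open coordinates)."""
    ref_start: int
    ref_end: int
    target_start: int
    target_end: int


def passes_distance_filter(u: Node, v: Node, d: float = 2.0) -> bool:
    """Return True if u and v may be connected under scaling factor d.

    The span from the start of u to the end of v in the target must be
    no greater than d times the same span in the reference.
    """
    ref_span = v.ref_end - u.ref_start
    target_span = v.target_end - u.target_start
    return target_span <= d * ref_span


# Two 100 bp blocks: 150 bp apart in the reference, 5 kb apart in the target.
u = Node(ref_start=0, ref_end=100, target_start=1000, target_end=1100)
v = Node(ref_start=250, ref_end=350, target_start=6100, target_end=6200)

print(passes_distance_filter(u, v))        # False: 5200 > 2 * 350
print(passes_distance_filter(u, v, d=15))  # True: 5200 <= 15 * 350
```

In this toy case the pair is rejected at the default of 2 and only accepted once the factor is raised substantially, which is the situation where the caveat about spurious long-range connections applies.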

Hi, thank you!

In the meantime, I tried making some changes in the code so that the allowed distance between two nodes is the larger of i) 2 times the expected distance (the default) and ii) some fixed distance. While it doesn't change (much) the number of genes that are annotated in the new assemblies, it slightly improves the number of annotations with correct ORFs (which is what I'm most concerned about).
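
Roughly, the modified rule looks like the Python sketch below. This is a simplified illustration rather than the actual patch: the function names are made up for this example, and the 10 kb fixed distance is a placeholder value, not the one used in the real runs.

```python
def max_allowed_span(ref_span: int, scale: float = 2.0,
                     fixed_distance: int = 10_000) -> float:
    """Largest target span allowed for a pair of nodes.

    Takes whichever is larger: scale * ref_span (the default rule) or a
    fixed distance (placeholder value, chosen only for illustration).
    """
    return max(scale * ref_span, fixed_distance)


def can_connect(ref_span: int, target_span: int, scale: float = 2.0,
                fixed_distance: int = 10_000) -> bool:
    """Return True if the target span is within the relaxed limit."""
    return target_span <= max_allowed_span(ref_span, scale, fixed_distance)


# Short reference span: the fixed distance dominates (limit = 10000).
print(can_connect(ref_span=350, target_span=5_200))       # True
# Long reference span: the scaled distance dominates (limit = 100000).
print(can_connect(ref_span=50_000, target_span=120_000))  # False
```

The fixed floor only matters for closely spaced features, so widely spaced nodes are still held to the 2x rule.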

Still, I'm a bit concerned that even for the genome assembly of the same species as that of the reference genome, there's a 25% drop in the number of correct ORFs. My intuition, based on the few genes I've looked into so far, is that it's in part because some exons are not being included. This is interesting because when I map the exon sequences instead of the genes, using minimap2, I'm able to recover all exons for these genes that I've tested. So, I'm guessing that ideally there could be an extra step to polish the gene annotations, perhaps by also considering the mapping of the exon sequences or considering the flanking regions to find missing exons? Do you have any recommendations about this?

Thank you!

hi,
I am curious which species you are working with whose genome sizes differ so much from the reference. And just to clarify, when you say that you are able to map the exon sequences, are you doing a spliced alignment of the transcripts?

Hi,

I'm working with Heliconius butterflies. As an example, the genomes of the two species that have a reference assembly are ~275 and ~380 Mb long.

What I did was to extract each of the CDS sequences for a gene and map these.
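
The extraction step can be sketched in Python as below. This is a simplified, self-contained illustration and not the exact script used: it assumes GFF3 input with a `Parent` attribute on each CDS line and a genome held in memory as a dict of strings, and the function names and toy data are made up for the example. The resulting sequences would then be written to FASTA and mapped individually.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def cds_sequences(gff_lines, genome):
    """Return {parent_id: [cds_seq, ...]} in the order the CDS lines appear.

    gff_lines: iterable of GFF3 lines (tab-separated, 9 columns).
    genome: dict mapping sequence ID to its sequence string.
    """
    out = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # GFF3 coordinates are 1-based and inclusive.
        seq = genome[seqid][start - 1:end]
        if strand == "-":
            seq = reverse_complement(seq)
        out[attrs.get("Parent", "unknown")].append(seq)
    return dict(out)


# Toy data, for illustration only.
genome = {"chr1": "AAACCCGGGTTTACGT"}
gff = [
    "##gff-version 3",
    "chr1\t.\tCDS\t4\t6\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\t.\tCDS\t10\t12\t.\t+\t0\tID=cds2;Parent=tx1",
    "chr1\t.\tCDS\t1\t4\t.\t-\t0\tID=cds3;Parent=tx2",
]
print(cds_sequences(gff, genome))
# {'tx1': ['CCC', 'TTT'], 'tx2': ['GTTT']}
```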

I see, thanks. We made the decision to align the complete gene sequences and then convert the exon coordinates because spliced alignment is imperfect at resolving intron/exon structures, especially with small exons. We wanted to avoid this limitation, considering we already know the intron/exon structure from the reference. This strategy of course comes with the assumption that the genes in the target genome are very similar to the reference both in sequence and in size. Target genomes where the CDS features are orders of magnitude farther apart than in the reference may simply be too different for a lift-over strategy to be accurate.
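
As a very simplified illustration of the "align the gene, then convert the exon coordinates" idea, the Python sketch below lifts a reference interval through a set of gap-free aligned blocks. It is not Liftoff's implementation: the block layout, the function names, and the coordinates are hypothetical, and it assumes same-strand blocks with half-open coordinates.

```python
from typing import List, Optional, Tuple

# (ref_start, ref_end, target_start): one gap-free aligned block, half-open.
Block = Tuple[int, int, int]


def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map one reference position to the target, or None if unaligned."""
    for ref_start, ref_end, target_start in blocks:
        if ref_start <= pos < ref_end:
            return target_start + (pos - ref_start)
    return None


def lift_interval(start: int, end: int,
                  blocks: List[Block]) -> Optional[Tuple[int, int]]:
    """Lift half-open [start, end); None if either boundary is unaligned."""
    new_start = lift_position(start, blocks)
    new_last = lift_position(end - 1, blocks)
    if new_start is None or new_last is None:
        return None
    return new_start, new_last + 1


# Gene alignment with a 50 bp insertion in the target after reference position 100.
blocks = [(0, 100, 1000), (100, 300, 1150)]

print(lift_interval(80, 120, blocks))   # (1080, 1170)
print(lift_interval(400, 450, blocks))  # None: exon lies outside the aligned blocks
```

The exon boundaries come straight from the reference annotation, so no splice-site inference is needed; the cost is that an exon falling outside the aligned blocks is simply lost.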