agshumate/Liftoff

What happens if there are multiple copies of a gene in reference gff


Hi,

Thanks for making this quick tool, it's been working well for me.
I have lifted over the annotation of one plant species to another closely related one, and I've found that about 60% of the multi-copy genes in the new annotation have 15-30 additional copies.
This is quite unexpected, as you would expect fewer genes at higher copy numbers, and I wanted to make sure I didn't do something wrong. The gff annotation I ran Liftoff with may contain multiple copies of the same gene with unique IDs. How will Liftoff behave in such a case? Should I remove multiple copies of the same gene from the annotation file to get an accurate estimate of additional copies in the target genome?

Thanks :)

Hi,
You do not need to remove multiple copies of the same gene, as long as they have unique IDs. Liftoff attempts to map every gene from the reference first before looking for additional copies. It will only annotate an extra copy if no other gene from the reference is annotated at that location. So if, for example, there are 2 identical copies annotated in the reference, it will find those before annotating an extra 3rd copy. You can check this by pulling out all of the original gene IDs and the additional copies from the new target gff and confirming that they do not overlap one another. If this is not the case, please let me know and we can look more closely at your specific data.
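For anyone who wants to run the check described above, here is a minimal Python sketch. It rests on assumptions that are not confirmed in this thread: that the target gff marks extra copies with an `extra_copy_number` attribute greater than 0 in column 9, and that overlap should be tested on coordinates alone, ignoring strand. The function names (`parse_genes`, `is_extra_copy`, `overlapping_pairs`) are made up for illustration; if your Liftoff version labels copies differently, adjust `is_extra_copy`.

```python
from collections import defaultdict


def parse_genes(gff_lines):
    """Yield (seqid, start, end, attrs) for every 'gene' feature line."""
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 is a ';'-separated list of key=value pairs.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        yield cols[0], int(cols[3]), int(cols[4]), attrs


def is_extra_copy(attrs):
    # ASSUMPTION: extra copies carry extra_copy_number > 0;
    # genes lifted directly from the reference carry 0 or no such attribute.
    return int(attrs.get("extra_copy_number", "0")) > 0


def overlapping_pairs(gff_lines):
    """Return (original_id, extra_copy_id) pairs whose coordinates overlap."""
    originals = defaultdict(list)
    extras = defaultdict(list)
    for seqid, start, end, attrs in parse_genes(gff_lines):
        bucket = extras if is_extra_copy(attrs) else originals
        bucket[seqid].append((start, end, attrs.get("ID", "?")))

    hits = []
    for seqid, copies in extras.items():
        for c_start, c_end, c_id in copies:
            for o_start, o_end, o_id in originals[seqid]:
                # Closed intervals overlap when each starts before the other ends.
                if c_start <= o_end and o_start <= c_end:
                    hits.append((o_id, c_id))
    return hits
```

Usage would look like `overlapping_pairs(open("target.gff"))`, where `target.gff` is a placeholder file name; an empty list means no extra copy overlaps an original gene. The nested loop is quadratic per sequence, which is simple but may be slow on very large annotations.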

Hi!

Indeed they don't seem to overlap on the target genome. Thank you for clarifying this.

On another note, I redid the analysis with version 1.2.1 (originally done with 1.1.1) and this resulted in very different copy numbers. While the run in 1.1.1 resulted in over 4000 genes with 2 copies or more, I now only get 215 multi-copy genes. The enrichment of genes with 15-30 copies has disappeared in this version. I also get a much higher number of mapped genes (originally around 80% of the annotation, now around 89%). I am aware that version 1.1.1 had a bug that left out mapped genes, but how does this affect the number of multi-copy genes?
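To compare copy numbers between two runs, one possible way to tally them from each output gff is sketched below in Python. It is only a sketch under stated assumptions: that extra copies are flagged with `extra_copy_number` greater than 0, and that an extra copy's `ID` is the source gene's ID plus a trailing `_<number>` suffix. Neither is confirmed in this thread, and the counts quoted above may have been computed differently. `source_gene_id`, `copies_per_gene`, and `multi_copy_genes` are hypothetical helper names; pass your own `key` function if your IDs follow another scheme.

```python
import re
from collections import Counter


def source_gene_id(attrs):
    """Map a gene's attributes to the ID of the gene it was copied from."""
    gene_id = attrs.get("ID", "")
    # ASSUMPTION: extra copies have extra_copy_number > 0 and an ID of the
    # form <source_id>_<n>; strip that suffix to recover the source ID.
    if int(attrs.get("extra_copy_number", "0")) > 0:
        return re.sub(r"_\d+$", "", gene_id)
    return gene_id


def copies_per_gene(gff_lines, key=source_gene_id):
    """Count how many 'gene' features in the gff trace back to each source gene."""
    counts = Counter()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        counts[key(attrs)] += 1
    return counts


def multi_copy_genes(counts, minimum=2):
    """Keep only genes present in at least `minimum` copies."""
    return {gene: n for gene, n in counts.items() if n >= minimum}
```

Running `multi_copy_genes(copies_per_gene(open(path)))` on the output of each version (with `path` as a placeholder) gives two dictionaries that can be compared directly, for example by length or by the distribution of their values.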

Hi,
I cannot say for sure without looking at your specific data, but my best explanation is Liftoff's requirement that genes cannot map to overlapping loci. When genes were being incorrectly left out in v1.1.1, there were more open loci to map extra copies to.

Hi,

This could indeed be the case, and there might be many multi-copy genes in my annotation which "took the place" of the missing genes. But I still don't understand how this led to a new file with 103,000 genes in v1.1.1, when in v1.2.1 there were only 32,313 genes. The original gff file had 33,899 genes. This means that in v1.1.1 many more open loci were found than the 4,242 extra loci of missing genes. And this was done with the same parameters (-s 0.6, -a 0.6, -sc 0.99). When increasing -a and -s to 0.9 in v1.2.1 I get exactly the same results.

I would be happy to send you the gff files; I'm just not sure where I should put them (or whether to send them by email). Let me know!

Sure, if your gff files are not too large you can email them to me at ashumate@jhmi.edu.