alekseyzimin/EviAnn_release

Detection of Pseudogenes failed, error with proteins_all.faa file


Hello,

I want to use EviAnn on some test data I have for Citrus sinensis. I followed the instructions on the README page, but I am encountering an issue I have not been able to resolve. I keep getting this error:

[Mon Jul 22 19:05:39 UTC 2024] Aligning proteins
Error with file '.../proteins_all.faa'
[Mon Jul 22 19:05:40 UTC 2024] Filtering protein alignment file
[Mon Jul 22 19:05:42 UTC 2024] Running exonerate on the filtered sequences
[Mon Jul 22 19:05:42 UTC 2024] Detecting and annotating processed pseudogenes
[Mon Jul 22 19:05:42 UTC 2024] Detection of pseudogenes failed

I prepared my RNA-seq data and protein homology data as instructed and ran the sample command. Could this be an issue with exonerate?

Hello, ".../proteins_all.faa" does not look like a valid path, maybe you meant "../proteins_all.faa" ?

Hello,

I was actually able to fix this problem. I have another question about tblastn and exonerate. When running the pipeline as is, tblastn took several hours and never ran to completion. I tweaked the tblastn command by adding -subject_besthit to it (in the command found in the eviprot.sh script), and it was then able to finish. However, like tblastn, exonerate also took a considerable amount of time once the pipeline reached it. The genome I'm working with is around 371 Mb, and the protein file is around 22 MB. Is this a normal amount of time for a genome and protein file of this size, and are there ways to ensure tblastn runs to completion? Thank you!
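For reference, a rough sketch of the kind of change I made is below. The file names, thread count, and output format are placeholders rather than the actual values in the eviprot.sh script; the only relevant part is the added -subject_besthit flag (available in recent BLAST+ releases).

# Hypothetical example only: paths and parameters are placeholders,
# not the real tblastn command from eviprot.sh.
tblastn \
  -query proteins_all.faa \
  -db genome_db \
  -subject_besthit \
  -num_threads 16 \
  -outfmt 6 \
  -out proteins_vs_genome.out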

eviprot is the longest part of the pipeline: it takes 2-3 days to align about 500 MB of protein sequences to a 2.5 Gbp mouse genome on a 24-core Intel Xeon server. The -subject_besthit option will reduce sensitivity a lot. Aligning 22 MB of protein sequence to a 371 Mbp genome should be relatively trivial, 2-3 hours at the most. What computer are you using (cores/RAM)?

I am running this on a 32-core server with 246 GB of RAM in total. For more detail: when running tblastn without -subject_besthit, it was able to convert all the .tmp files to .out files except for one batch. That one batch is what kept tblastn running indefinitely.
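To find the stuck batch, I just compared the intermediate files. A rough sketch of that check is below; the .tmp/.out naming is an assumption based on what I saw in my working directory, not something documented by EviAnn.

# Assumed naming: each tblastn batch leaves a .tmp file that gets a matching
# .out file once it finishes. List batches with no finished output yet.
for tmp in *.tmp; do
  out="${tmp%.tmp}.out"
  if [ ! -s "$out" ]; then
    echo "unfinished batch: $tmp"
  fi
done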

Thank you, I will re-check running with this option (-subject_besthit); maybe I am confusing it with something else, because according to its description it should not be harmful to the results.

I confirm that using the -subject_besthit option in tblastn does not affect the results. I will include it in the next release.

Please check out the new release I just posted.