Detection of Pseudogenes failed, error with proteins_all.faa file

Question

Opened this issue 4 months ago · 8 comments

Hello,

I want to run EviAnn on some test data I have for Citrus sinensis. I followed the instructions on the README page, but I keep getting this error and have not been able to resolve it:

[Mon Jul 22 19:05:39 UTC 2024] Aligning proteins
Error with file '.../proteins_all.faa'
[Mon Jul 22 19:05:40 UTC 2024] Filtering protein alignment file
[Mon Jul 22 19:05:42 UTC 2024] Running exonerate on the filtered sequences
[Mon Jul 22 19:05:42 UTC 2024] Detecting and annotating processed pseudogenes
[Mon Jul 22 19:05:42 UTC 2024] Detection of pseudogenes failed

I prepared my RNA-seq data and protein homology data as instructed. I ran the sample command. Could this possibly be an issue with exonerate?

Answer 1 · 2024-07-29T14:36:35.000Z

Hello, ".../proteins_all.faa" does not look like a valid path. Maybe you meant "../proteins_all.faa"?
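
A quick pre-flight check can catch a bad protein path before the pipeline starts. The sketch below is not part of EviAnn: the helper name `check_protein_fasta` is hypothetical, and it only verifies that the path points to a non-empty file whose first non-blank line is a FASTA header.

```python
from pathlib import Path


def check_protein_fasta(path):
    """Return a list of problems with a FASTA path (empty list if none found).

    Hypothetical helper for illustration; EviAnn does not ship this function.
    """
    p = Path(path)
    if not p.is_file():
        return [f"not a file: {p}"]
    if p.stat().st_size == 0:
        return [f"file is empty: {p}"]
    # Look at the first non-blank line only; a FASTA record starts with '>'.
    with p.open() as handle:
        first = next((line for line in handle if line.strip()), "")
    if not first.startswith(">"):
        return [f"first non-blank line is not a FASTA header: {p}"]
    return []
```

Calling `check_protein_fasta("proteins_all.faa")` from the directory where the pipeline is launched returns an empty list when the file passes these basic checks, and a one-item list describing the problem otherwise.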

Answer 2 · 2024-07-29T16:13:02.000Z

Hello,

I was actually able to fix this problem. I have another question, about tblastn and exonerate. Run as is, tblastn takes several hours and never ran to completion. I tweaked the tblastn command in the eviprot.sh script by adding -subject_besthit, and it was then able to finish. However, once the pipeline reached exonerate, that step also took a considerable amount of time. The genome I'm working with is around 371M, and the protein file is around 22M. Is this a normal amount of time for a genome and protein file of this size, and are there ways to ensure tblastn runs to completion? Thank you!

Answer 3 · 2024-07-29T16:52:07.000Z

eviprot is the longest part of the pipeline: it takes 2-3 days to align about 500Mb of protein sequences to a 2.5Gbp mouse genome on a 24-core Intel Xeon server. The -subject_besthit option will reduce sensitivity a lot. Aligning 22Mb of protein sequence to a 371Mbp genome should be relatively trivial, 2-3 hours at the most. What computer are you using (cores/RAM)?
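
As a rough sanity check on those numbers, the reference run can be scaled down to the smaller data set. The sketch below assumes that run time is proportional to protein size times genome size; that scaling model is an assumption made here for illustration, not something stated in the thread, and it ignores differences in core count and sequence content.

```python
def scaled_hours(ref_hours, ref_protein_mb, ref_genome_mbp, protein_mb, genome_mbp):
    """Scale a reference run time by protein size and genome size.

    Assumption (not from the thread): time grows in proportion to
    protein size multiplied by genome size.
    """
    return ref_hours * (protein_mb / ref_protein_mb) * (genome_mbp / ref_genome_mbp)


# Reference run: 2-3 days (48-72 h), 500 Mb of protein, 2,500 Mbp genome.
low = scaled_hours(48, 500, 2500, 22, 371)
high = scaled_hours(72, 500, 2500, 22, 371)
print(f"estimated {low:.2f} to {high:.2f} hours")
```

Under that assumption the estimate comes out below one hour, which is consistent with the stated ceiling of 2-3 hours and suggests that a run lasting many hours points to a stuck step rather than normal load.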

Answer 4 · 2024-07-29T17:30:31.000Z

I am running this on a 32-core server with 246 GB of RAM in total. For more detail: when running tblastn without -subject_besthit, it converted all the .tmp files to .out files except for one batch. That one batch is what kept tblastn running indefinitely.
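
To identify which batch is stuck, one option is to list the .tmp files that have no matching .out file. This is a minimal sketch: the helper name `unfinished_batches` is hypothetical, and it assumes that a batch named `X.tmp` yields `X.out` in the same directory, which the thread implies but does not spell out.

```python
from pathlib import Path


def unfinished_batches(workdir):
    """Return names of .tmp files in workdir that lack a matching .out file.

    Assumption: a finished batch X.tmp has a sibling X.out in the same
    directory. Hypothetical helper, not part of EviAnn.
    """
    workdir = Path(workdir)
    return sorted(
        tmp.name
        for tmp in workdir.glob("*.tmp")
        if not tmp.with_suffix(".out").exists()
    )
```

Running `unfinished_batches(".")` in the pipeline's working directory would then return only the batch that never completed, which can be inspected or rerun on its own.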

Answer 5 · 2024-07-29T18:31:11.000Z

Thank you, I will re-check running with this option (-subject_besthit). Maybe I am confusing it with something else, because according to its description it should not be harmful to the result.

Answer 6 · 2024-07-29T21:54:13.000Z

Thank you for checking in on this issue. I appreciate the help!

Answer 7 · 2024-07-30T12:17:55.000Z

I confirm that using the -subject_besthit option in tblastn does not affect the results. I will include it in the next release.

Answer 8 · 2024-07-30T13:52:29.000Z

Please check out the new release I just posted.