Unable to get MSA of conserved uORFS.

Question

Closed this issue 2 years ago · 14 comments

Dear uorf4u users,

I have an issue with the pipeline, and I don't know why. The installation went smoothly and is indeed very easy.

However, I am trying to run the pipeline with a nucleotide fasta file (bacterial chromosome sequence), and I get this error:

uorf4u -fa sequence.fasta

Fasta file with upstream sequences was saved to uorf4u_2022_10_31-08_57/upstream_sequences.fa.
🔎 ORFs annotating in the upstream sequences...
✅ 46630 ORFs were annotated.
🧹 13126 ORFs remained in the analysis after filtering by presence of the SD sequence.
💌 tsv files with information about annotated ORFs were saved to uorf4u_2022_10_31-08_57/annotated_ORFs folder.
🔎 Searching for conserved ORFs in upstream sequences...
✅ 13022 sets of conserved ORFs were found.
🧹 13022 set(s) of conserved ORFs remained in the analysis after filtering out duplicates.
🧮 Running MSA for conserved ORFs.
uORF4uError 💔: Unable to get MSA of conserved uORFS.

When I look at the tsv file, the columns 'extended_orfs' and 'annotation' are filled with 'NA' values, while the others seem to be completed normally.

Any idea of what could cause this error?

Thank you for the help!

Best,

Alexandre

Answer 1 · 2022-10-31T09:48:23.000Z

Hi!
Could you please run the same command with the -debug parameter to see the error in more detail? Sorry for the confusing messages without a traceback, we'll update them.
One idea: the MSA step uses the MAFFT version intended for distribution within other tools. If you're using a Linux machine, you need to run uorf4u --linux once to switch the path from the macOS mafft version to the Linux one.

In addition, the number of conserved sets looks strange. This can happen if you have only a few sequences (for example 2, or a few more that are extremely similar): then almost all annotated ORFs will be called conserved.

Answer 2 · 2022-10-31T12:13:30.000Z

Hi,

Thanks for the prompt answer. I am actually using macOS Monterey, so the --linux option should not be the issue.

With the debug option, I get the following (this is a different chromosome; I took a shorter one to speed things up, so don't worry about the ORF counts differing from the previous post):

💌 Fasta file with upstream sequences was saved to uorf4u_2022_10_31-11_17/upstream_sequences.fa.
🔎 ORFs annotating in the upstream sequences...
✅ 17695 ORFs were annotated.
🧹 4925 ORFs remained in the analysis after filtering by presence of the SD sequence.
💌 tsv files with information about annotated ORFs were saved to uorf4u_2022_10_31-11_17/annotated_ORFs folder.
🔎 Searching for conserved ORFs in upstream sequences...
✅ 4924 sets of conserved ORFs were found.
🧹 4924 set(s) of conserved ORFs remained in the analysis after filtering out duplicates.
🧮 Running MSA for conserved ORFs.
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/uorf4u/data_processing.py", line 920, in run_msa
path.maft_msa()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/uorf4u/data_processing.py", line 1450, in maft_msa
msa = Bio.AlignIO.read(temp_output.name, "fasta")
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/Bio/AlignIO/__init__.py", line 388, in read
raise ValueError("No records found in handle") from None
ValueError: No records found in handle

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/bin/uorf4u", line 42, in <module>
upstream_seqs.run_msa()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/uorf4u/data_processing.py", line 923, in run_msa
raise uorf4u.manager.uORF4uError("Unable to get MSA of conserved uORFS.") from error
uorf4u.manager.uORF4uError: Unable to get MSA of conserved uORFS.

To me this sounds like a module / package / version issue with mafft. So I went to .../uorf4u_data/bin/mafft-mac/mafftdir/bin/mafft to check, and I got this message:

v0.000 != v7.505 (2022/Apr/10)

There is a problem in the configuration of your shell.
Check the MAFFT_BINARIES environmental variable by
$ echo $MAFFT_BINARIES

This variable must be unset, unless you have installed MAFFT
with a special configuration. To unset this variable, type
$ unset MAFFT_BINARIES
or
% unsetenv MAFFT_BINARIES
Then retry
$ mafft input > output

To keep this change permanently, edit setting files
(.bash_profile, .profile, .cshrc, etc) in your home directory
to delete the MAFFT_BINARIES line.
On MacOSX, also edit or remove the .MacOSX/environment.plist file
and then re-login (MacOSX 10.6) or reboot (MacOSX 10.7).

Please send a problem report to katoh@ifrec.osaka-u.ac.jp,
if this problem remains.

The issue here is that I never installed mafft before, and it is nowhere in my path. When I run

$ echo $MAFFT_BINARIES

I get no results.

The problem might then be in how uorf4u calls mafft?
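
For anyone wanting to double-check the same shell state from Python, here is a minimal sketch. It only inspects the MAFFT_BINARIES variable named in the MAFFT message above and looks for a `mafft` executable on PATH. The helper name `mafft_environment_report` is made up for this example and is not part of uorf4u or MAFFT.

```python
import os
import shutil


def mafft_environment_report(env=None):
    """Summarise MAFFT-related shell state (hypothetical helper, not part of uorf4u)."""
    env = os.environ if env is None else env
    return {
        # The MAFFT message asks for this variable to be unset in a standard setup.
        "MAFFT_BINARIES": env.get("MAFFT_BINARIES"),
        # Path of a mafft executable found on PATH, or None if there is none.
        "mafft_on_path": shutil.which("mafft", path=env.get("PATH")),
    }


if __name__ == "__main__":
    for key, value in mafft_environment_report().items():
        print(f"{key}: {value!r}")
```

If both values come back as `None`, the shell has no separate MAFFT setup that could interfere, which matches the empty `echo $MAFFT_BINARIES` result above.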

Thanks,

Best,

Alexandre

Answer 3 · 2022-10-31T12:27:29.000Z

Hi again 👋🏻

Maybe this macOS version works only on apple silicon... Sorry, we haven't tested it.

Basically, uorf4u directly calls the binary file ./uorf4u_data/bin/mafft-mac/mafft.bat, using subprocess for that. So even if you have mafft installed somewhere else, it shouldn't cause any problems.

It's fine that you get this funky output when running the mafft script at path/uorf4u_data/bin/mafft-mac/mafftdir/bin/mafft directly: this is the mafft version that is redistributable within other packages, and all interaction with the tool goes through the mafft.bat file.
So:
You can check whether mafft works with the following commands:
uorf4u --data (to copy uorf4u_data folder to your working directory)
./uorf4u_data/bin/mafft-mac/mafft.bat --help (to check if macOS version works)
If it doesn't, try 'linux' version with:
./uorf4u_data/bin/mafft-linux64/mafft.bat
If the 'linux' version works, then simply run uorf4u --linux once and everything should be fine. If it doesn't, let me know and we'll fix it!

Cheers,
Artyom

Answer 4 · 2022-10-31T15:20:56.000Z

Hi again !

Thanks for the time you're devoting to this, the help is much appreciated!

So, all these commands actually seem to work. I also tried the --linux option, and I got the same error:

--> 🧮 Running MSA for conserved ORFs.
uORF4uError 💔: Unable to get MSA of conserved uORFS.

To understand this a bit better, I also ran the program on a different computer running Linux (Ubuntu). There, I applied the --linux parameter before running the program. However, I got the exact same error.

I also tried with 3 different sequences, just to be certain that my original sequence was not the problem, but I got the same error again.

Any ideas about what is causing this?

Best,

Alexandre

Answer 5 · 2022-10-31T20:06:34.000Z

Hm... looks like magic..
I also tried the current version on different, unconnected machines and everything is fine.

You can see above that this problem can only be caused by incorrect output (an error) from mafft (in this example, the macOS version on a Linux machine returns exactly the same). So in my tests this problem occurs only if the wrong mafft version is used.
Things we can try to figure out what's happening on yours (I would recommend testing on Linux, since we still have to test the tool on an Intel Mac to be sure it works there as well):

Try to run everything from a virtual environment (the standard environment is sometimes quite messy):
python3 -m venv env_uorf4u - create
source env_uorf4u/bin/activate - activate
Install uorf4u:
python3 -m pip install uorf4u
If you run on a Linux machine, then:
uorf4u --linux
Copy the uorf4u data to your working dir:
uorf4u --data #(will create the uorf4u_data folder)
In this uorf4u_data folder, please check whether the line with the mafft path in the uorf4u_prokaryotes.cfg file is correct:
maft_binary = {internal}/bin/mafft-linux64/mafft.bat
Rerun your command; you can also specify this copied config file with the -c parameter:
uorf4u -fa your_fasta -c uorf4u_data/uorf4u_prokaryotes.cfg -debug
If the previous step doesn't work, test mafft again manually:
./uorf4u_data/bin/mafft-linux64/mafft.bat --auto your_fasta > test_aln.fa
(I'm not sure that getting a help message when running it without arguments means it works in general. I tested -h as well and it often works, while running with real data returned errors. I think it will be the same for you.)
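
The config check in the steps above can be sketched in Python. This is only an illustration under a few assumptions: the folder names `mafft-linux64` and `mafft-mac` are taken from the paths quoted in this thread, the `maft_binary` key is the one shown above, and the helper names `read_setting` and `mafft_path_matches_os` are invented for this example (they are not uorf4u functions). The config is scanned line by line, so nothing is assumed about its section layout.

```python
import platform
import re

# Folder names as they appear in the paths quoted in this thread.
BUNDLED_MAFFT_DIRS = {"Linux": "mafft-linux64", "Darwin": "mafft-mac"}


def read_setting(cfg_text, key):
    """Return the value of the first `key = value` line, or None if absent."""
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*=\s*(.+?)\s*$")
    for line in cfg_text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None


def mafft_path_matches_os(cfg_text, system=None):
    """Check that the configured mafft path points at the build for this OS."""
    system = platform.system() if system is None else system
    value = read_setting(cfg_text, "maft_binary")
    expected = BUNDLED_MAFFT_DIRS.get(system)
    if value is None or expected is None:
        return False
    return f"/{expected}/" in value
```

To use it, read the copied uorf4u_data/uorf4u_prokaryotes.cfg into a string and pass it to `mafft_path_matches_os`; a `False` result on Linux would suggest that uorf4u --linux did not update the path.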

Let's hope everything will work.
You can also try to run the tool with this command from the manual:
uorf4u -hl WP_202338192.1 WP_102227852.1 WP_034984371.1 WP_159316313.1 WP_095341278.1 WP_150861853.1 WP_011382144.1 WP_081624258.1 -c prokaryotes -annot -ul 400

Thank you very much for reporting the issue and helping to figure out the cause of this bug, which we hadn't caught on our machines.

Best

Answer 6 · 2022-10-31T20:32:50.000Z

‼️
I also updated uorf4u (new version 0.6.4) and replaced mafft version 7.505 with mafft 7.490. It was a mistake to include 7.505 since it's still being tested; the last stable mafft version is 7.490. I hope it helps. So, during installation, please check that you have uorf4u version 0.6.4+.

yours 🙂

Answer 7 · 2022-11-01T14:16:30.000Z

Hi,

Thanks for putting the effort into this issue.

In the macOS environment (venv_uorf4u), I followed your instructions and it finally worked!

uorf4u -fa test.fa -c uorf4u_prokaryotes.cfg -debug
Fontconfig warning: ignoring UTF-8: not a valid region tag
💌 Fasta file with upstream sequences was saved to uorf4u_2022_11_01-14_41/upstream_sequences.fa.
🔎 ORFs annotating in the upstream sequences...
✅ 45 ORFs were annotated.
🧹 11 ORFs remained in the analysis after filtering by presence of the SD sequence.
💌 tsv files with information about annotated ORFs were saved to uorf4u_2022_11_01-14_41/annotated_ORFs folder.
🔎 Searching for conserved ORFs in upstream sequences...
✅ 11 sets of conserved ORFs were found.
🧹 11 set(s) of conserved ORFs remained in the analysis after filtering out duplicates.
🧮 Running MSA for conserved ORFs.
💌 Sequences fasta files of conserved ORFs were saved to the folders:
uorf4u_2022_11_01-14_41/nucleotide_orfs_fasta_files,
uorf4u_2022_11_01-14_41/amino_acid_orfs_fasta_files.
💌 MSA fasta files of conserved ORFs were saved to the folders:
uorf4u_2022_11_01-14_41/nucleotide_msa_fasta_files,
uorf4u_2022_11_01-14_41/amino_acid_msa_fasta_files,
uorf4u_2022_11_01-14_41/sd_msa_fasta_files folders.
💌 Results summary tsv table saved to: uorf4u_2022_11_01-14_41/results_summary.tsv.
🎨 Loci annotations figures plotting...
💌 Annotation figures were saved to the folder: uorf4u_2022_11_01-14_41/annotation_visualisation
🎨 Sequence logo figures plotting...
...

However, on the Linux computer, even with the virtual environment, it did not work. There are two possible explanations. First, I am using a "real" sequence and not a subset on the Linux machine, which could maybe cause some issue (I really doubt that). Second, and more probably, I noticed that the mafft versions are actually different with uorf4u v0.6.4:

uorf4u_data/bin/mafft-linux64/mafftdir/bin//mafft

v0.000 != v7.505 (2022/Apr/10)

uorf4u_data/bin/mafft-mac/mafftdir/bin/mafft

v0.000 != v7.490 (2021/Oct/30)

So could it actually be a mafft version problem which causes the trouble?

In any case, I will run the "real" (i.e. bigger) dataset with the macOS version, to see whether it crashes again or not.

Thank you again,
Best,

Alexandre

Answer 8 · 2022-11-01T14:22:37.000Z

Hi!

Great that it works on Mac now! ✨ Thanks! (It seems it was exactly because of the alpha version of mafft we used before.)
About Linux: are you also running it from a virtual env? Maybe uorf4u can't override the mafft path in the configuration files because of some permission problem...?
Did you check the mafft path line in the config files that you can copy to your working dir with uorf4u --data?
I think if it wasn't changed, that would explain everything.

Best,
Artyom

Answer 9 · 2022-11-01T14:31:31.000Z

Hi again,

Thank you for your help, I am quite happy it at least runs on macOS now!

Yes, I am also running in a virtual environment on Linux, and I did copy the files with uorf4u --data and also checked the config file. I also checked the version and it is indeed 0.6.4, which confirms the "new installation" compared to yesterday's. But when I go into /mafft-linux, it is still version 7.505, while /mafft-macos is 7.490.

Best,

Alex

Answer 10 · 2022-11-01T15:40:11.000Z

Yep, that's true.. It seems that it was accidentally replaced with the same Linux version.
We now have a new version, 0.7.0, with major NCBI db parsing updates. It also includes the proper Linux version of mafft. Let's hope the problem on Linux has the same cause)
If you're not tired of running the same script, we'd be really thankful if you found time to test it 😀 I tested on 3 Linux machines and it works, but none of them was Ubuntu.
Thanks again!

Best,
Artyom

Answer 11 · 2022-11-02T08:56:14.000Z

Hello,

No worries, I will try version 0.7.0 on Linux later today and let you know how it goes.

However, I do have some other "bad" news. As I wrote yesterday, the program worked on the small subset. However, it crashes again when I use longer sequences. I re-checked with the exact same command line, version, and virtual environment: it works again for the subset but not for the full-length sequence.

####Subset Run#####

🔎 ORFs annotating in the upstream sequences...
✅ 20 ORFs were annotated.
🧹 4 ORFs remained in the analysis after filtering by presence of the SD sequence.
💌 tsv files with information about annotated ORFs were saved to uorf4u_2022_11_02-09_48/annotated_ORFs folder.
🔎 Searching for conserved ORFs in upstream sequences...
✅ 4 sets of conserved ORFs were found.
🧹 4 set(s) of conserved ORFs remained in the analysis after filtering out duplicates.
🧮 Running MSA for conserved ORFs.
💌 Sequences fasta files of conserved ORFs were saved to the folders:
uorf4u_2022_11_02-09_48/nucleotide_orfs_fasta_files,
uorf4u_2022_11_02-09_48/amino_acid_orfs_fasta_files.
💌 MSA fasta files of conserved ORFs were saved to the folders:
uorf4u_2022_11_02-09_48/nucleotide_msa_fasta_files,
uorf4u_2022_11_02-09_48/amino_acid_msa_fasta_files,
uorf4u_2022_11_02-09_48/sd_msa_fasta_files folders.
💌 Results summary tsv table saved to: uorf4u_2022_11_02-09_48/results_summary.tsv.
🎨 Loci annotations figures plotting...
[...]

####Chromosome Run #####

uorf4u -c uorf4u_prokaryotes.cfg -fa -chrom.fa -debug
Fontconfig warning: ignoring UTF-8: not a valid region tag
💌 Fasta file with upstream sequences was saved to uorf4u_2022_11_01-11_41/upstream_sequences.fa.
🔎 ORFs annotating in the upstream sequences...
✅ 64325 ORFs were annotated.
🧹 18051 ORFs remained in the analysis after filtering by presence of the SD sequence.
💌 tsv files with information about annotated ORFs were saved to uorf4u_2022_11_01-11_41/annotated_ORFs folder.
🔎 Searching for conserved ORFs in upstream sequences...
✅ 12302 sets of conserved ORFs were found.
🧹 12302 set(s) of conserved ORFs remained in the analysis after filtering out duplicates.
🧮 Running MSA for conserved ORFs.
Traceback (most recent call last):
File "/env_uorf4u/lib/python3.8/site-packages/uorf4u/data_processing.py", line 920, in run_msa
path.maft_msa()
File "/env_uorf4u/lib/python3.8/site-packages/uorf4u/data_processing.py", line 1450, in maft_msa
msa = Bio.AlignIO.read(temp_output.name, "fasta")
File "/env_uorf4u/lib/python3.8/site-packages/Bio/AlignIO/__init__.py", line 388, in read
raise ValueError("No records found in handle") from None
ValueError: No records found in handle

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/lemopoul/Documents/Unix/env_uorf4u/bin/uorf4u", line 42, in <module>
upstream_seqs.run_msa()
File "/Users/lemopoul/Documents/Unix/env_uorf4u/lib/python3.8/site-packages/uorf4u/data_processing.py", line 923, in run_msa
raise uorf4u.manager.uORF4uError("Unable to get MSA of conserved uORFS.") from error
uorf4u.manager.uORF4uError: Unable to get MSA of conserved uORFS
#######

Can this be a memory usage issue? My sequence is a full-length bacterial chromosome, so ~4MB.

Sorry for causing so much trouble, and thanks for the help !

Best,

Alexandre

Answer 12 · 2022-11-02T10:39:29.000Z

Hi again!

Actually, input with one sequence is a good test for the tool, but it can't be used to get anything interesting.
I will clarify this in the documentation because it can be confusing. See our workflow figure and the example-driven guide in the documentation for usage examples, or ask any questions here or in the discussion section :)
In short:
The tool searches for conserved ORFs that are present in a set of sequences: an ORF is called conserved when the number of similar ORFs (one per sequence) is above a cutoff, by default >40% of the sequences. So, the current version just annotates all ORFs in this one sequence and calls them conserved. Now it returns an error if the input is just one sequence, but it still returns a table with the annotated ORFs.
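
To see why a single input sequence makes every ORF "conserved", here is a minimal sketch of the counting rule only. It assumes the cutoff is a fraction of the input sequences (0.4 by default, as described above) and does not model how uorf4u actually searches for similar ORFs; the function name `is_conserved` is hypothetical.

```python
def is_conserved(sequences_with_similar_orf, total_sequences, cutoff=0.4):
    """Return True if a similar ORF occurs in more than `cutoff` of the sequences."""
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    return sequences_with_similar_orf / total_sequences > cutoff


# One input sequence: each ORF is found in 1 of 1 sequences (100%).
print(is_conserved(1, 1))    # True
# Ten input sequences, similar ORF seen in three of them (30%).
print(is_conserved(3, 10))   # False
# Ten input sequences, similar ORF seen in five of them (50%).
print(is_conserved(5, 10))   # True
```

With one chromosome as input, every annotated ORF trivially passes the cutoff, which is consistent with the nearly identical "annotated after filtering" and "conserved sets" counts in the logs earlier in this thread.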

In addition, the tool is not meant for such long sequences (for example, with an input of several chromosomes it will be extremely inefficient, since the tool uses a greedy search algorithm. There are other tools for ORF annotation in long sequences, based on other methods (see our article)).
The recommended sequence length is ~100-1000, with ~10-700 sequences.
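
A quick pre-flight check of an input FASTA against these rough ranges could look like the sketch below. The thresholds are just the approximate recommendations quoted above, not limits enforced by uorf4u, and the helper names `read_fasta_lengths` and `input_warnings` are invented for this example.

```python
def read_fasta_lengths(text):
    """Return {header: sequence length} for FASTA-formatted text."""
    lengths = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)
    return lengths


def input_warnings(lengths, min_len=100, max_len=1000, min_seqs=10, max_seqs=700):
    """List reasons why the input falls outside the ranges quoted above."""
    warnings = []
    if len(lengths) < 2:
        warnings.append("fewer than two sequences: conservation cannot be assessed")
    elif not min_seqs <= len(lengths) <= max_seqs:
        warnings.append(f"{len(lengths)} sequences (suggested ~{min_seqs}-{max_seqs})")
    for name, length in lengths.items():
        if not min_len <= length <= max_len:
            warnings.append(f"{name}: length {length} (suggested ~{min_len}-{max_len})")
    return warnings


# A single long record, standing in for a whole-chromosome input.
example = ">chromosome\n" + "ACGT" * 1000
for message in input_warnings(read_fasta_lengths(example)):
    print(message)
```

Note that records sharing the same header line would be merged by this simple parser, so it is only suitable as a rough sanity check.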

About the error: it's strange, especially if it worked before and if mafft works in general. I'm not sure it can be a memory problem. But the run was meaningless anyway, since it was aligning one sequence (this won't happen now).
In the new version, if mafft crashes, uorf4u will also return mafft's message so you can see what's happening inside.
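
The general idea of passing an external tool's message through can be sketched as follows. This is not the uorf4u implementation, only a generic pattern using the standard `subprocess` module; `run_aligner` and `AlignerError` are hypothetical names, and a small Python one-liner stands in for mafft so the sketch runs without MAFFT installed.

```python
import subprocess
import sys


class AlignerError(RuntimeError):
    """Raised when the external aligner fails or produces no alignment."""


def run_aligner(command):
    """Run an external aligner and return its stdout, surfacing stderr on failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise AlignerError(f"exit code {result.returncode}: {result.stderr.strip()}")
    if not result.stdout.strip():
        # Assumption: an empty result like this is one way to end up with the
        # "No records found in handle" ValueError seen in the tracebacks above.
        raise AlignerError(f"no output produced: {result.stderr.strip()}")
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a crashing aligner.
    fake = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(1)"]
    try:
        run_aligner(fake)
    except AlignerError as error:
        print(error)
```

Checking both the exit code and the emptiness of the output means the tool's own complaint reaches the user before any alignment parser is involved.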

Best,
Artyom

Answer 13 · 2022-11-03T10:33:44.000Z

Hi again,

Ok, I get it. Too bad, I was thinking it could be a great tool for this type of use.

So, about the Linux version: sorry, I said I would run it yesterday but got caught up with other stuff. Anyhow, I ran uorf4u v.0.8.0 on the Linux computer (Ubuntu), in a virtual environment, as follows:

uorf4u -fa test.fasta -c uorf4u_bacteria.cfg -debug

And it is not working here. It actually cannot find the paths of palettes, nt_colors etc..., as they are specified in the config file.

However, when running

uorf4u -fa test.fasta -debug

it works nicely. And I tried running it with only one sequence and got the subsequent error message. When running with 2 sequences, it worked, so this will be clear for future users.

Best,

Alexandre

Answer 14 · 2022-11-03T10:53:48.000Z

Hi, that's fine :) But I didn't fully get what you expected to get with one big fasta sequence?
Yep, with the uorf4u -fa test.fasta -c uorf4u_bacteria.cfg -debug command there is a problem with paths: the config file should be located in the uorf4u_data folder. The reason is that paths to palettes and other things are relative to the config by default; the main idea is that when you copy the uorf4u_data folder you can safely change anything inside it without affecting the internal set of files. If you take only a config file and put it in another folder, it will crash. It can be fixed manually by specifying new paths in the config file. We'll make this clear in the documentation, thanks!
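
The effect of moving only the config file can be illustrated with a small sketch. It assumes, as described above, that resource paths are resolved relative to the folder containing the config; the entry name `palettes`, the file name `my.cfg`, and the helper `resolve_relative_to_config` are hypothetical and only serve the illustration.

```python
import tempfile
from pathlib import Path


def resolve_relative_to_config(config_path, relative_entries):
    """Resolve each entry against the config file's folder and flag missing ones."""
    base = Path(config_path).resolve().parent
    resolved = {name: base / rel for name, rel in relative_entries.items()}
    missing = [name for name, path in resolved.items() if not path.exists()]
    return resolved, missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "uorf4u_data"
        (data_dir / "palettes").mkdir(parents=True)
        entries = {"palettes": "palettes"}  # hypothetical entry name and layout

        # Config inside the data folder: the relative path resolves.
        print(resolve_relative_to_config(data_dir / "my.cfg", entries)[1])
        # Config copied elsewhere on its own: the same entry is now missing.
        print(resolve_relative_to_config(Path(tmp) / "my.cfg", entries)[1])
```

Keeping the config next to the resources it refers to, or rewriting its paths after moving it, avoids the missing-path failure reported above.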