iquasere/UPIMAPI

Download databases in a certain path

AhmedElsherbini opened this issue · 7 comments

Hi authors,

Thank you for this interesting tool,

I am trying to use your tool with the Conda environment on a cluster computer.

So, as I had not downloaded the databases for this tool, I just used this command for all of my bacterial genome collection:

conda activate /beegfs/work/tu_bcoea01/upimapi


for d in *.faa ; do f=$(echo $d | sed -E "s/\.faa*//") ; mkdir $f ; upimapi -i $f.faa -o ./$f_U -t 28 ; done



I was expecting the tool to build the upimapi_resources folder either in the folder with the FAA files or in the Conda environment's directory. Unfortunately, the tool started building the databases folder in my main home directory (where I have a limited quota for disk space), leading to this error:

Traceback (most recent call last):
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 1314, in <module>
    upimapi()
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 1238, in upimapi
    build_reference_database(
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 673, in build_reference_database
    download_uniprot(output_folder, mirror=mirror)
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 663, in download_uniprot
    download_with_progress_bar(
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 648, in download_with_progress_bar
    with open(f'{output_folder}/{url.split("/")[-1]}', 'wb') as file:
OSError: [Errno 122] Disk quota exceeded

I tried to see if there is a way to force the tool to download the database to a certain path, but I did not find one.
For SwissProt, I see it is 91.6M, so no problem. But for UniProt, I see 61.3G, which makes things problematic.

Could there be a way to set a preferred path for building the UniProt database?

Hey there! The parameter you are looking for is --resources-directory (shortened to -rd). Specifying it with -rd path/to/resources_folder will tell UPIMAPI to store the databases in that folder. UniProt takes over 100 GB (mostly from TrEMBL), so putting it on home is often not the best solution, as you said.

Servers often have most of their space in mounts on HDD storage. I actually had the exact same problem, so I introduced this parameter. You can specify any folder for storing the resources, as long as it doesn't already hold files with the same names, because otherwise problems might arise. Better to use an empty folder.
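
For example, adapted to a single input (a sketch; the resources path here is just illustrative, any folder on your work mount will do):

upimapi -i genome.faa -o genome_out -t 28 -rd /beegfs/work/tu_bcoea01/upimapi_resources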

Do let me know if this doesn't solve your problem!

Hello,
Thanks for your prompt response,
So the argument helped move the database files to the desired directory.
However, the tool did not work as expected with this command:

for d in *.faa ; do f=$(echo $d | sed -E "s/\.faa*//") ; mkdir $f ; upimapi -i $f.faa -o ./$f_U -t 28 --resources-directory /beegfs/work/tu_bcoea01/upimapi/db_u -db uniprot ; done

I got this in the middle of the log file:

2023-06-26 15:53:02: 2218 UniProt IDs identified as valid.
./uniprotinfo.tsv was found. Will perform mapping for the remaining IDs.
IDs present in uniprotinfo file: 1180
IDs missing: 2218
Information already gathered for 1180 ids. Still missing for 2218.
Retrieving UniProt information from 2218 IDs:   0%|          | 0/3 [00:00<?, ?it/s]
Retrieving UniProt information from 2218 IDs:  33%|███▎      | 1/3 [00:03<00:07,  3.91s/it]
Retrieving UniProt information from 2218 IDs:  67%|██████▋   | 2/3 [00:08<00:04,  4.03s/it]
Retrieving UniProt information from 2218 IDs: 100%|██████████| 3/3 [00:11<00:00,  3.77s/it]
Retrieving UniProt information from 2218 IDs: 100%|██████████| 3/3 [00:11<00:00,  3.83s/it]
Results for all IDs are available at ./uniprotinfo.tsv
2023-06-26 15:53:14: UPIMAPI analysis finished in 03h51m34s
/beegfs/work/tu_bcoea01/upimapi/db_u/uniprot.fasta exists. Overwrite? [Y/N] Traceback (most recent call last):
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 1314, in <module>
    upimapi()
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 1237, in upimapi
    if must_build_database(args.database, args.resources_directory):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 689, in must_build_database
    return str2bool(input(f'{resources_folder}/{db2suffix[database]} exists. Overwrite? [Y/N] '))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: EOF when reading a line

This means that the loop is problematic. I did indeed get a TSV file, but not in the output folder of the first isolate.

  • How do you suggest modifying the loop?

** Just to first give you a hint of what I am doing: I would like to compare two close bacterial species with your approach, to see what the difference between them is and link it to their niche. I have 25 isolates of species A and 30 of species B; the 55 FAA protein files from Prokka are all in one folder.

I would also ask for your suggestions or advice. Why? The outputs of your tools (UPIMAPI, reCOGnizer, or KEGGCharter) shall be merged for species A and species B, to get a clean comparison between the two species. Would that be feasible in your opinion?

The problem of EOFError: EOF when reading a line can be solved by setting the parameter --skip-db-check, which will simply skip checking whether uniprot exists, instead of prompting the user to accept (or not) the existing file as the database. Since you are running a loop, you can't answer Y/N, so this parameter is ideal there.
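
In your loop, that means appending the flag to the upimapi call you already have, i.e.:

upimapi -i $f.faa -o ./$f_U -t 28 --resources-directory /beegfs/work/tu_bcoea01/upimapi/db_u -db uniprot --skip-db-check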

About UPIMAPI not creating the files in the specified output folder ./$f_U, where is it putting those files then?

About your approach, it seems fine. These tools get more information than any others, even though they might take a bit longer to run (because the databases are huge). I might advise you to run UPIMAPI with the -db taxids parameter, inputting the appropriate tax IDs with --taxids TAX1,TAX2,.... These parameters build a database from the tax IDs specified, and considering you have close bacterial species, it might make more sense to build a database just for the family or genus in question. This can also speed annotation up by a factor of thousands; see the sketch below.
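
For example (a sketch; replace TAX1,TAX2 with the actual NCBI tax IDs of your family or genus):

upimapi -i genome.faa -o genome_out -t 28 -rd /beegfs/work/tu_bcoea01/upimapi/db_u -db taxids --taxids TAX1,TAX2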

If, after running UPIMAPI, you want to run reCOGnizer (which I would advise, as it gets complementary information with a different methodology), I can guide you in using the taxonomic information obtained with UPIMAPI to better tailor reCOGnizer's databases (in unpublished results, I found this gets even better results, but it is still a somewhat confusing approach).

Running KEGGCharter afterwards might be interesting, but it really depends on what KEGGCharter is able to represent. I would always advise experimenting with it, since its results are validated by KEGG (even more so if your species are available in KEGG Genomes), and it may let you see some function that one organism lacks and the other has.

Thanks, João, for your fast response.
For the output, I think I dropped the ball on the mkdir name; it should have used the _U suffix too (and in bash, $f_U is actually parsed as the variable f_U, so it would need to be written ${f}_U).
Anyhow, the files ended up in the same dir as the .faa files.

So, if both my species belong to the genus Corynebacterium, my command shall be...

for d in *.faa ; do f=$(echo $d | sed -E "s/\.faa*//") ; mkdir $f ; upimapi -i $f.faa -o ./$f -t 28 --resources-directory /beegfs/work/tu_bcoea01/upimapi/db_u -db taxids  --taxids Bacteria,Actinomycetota,Actinomycetes,Mycobacteriales,Corynebacteriaceae,Corynebacterium --skip-db-check ; done

This also seems not to work:

#CPU threads: 28
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /beegfs/work/tu_bcoea01/upimapi/db_u/taxids_database.fasta
Opening the database file... 
Error opening file /beegfs/work/tu_bcoea01/upimapi/db_u/taxids_database.fasta: No such file or directory
diamond makedb --in /beegfs/work/tu_bcoea01/upimapi/db_u/taxids_database.fasta -d /beegfs/work/tu_bcoea01/upimapi/db_u/taxids_database.dmnd
Traceback (most recent call last):
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 1314, in <module>
    upimapi()
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 1243, in upimapi
    make_diamond_database(database, f"{'.'.join(database.split('.')[:-1])}.dmnd")
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 565, in make_diamond_database
    run_command(f'diamond makedb --in {fasta} -d {dmnd}')
  File "/beegfs/work/tu_bcoea01/upimapi/bin/upimapi", line 548, in run_command
    run(bash_command.split(), check=True)
  File "/beegfs/work/tu_bcoea01/upimapi/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['diamond', 'makedb', '--in', '/beegfs/work/tu_bcoea01/upimapi/db_u/taxids_database.fasta', '-d', '/beegfs/work/tu_bcoea01/upimapi/db_u/taxids_database.dmnd']' returned non-zero exit status 1.

The error happens because the --skip-db-check parameter makes UPIMAPI not check for the existence of the database. Therefore, if you use the parameter on the first run - now with a different database, i.e., taxids_database.fasta - UPIMAPI won't find the database and will throw this error. Instead, first run UPIMAPI without that parameter, and only add it in further runs: UPIMAPI will build the TaxIDs database on the first run, and reuse it afterwards.
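
As a pattern, that would be something like this (a sketch; the file names are placeholders, and I am keeping the --taxids arguments identical across runs):

# First run builds the TaxIDs database (no --skip-db-check):
upimapi -i first_genome.faa -o first_out -t 28 -rd /beegfs/work/tu_bcoea01/upimapi/db_u -db taxids --taxids TAX1,TAX2
# Further runs reuse it:
upimapi -i next_genome.faa -o next_out -t 28 -rd /beegfs/work/tu_bcoea01/upimapi/db_u -db taxids --taxids TAX1,TAX2 --skip-db-check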

However... that genus has over 500,000 sequences in TrEMBL. Therefore, instead of making UPIMAPI download those sequences (which it would do through UniProt's API), I might advise you to download them yourself, through a browser. Then, you can give UPIMAPI the FASTA file as the database through the parameter -db /path/to/database.fasta.

Why should you manually download the database appropriate for your taxonomy, instead of having UPIMAPI do it? Because UniProt's API is going through some changes and has gotten very slow recently. So if you ask UPIMAPI to build a taxIDs database with more than a few tens of thousands of sequences, it will take a long time.

So, finally, I got it done.
A small observation: using the UniProt API, the results for the tax ID number did not match those for the name of the genus; the genus name returned more data, so I used the genus name.

As for the run, it worked first on one isolate, and then I looped over all of them:

upimapi -i CA_A_4977.faa  -o CA_A_4977  -t 28 -db /beegfs/work/tu_bcoea01/upimapi/db_u/cor_uniprot.fasta
for d in *.faa ; do f=$(echo $d | sed -E "s/\.faa*//") ; mkdir $f ; upimapi -i $f.faa -o ./$f -t 28 -db /beegfs/work/tu_bcoea01/upimapi/db_u/cor_uniprot.fasta --skip-db-check ; done

For computational needs: with 64 GB of RAM and 28 processors, 53 minutes were needed to loop over the 55 isolates, with no problems on any of them.

As for the results: out of 2226 proteins, I got annotations for 756; maybe this corresponds to the core proteins. Let's see.

I would just suggest, in the future, giving the output files unique names matching the input files, so I can move them easily to one folder. For now, each file sits in its own folder instead.

So, the next step will be running reCOGnizer on the .faa files; you mentioned you could give some guidance here. Waiting for this.

Glad it worked out for you!

So, reCOGnizer offers some parameters for using previously gathered taxonomic information. This is still not well documented, as it hasn't been tested in a long time, but you can give the --tax-file, --tax-col and --protein-id-col parameters of reCOGnizer the information from UPIMAPI's output, i.e.,

--tax-file ./$f/UPIMAPI_results.tsv --tax-col "Taxonomic lineage IDs (SPECIES)" --protein-id-col qseqid --species-taxids

The --species-taxids parameter just informs reCOGnizer that these are tax IDs of species, so the COG database can also be tailored to them.

These parameters indicate, for each input protein, which taxonomy it belongs to. reCOGnizer will then build its databases separated by taxonomy and, for each tax ID, annotate the corresponding proteins with the database of that tax ID and of all parent tax IDs.
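
Put together inside your loop, a call might look like this (a sketch: the --tax-file, --tax-col, --protein-id-col and --species-taxids parameters are the ones above, while the recognizer executable name and the -f/-o/-t flags are assumptions on my part; check recognizer --help for the exact names):

for d in *.faa ; do f="${d%.faa}" ; recognizer -f "$f.faa" -o "./${f}_rec" -t 28 --tax-file "./$f/UPIMAPI_results.tsv" --tax-col "Taxonomic lineage IDs (SPECIES)" --protein-id-col qseqid --species-taxids ; done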