a lighter database?
GaioTransposon opened this issue · 5 comments
Hi there and thank you for the tool,
is there an option to download only part of the database?
The database (https://zenodo.org/record/5961398/files/db.tar.gz) is nearly 30 GB and takes about 12 hours to download. I installed bakta with conda and am running:
bakta_db download --output .
What if one wants to use only one of the DBs (e.g. UniProtKB/Swiss-Prot: 2021_04)?
Kind Regards
Dany
Hi Dany,
thanks for reaching out. Yes, DB size can be an issue for some users. Because we decided on a taxonomically untargeted approach and database, it has become fairly large.
The two largest parts of the DB are the PSC Diamond db (UniRef90 cluster representative sequences) and the SQLite db storing the ~200 million IPS sequence hashes (UniRef100) along with all pre-compiled annotations. Therefore, keeping just one annotation DB and excluding the others wouldn't reduce the DB size significantly.
One option to reduce the database size (that I have already thought about) is to compile sub-databases for certain phyla. Of course, that would require some development, implementation and testing, and would therefore take a while on a mid-term schedule. If this is of interest to more users, we'd happily address it.
Another option would be to host the database on more servers distributed around the globe, which might provide more bandwidth and better download times. Would that help in your case? Do you know of any free hosting services that would be suitable?
Best regards,
Oliver
Another idea (inspired by @tseemann) is to use a ranked set of broader protein clusters. This could be done by skipping the IPS and PSC from the normal database and using only a size-filtered subset of the PSCC.
A quick check on UniProt/UniRef50 revealed 2,660,356 UniRef50 proteins. I'd estimate this would reduce the entire database to, let's say, 3-4 GB.
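To illustrate the size-filtering idea, here is a minimal sketch in Python. The `Cluster` record, the cluster IDs, the member counts and the `min_members` threshold are all made up for illustration; none of them are taken from the Bakta database or its build scripts.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    """Hypothetical record for one broad protein cluster (e.g. a UniRef50 cluster)."""
    cluster_id: str
    member_count: int


def size_filtered_subset(clusters, min_members):
    """Keep clusters with at least `min_members` members, largest first."""
    kept = [c for c in clusters if c.member_count >= min_members]
    return sorted(kept, key=lambda c: c.member_count, reverse=True)


# Made-up example data, for illustration only.
example = [
    Cluster("cluster_A", 1200),
    Cluster("cluster_B", 3),
    Cluster("cluster_C", 45),
    Cluster("cluster_D", 1),
]

subset = size_filtered_subset(example, min_members=10)
print([c.cluster_id for c in subset])  # ['cluster_A', 'cluster_C']
```

Raising the threshold shrinks the subset further, so the final database size would depend on where the cut-off is placed.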
Hi @GaioTransposon,
FYI: you might be interested in v1.7.0, which introduces a light database version as described in #196.
This lightweight version is only 1.2 GB zipped and 3 GB unzipped.
EDIT: it was a faulty conda installation (I think scales was missing). It's working now :), and using the latest biocontainer build also works now :)
I just tried this with 1.7.0 but I get the following error (both via the bioconda install with the conda tool and via the corresponding singularity biocontainer):
$ bakta_db download --type light
Bakta software version: 1.7.0
Required database schema version: 5
fetch DB versions...
... compatible DB versions: 1
download database: v5.0, type=light, 2023-02-20, DOI: 10.5281/zenodo.7669534, URL: https://zenodo.org/record/7669534/files/db-light.tar.gz...
Traceback (most recent call last):
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 91, in validator
    result = CONFIG_VARS[key](value)
KeyError: 'scale'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jfellows/.conda/envs/bakta/bin/bakta_db", line 10, in <module>
    sys.exit(main())
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/bakta/db.py", line 203, in main
    download(db_url, tarball_path)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/bakta/db.py", line 119, in download
    with alive_bar(total=total_length, scale='SI') as bar:
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/progress.py", line 95, in alive_bar
    config = config_handler(**options)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 82, in create_context
    local_config.update(_parse(theme, options))
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 106, in _parse
    return {k: validator(k, v) for k, v in options.items()}
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 106, in <dictcomp>
    return {k: validator(k, v) for k, v in options.items()}
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 96, in validator
    raise ValueError('invalid config name: {}'.format(key))
ValueError: invalid config name: scale
Did I miss something in my command, for example?
Conda environment creation: conda create -n bakta -c bioconda bakta
Yes, the 3rd party dependencies needed an update. It should work now.
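For anyone hitting the same traceback: the ValueError is raised because the installed alive_progress rejects the scale keyword that bakta/db.py passes to alive_bar. A quick way to check the installed version is sketched below. Note the assumptions: that scale is accepted from alive-progress 3.0 onwards is my recollection of that library's changelog, not something stated in this thread, and supports_scale is a hypothetical helper, not part of Bakta.

```python
from importlib import metadata


def supports_scale(version: str, min_major: int = 3) -> bool:
    """Return True if the version's major number is >= min_major.

    Assumption: alive-progress accepts the `scale` option from 3.0 on.
    """
    major = version.split(".")[0]
    return major.isdigit() and int(major) >= min_major


def installed_alive_progress_version():
    """Return the installed alive-progress version string, or None if absent."""
    try:
        return metadata.version("alive-progress")
    except metadata.PackageNotFoundError:
        return None


version = installed_alive_progress_version()
if version is None:
    print("alive-progress is not installed in this environment")
elif supports_scale(version):
    print(f"alive-progress {version}: should accept the scale option")
else:
    print(f"alive-progress {version}: probably too old, try updating the environment")
```

If the check reports an old version, updating or recreating the conda environment should pull in the fixed dependencies.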