DaehwanKimLab/centrifuge

Database download taking a lot of disk space + taking too long

pablorr24 opened this issue · 3 comments

I'm trying to build the database for my metagenomics analysis and have run the following commands, but the download is taking too much disk space (bacteria alone was already at 35 GB with only 19% downloaded).

centrifuge-download -o taxonomy taxonomy
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map

I had to stop the download because I was running out of disk space. Aren't the databases supposed to take far less disk space? Can someone guide me on the right commands to create the database?

The current RefSeq microbiome database is huge, probably around 150 GB of nucleotide sequence. This is the raw sequence size before building the index. The final index is probably around 80 GB and requires about the same amount of memory to run. Do you plan to run Centrifuge on your local machine or on a server? You need a large-memory machine to create the index.
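
As a rough sanity check before starting a download of this size, the figures in this thread can be turned into a quick estimate. The sketch below is only illustrative: the helper names `projected_total_gb` and `has_enough_space` are hypothetical and not part of Centrifuge, and it assumes the download grows roughly linearly with the reported percentage, which may not hold exactly.

```python
import shutil


def projected_total_gb(downloaded_gb: float, fraction_done: float) -> float:
    """Extrapolate the final download size, assuming linear progress (an assumption)."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    return downloaded_gb / fraction_done


def has_enough_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes to decimal gigabytes
    return free_gb >= required_gb


if __name__ == "__main__":
    # Figures from the original post: 35 GB on disk at 19% of the bacteria download.
    estimate = projected_total_gb(35, 0.19)
    print(f"Projected bacteria download: about {estimate:.0f} GB")
    print("Enough free space here:", has_enough_space(".", estimate))
```

With the numbers reported above, the projection for bacteria alone comes out at roughly 184 GB, so a machine limited to 50-60 GB cannot hold the raw sequences, let alone build the index.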

Thanks for the quick response :)
I'm currently working on my own machine, so I have limited space. Is there a smaller database available, or any alternative that takes less than 50-60 GB?

You can try our newer method, Centrifuger: https://github.com/mourisl/centrifuger. We have a recently created (2023/06) index at https://zenodo.org/records/10023239. It is about 45 GB, though it includes the human genome as part of the index. It should take less than 50 GB of disk space and memory to run.