The second file of the custom db is empty
LilyAnderssonLee opened this issue · 4 comments
Hi, I am in the process of building a database using RefSeq data that covers bacteria, viruses, archaea, fungi, parasites, protozoa, plasmids and even contaminants. The input data is quite large, around 1.3TB in size.
However, I've run into an issue where the second file, db.2.cf, always turns out empty. Has anyone else had this problem? Here is the code I've been using:
#!/bin/bash
#SBATCH -A xx
#SBATCH -p core
#SBATCH -n 50
#SBATCH -t 10-00:00:00
#SBATCH -J centrifuge_db
#SBATCH --mem=400GB
centrifuge-build -p 50 --bmax 3342177280 --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
    input-sequences.fna db
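To confirm which index files came out empty after a build, a small check like the one below may help. It is only a sketch: `check_index` is a hypothetical helper, and it assumes the index files follow the `<prefix>.<N>.cf` naming seen with db.2.cf above.

```shell
#!/bin/bash
# Report index files that are zero bytes for a given index prefix.
# Assumption: index files are named <prefix>.<N>.cf (e.g. db.1.cf, db.2.cf).
check_index() {
    local prefix=$1 status=0 found=0 f
    for f in "$prefix".*.cf; do
        [ -e "$f" ] || continue      # the glob matched nothing
        found=1
        if [ -s "$f" ]; then         # -s: file exists and is not empty
            echo "ok     $f"
        else
            echo "EMPTY  $f"
            status=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no index files found for prefix $prefix"
        status=1
    fi
    return "$status"
}
```

Calling `check_index db` at the end of the job script would make the job exit status reflect an empty or missing index file, rather than leaving it to be discovered later.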
I think for 1.3TB of sequences, you may need about 3TB of memory to build the index...
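As a rough back-of-the-envelope check, the estimate above (about 3TB of memory for 1.3TB of input) can be turned into a ratio and applied to other sizes. This is only a sketch: it assumes memory scales linearly with input size, which is an extrapolation from this single data point and not a documented property of centrifuge-build. Both function names are hypothetical.

```shell
#!/bin/bash
# Rough memory estimate based on the ratio quoted in this thread:
# ~3 TB of memory for ~1.3 TB of input (ratio 30/13).
# Assumption: linear scaling, which is a guess and not a documented guarantee.

# Estimated memory (GB) needed for a given input size (GB).
estimate_build_mem_gb() {
    local input_gb=$1
    echo $(( input_gb * 30 / 13 ))
}

# Largest input (GB) that would fit in a given amount of memory (GB).
max_input_gb() {
    local mem_gb=$1
    echo $(( mem_gb * 13 / 30 ))
}
```

Under that assumption, `max_input_gb 400` suggests that the 400GB requested in the job script would cover an input of roughly 170GB, far below the 1.3TB being indexed.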
@mourisl Thanks for your response. It's sad that I don't have sufficient memory available. I suppose I'll need to reduce the data size, perhaps by only keeping the representative genome for each species.
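One way to keep a single genome per species is to filter NCBI's assembly_summary.txt before downloading. The sketch below assumes the standard tab-separated layout of that file, with refseq_category in column 5, species_taxid in column 7 and ftp_path in column 20; check the header of your copy, since the layout and category labels can change between releases. `select_representatives` is a hypothetical helper name.

```shell
#!/bin/bash
# Keep at most one assembly per species from an assembly_summary.txt-style table.
# Assumed columns (tab-separated): $1 accession, $5 refseq_category,
# $7 species_taxid, $20 ftp_path.
select_representatives() {
    awk -F '\t' '
        /^#/ { next }                                   # skip header/comment lines
        ($5 == "reference genome" || $5 == "representative genome") \
            && !seen[$7]++ {                            # first hit per species only
            print $1 "\t" $7 "\t" $20
        }
    '
}

# Example (file names are placeholders):
#   select_representatives < assembly_summary.txt > representatives.tsv
```

The output lists accession, species taxid and download path for each selected assembly. Species without a reference or representative assembly are dropped entirely, so it is worth checking the counts to see whether that trade-off is acceptable for your use case.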
Centrifuge itself does not use k-mers. For the compression part, it uses 31-mers, but these k-mers are only used to cluster the more similar strains within a species, so the information is not directly used in the compression either.
For the recent RefSeq prokaryotic genomes, the data is too large: the index is above 80GB, which is beyond the Zenodo limit...