The second file of customer db is empty

Question

LilyAnderssonLee opened this issue a year ago · 4 comments

Hi, I am in the process of building a database from RefSeq data that covers bacteria, viruses, archaea, fungi, parasites, protozoa, plasmids, and even contaminants. The input data is quite large, around 1.3TB in size.

However, I've run into an issue where the second file, db.2.cf, always turns out empty. Has anyone else had this problem? Here is the code I've been using:

#!/bin/bash
#SBATCH -A xx
#SBATCH -p core
#SBATCH -n 50
#SBATCH -t 10-00:00:00
#SBATCH -J centrifuge_db
#SBATCH --mem=400GB
centrifuge-build -p 50 --bmax 3342177280 --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
    input-sequences.fna db
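
A quick way to confirm the symptom after a build is to list the index files for the prefix and flag any that are zero bytes. The following is a minimal sketch, not part of Centrifuge: `check_index_files` is a hypothetical helper name, and it assumes the index files follow the `<prefix>.N.cf` naming seen above (`db.2.cf`).

```bash
#!/bin/bash
# Hypothetical helper (not part of Centrifuge): list the index files written
# for a prefix and flag any that are empty.
# Returns 0 if all files are non-empty, 1 if any is empty, 2 if none exist.
check_index_files() {
    local prefix="$1"
    local status=0
    local f
    for f in "${prefix}".*.cf; do
        # An unmatched glob stays literal, so a missing file means no index files.
        if [ ! -e "$f" ]; then
            echo "no index files found for prefix: ${prefix}"
            return 2
        fi
        if [ -s "$f" ]; then
            echo "ok    $f"
        else
            echo "EMPTY $f"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_index_files db` at the end of the job script would make an empty `db.2.cf` show up in the Slurm log instead of being discovered later.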

Answer 1 · 2023-11-07T19:57:58.000Z

I think for 1.3TB of sequences, you may need about 3TB of memory to build the index...
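
To compare that estimate with what the job requested, the same ratio can be applied to other input sizes. This is only a rough sketch: the 3TB per 1.3TB ratio comes from the rough estimate above and is not a documented Centrifuge formula, and `estimate_build_mem_gb` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical rule of thumb: scale input size by the 3TB / 1.3TB ratio
# quoted in this thread. This is an assumption, not a Centrifuge guarantee.
estimate_build_mem_gb() {
    local input_gb="$1"
    awk -v s="$input_gb" 'BEGIN { printf "%.0f\n", s * 3.0 / 1.3 }'
}
```

For the 1.3TB input, `estimate_build_mem_gb 1300` gives a figure on the order of 3000GB, far above the 400GB requested with `--mem=400GB`.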

Answer 2 · 2023-11-09T11:50:19.000Z

@mourisl Thanks for your response. Unfortunately, I don't have that much memory available. I suppose I'll need to reduce the data size, perhaps by keeping only the representative genome for each species.
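
One possible way to do that reduction is to filter an NCBI `assembly_summary.txt` file on its `refseq_category` column. The sketch below assumes the standard tab-separated layout (column 1 assembly accession, column 5 `refseq_category`, column 7 `species_taxid`, column 20 `ftp_path`); `select_representatives` is a hypothetical helper name, and the selection may still leave more than one genome for some species.

```bash
#!/bin/bash
# Hypothetical helper: keep only assemblies marked as reference or
# representative genomes in an NCBI assembly_summary.txt file.
# Output columns: species_taxid, assembly accession, ftp_path.
select_representatives() {
    local summary="$1"
    awk -F'\t' '
        /^#/ { next }   # skip the comment/header lines
        $5 == "reference genome" || $5 == "representative genome" {
            print $7 "\t" $1 "\t" $20
        }
    ' "$summary"
}
```

Running `select_representatives assembly_summary.txt > representatives.tsv` would give a list of download paths to build the smaller input from.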

Answer 3 · 2023-11-22T10:28:57.000Z

@mourisl I am wondering what k-mer length was used for genome compression in the Centrifuge h+p+v+c database, or what the default k-mer length is in database construction?

Are you planning to update the Centrifuge databases or create Centrifuge databases based on all RefSeq genomes?

Answer 4 · 2023-11-22T15:22:09.000Z

Centrifuge itself does not use k-mers. The compression step uses 31-mers, but only to cluster the more similar strains within a species, so the k-mer information is not directly used in the compression either.

For the recent RefSeq prokaryotic genomes, the data is too large: the index size is above 80GB, which is beyond Zenodo's limit...