muellan/metacache

AFS20 database construction

Closed this issue · 4 comments

Hi, 

I am trying to reproduce some of the experimental results on my own machine, but I got stuck in the database construction.

It seems that the default database script only builds the NCBI RefSeq (Release 90) database. Is that correct? Assuming it is, I decided to build AFS20 with the commands for custom database construction, since I want to see those results. I downloaded the first 20 reference genomes of Table 1 from the NCBI database, decompressed them, and put all the .fna files in one folder. Then I ran:

  1) make MACROS="-DMC_TARGET_ID_TYPE=uint32_t"
  2) ./metacache build AFS20 Data/AFS20 -taxonomy genomes/taxonomy

However, the program has been stuck in the "Processing reference sequences." phase for more than 10 hours. Can you tell me what I am doing wrong?

I really appreciate any help you can provide.

Hard to tell what's wrong from your description, but it's definitely not normal that it takes so long.

  • You should probably make clean and make MACROS="-DMC_TARGET_ID_TYPE=uint32_t" again - just to be on the safe side
  • What does the command ./metacache info print? (just paste it in your answer)
  • If you try to build the database with the additional option -verbose - what does it print?

How do you provide the taxon ids for the genomes? (there are several ways, see here)

Thanks for the quick response. 

I did what you suggested. This is the output of ./metacache info after make clean and make MACROS="-DMC_TARGET_ID_TYPE=uint32_t":


MetaCache version 1.1.1 (20200309)
database version 20200323

sequence type std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >
target id type unsigned int 32 bits
target limit 4294967295

window id type unsigned int 32 bits
window limit 4294967295
window length 128
window stride 113

sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16

bucket size type unsigned char 8 bits
max. locations 254
location limit 254

hit classifier mc::best_distinct_matches_in_contiguous_window_ranges

With the -verbose option added, the last few lines of output are:

Reading sequence to taxon mappings from ncbi_taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from ncbi_taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
Copy_AFS20/GCA_000001635.8_GRCm38.p6_genomic.fna

And it gets stuck there.

Regarding the taxonomies, I downloaded and used NCBI's taxonomy and NCBI's bulk mapping files.

The info output looks normal.

So it seems like it gets stuck at the first file "Copy_AFS20/GCA_000001635.8_GRCm38.p6_genomic.fna". Can you try to run the build for this file only? (./metacache build test Copy_AFS20/GCA_000001635.8_GRCm38.p6_genomic.fna -taxonomy ncbi_taxonomy)
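If more than one file might be affected, the single-file test can be repeated for every genome with a time limit. The following is only a rough sketch: the names MC, LIMIT and build_each are made up for this example, the 30-minute default is an arbitrary assumption, and it relies on the timeout utility from GNU coreutils. The metacache call itself has the same form as the command above.

```shell
#!/bin/sh
# Build each genome on its own with a time limit, to find the file that hangs.
# MC, LIMIT and build_each are hypothetical names used only in this sketch.
MC=${MC:-./metacache}
LIMIT=${LIMIT:-1800}   # seconds per file; an arbitrary assumption

build_each() {
    dir=$1        # folder containing the .fna files
    taxonomy=$2   # folder containing the NCBI taxonomy files
    for f in "$dir"/*.fna; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        name=test_$(basename "$f" .fna)  # one throwaway database per genome
        status=0
        # timeout (GNU coreutils) exits with status 124 when the limit is hit
        timeout "$LIMIT" "$MC" build "$name" "$f" -taxonomy "$taxonomy" \
            > "$name.log" 2>&1 || status=$?
        if [ "$status" -eq 0 ]; then
            echo "ok $f"
        elif [ "$status" -eq 124 ]; then
            echo "timeout $f"
        else
            echo "failed($status) $f"
        fi
    done
}
```

Calling build_each Copy_AFS20 ncbi_taxonomy would then print one status line per genome and leave a log file per build in the current directory.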

I just did a test run with a fresh download of metacache and the genome, and it finished in 150 s. Maybe your genome download is corrupted; try downloading it again. You should also prefer the RefSeq versions of the genomes if available (e.g. GCF_000001635.26_GRCm38.p6_genomic.fna), or download the GenBank assembly summaries using download-ncbi-taxonomy <target directory> genbank for faster taxonomic mapping.
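A quick way to spot an obviously broken download before rebuilding is to look at the decompressed file itself. The helper below is a minimal sketch, not part of MetaCache; check_fasta is a hypothetical name. It only checks that the file is non-empty, starts with a FASTA header, and reports how many records and sequence characters it contains, which can then be compared with the assembly statistics published by NCBI. It cannot detect every kind of corruption.

```shell
#!/bin/sh
# Rough sanity check for a decompressed FASTA file.
# check_fasta is a hypothetical helper, not a MetaCache command.
check_fasta() {
    file=$1
    # The file must exist and be non-empty.
    [ -s "$file" ] || { echo "missing or empty: $file"; return 1; }
    # A FASTA file starts with a '>' header line.
    first=$(head -c 1 "$file")
    [ "$first" = ">" ] || { echo "no FASTA header at start: $file"; return 1; }
    # Count header lines (records) and sequence characters (bases).
    records=$(grep -c '^>' "$file")
    bases=$(grep -v '^>' "$file" | tr -d '\n\r' | wc -c | tr -d ' ')
    echo "$file: $records records, $bases bases"
}
```

For example, check_fasta Copy_AFS20/GCA_000001635.8_GRCm38.p6_genomic.fna prints the record and base counts for the mouse genome; a count far below the published assembly size would suggest a truncated download.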

Can you give some information about your machine? Especially how much RAM and which OS and compiler you are using?

I see. Ok, I'll download the files in the RefSeq format and try again then. Thanks.

My current machine is a cluster machine with 192GB DDR4 RAM and Ubuntu 20.04.1 LTS, Kernel 5.4.0-48-generic, as its OS.