AFS20 database construction
Hi,
I am trying to reproduce some of the experimental results on my own machine, but I got stuck in the database construction.
It seems that the default database script only builds the NCBI RefSeq (Release 90) database. Is that correct? Assuming it is, I decided to build AFS20 using the commands for custom database construction, since I want to see those results. So I downloaded the first 20 reference genomes of Table 1 from the NCBI database, decompressed them, and put all the .fna files in one folder. Then I simply ran 1) make MACROS="-DMC_TARGET_ID_TYPE=uint32_t" and 2) ./metacache build AFS20 Data/AFS20 -taxonomy genomes/taxonomy.
However, the program has been stuck in the "Processing reference sequences." phase for more than 10 hours. Can you tell me what I am doing wrong here?
I really appreciate any help you can provide.
Hard to tell what's wrong from your description, but it's definitely not normal that it takes so long.
- You should probably run make clean and make MACROS="-DMC_TARGET_ID_TYPE=uint32_t" again, just to be on the safe side.
- What does the command ./metacache info print? (Just paste it in your answer.)
- If you try to build the database with the additional option -verbose, what does it print?
How do you provide the taxon ids for the genomes? (There are several ways, see here.)
Thanks for the quick response.
I did do what you said. This is the output for ./metacache info, after make clean and make MACROS="-DMC_TARGET_ID_TYPE=uint32_t":
MetaCache version 1.1.1 (20200309)
database version 20200323
sequence type std::__cxx11::basic_string<char, std::char_traits, std::allocator >
target id type unsigned int 32 bits
target limit 4294967295
window id type unsigned int 32 bits
window limit 4294967295
window length 128
window stride 113
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash >
feature type unsigned int 32 bits
feature hash mc::same_size_hash
kmer size 16
kmer limit 16
sketch size 16
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
hit classifier mc::best_distinct_matches_in_contiguous_window_ranges
After adding the -verbose option, the last few lines look like this:
Reading sequence to taxon mappings from ncbi_taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from ncbi_taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
Copy_AFS20/GCA_000001635.8_GRCm38.p6_genomic.fna
And it gets stuck here.
Regarding the taxonomies, I downloaded and used NCBI's taxonomy and NCBI's bulk mapping files.
The info output looks normal.
So it seems like it gets stuck at the first file "Copy_AFS20/GCA_000001635.8_GRCm38.p6_genomic.fna". Can you try to run the build for this file only? (./metacache build test Copy_AFS20/GCA_000001635.8_GRCm38.p6_genomic.fna -taxonomy ncbi_taxonomy)
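If more than one genome might be affected, the single-file test can be repeated for every file with a time limit. The following is only a sketch: it reuses the command form quoted above and relies on `timeout` from GNU coreutils. The helper name `check_each`, the `METACACHE` and `TAXONOMY` variables, and the 600-second limit in the usage line are assumptions made for illustration, not part of MetaCache.

```shell
#!/bin/sh
# Assumed locations, overridable from the environment.
METACACHE="${METACACHE:-./metacache}"
TAXONOMY="${TAXONOMY:-ncbi_taxonomy}"

# check_each LIMIT_SECONDS FILE...
# Builds a throwaway database named "test" from each file on its own and
# reports whether the build finished, hit the time limit, or failed.
check_each() {
    limit="$1"
    shift
    for f in "$@"; do
        timeout "$limit" "$METACACHE" build test "$f" -taxonomy "$TAXONOMY" \
            >/dev/null 2>&1
        status=$?
        case "$status" in
            0)   echo "OK      $f" ;;
            124) echo "TIMEOUT $f" ;;   # GNU timeout exits with 124 on expiry
            *)   echo "FAILED  $f (exit $status)" ;;
        esac
    done
}

# Example call (not executed here):
#   check_each 600 Copy_AFS20/*.fna
```

Files reported as TIMEOUT or FAILED are the ones worth downloading again or inspecting first.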
I just did a test run with a fresh download of metacache and the genome, and it finished in 150s. Maybe your genome download is corrupted; try downloading it again. You should also prefer the RefSeq versions of the genomes if available (e.g. GCF_000001635.26_GRCm38.p6_genomic.fna), or download the GenBank assembly summaries using download-ncbi-taxonomy <target directory> genbank for faster taxonomic mapping.
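Before downloading everything again, a quick sanity check of the decompressed files can catch obviously broken downloads. The sketch below is a heuristic only: the helper name `check_fasta` is hypothetical, and a file that passes may still be damaged in ways these checks do not see.

```shell
#!/bin/sh
# check_fasta FILE
# Prints a one-line verdict; returns 1 if the file is clearly not usable.
check_fasta() {
    f="$1"
    if [ ! -s "$f" ]; then
        echo "EMPTY     $f"
        return 1
    fi
    # A FASTA file should start with a '>' header line.
    IFS= read -r first < "$f"
    case "$first" in
        ">"*) ;;
        *) echo "NOT_FASTA $f"; return 1 ;;
    esac
    n=$(grep -c '^>' "$f")
    # Command substitution strips a trailing newline, so a non-empty result
    # means the last byte is not a newline (a possible sign of truncation).
    if [ -n "$(tail -c 1 "$f")" ]; then
        echo "WARN      $f ($n sequence headers, no trailing newline)"
        return 0
    fi
    echo "OK        $f ($n sequence headers)"
}

# Example call (not executed here):
#   for f in Copy_AFS20/*.fna; do check_fasta "$f"; done
```

If the original .gz archives are still available, `gzip -t` on each archive is a stronger test of the download itself.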
Can you give some information about your machine? Especially how much RAM and which OS and compiler you are using?
I see. Ok, I'll download the files in the RefSeq format and try again then. Thanks.
My current machine is a cluster machine with 192GB DDR4 RAM and Ubuntu 20.04.1 LTS, Kernel 5.4.0-48-generic, as its OS.
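The reply above covers RAM and OS but not the compiler. A small sketch that gathers all three on a Linux machine is shown below. The helper name `report_machine` is hypothetical, and checking for `g++` and `clang++` is an assumption about which compilers might be installed.

```shell
#!/bin/sh
# report_machine: print kernel, OS, RAM, and C++ compiler versions.
report_machine() {
    echo "kernel: $(uname -sr)"
    if [ -r /etc/os-release ]; then
        # Source in a subshell so the variables do not leak into the caller.
        ( . /etc/os-release; echo "os: ${PRETTY_NAME:-unknown}" )
    fi
    if [ -r /proc/meminfo ]; then
        # MemTotal is reported in kB; convert to GiB.
        awk '/^MemTotal:/ {printf "ram: %.1f GiB\n", $2 / 1048576}' /proc/meminfo
    fi
    for cxx in g++ clang++; do
        if command -v "$cxx" >/dev/null 2>&1; then
            echo "compiler: $("$cxx" --version | head -n 1)"
        fi
    done
}

report_machine
```

The output can be pasted into the issue as-is.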