muellan/metacache

Building a custom DB

donovan-parks opened this issue · 2 comments

Hi,

I'm aiming to build a custom MetaCache DB by specifying the NCBI Taxon IDs for my genomes through the assembly_summary.txt file. However, it appears MetaCache might make some assumptions about either the genome filenames or the contig IDs. I have the following test setup with 2 genomes:

Genome files:

A.fna
B.fna

assemby_summary.txt:

# Created by build_custom_metacache_db.py
# assembly_accession	taxid
A.fna	1423
B.fna	1423

MetaCache results:

> metacache build test_db genomes -taxonomy /data/ncbi_taxonomy
Building new database 'copper_custom_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 2450976 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.2.1 (20220112)
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned int 32 bits
target limit         4294967295
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Reading sequence to taxon mappings from genomes/assembly_summary.txt
Reading sequence to taxon mappings from /data/wgs_pipeline/metacache_db/prok_viral/20211207/db_build_data/ncbi_taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from /data/wgs_pipeline/metacache_db/prok_viral/20211207/db_build_data/ncbi_taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
Added 45 reference sequences in 3.623 s
targets              45
ranked targets       0
taxa in tree         2450976
------------------------------------------------
buckets              1500007
bucket size          max: 1 mean: 1 +/- 0 <> -nan
features             5472
dead features        0
locations            5472
------------------------------------------------
1 targets are unranked.
1 targets remain unranked.
Writing database to file ... Writing database metadata to file 'test_db.meta' ... done.
Writing database part to file 'test_db.cache0' ... done.
done.
Total build time: 4.477 s

The output suggests it is unable to identify my genomes since ranked targets is 0 and it indicates there is 1 target that remains unranked. Do I need to give the input genomes filenames with a certain format? Or do the contig IDs need to have a specific format? I've also tried changing the assembly__summary.txt file to just use the assembly_accession A and B without the .fna extension.

The two genome files and the assembly__summary.txt file are in the genome directory.

Any insight into what may be happening would be much appreciated.

Thanks,
Donovan

Hi Donovan!

You are correct, MetaCache expects to find an NCBI style accession number in the file name. This will be extracted and used for lookup in the assemby_summary.txt. You could use a dummy accession number for your files or the accession number of the first sequence in each file (if applicable).

Alternatively you could use the 2nd or 3rd option from the build guide to map each sequence to a taxid.

Thanks.