Building a custom DB
donovan-parks opened this issue · 2 comments
Hi,
I'm aiming to build a custom MetaCache DB by specifying the NCBI Taxon IDs for my genomes through the assembly_summary.txt
file. However, it appears MetaCache might make some assumptions about either the genome filenames or the contig IDs. I have the following test setup with 2 genomes:
Genome files:
A.fna
B.fna
assemby_summary.txt:
# Created by build_custom_metacache_db.py
# assembly_accession taxid
A.fna 1423
B.fna 1423
MetaCache results:
> metacache build test_db genomes -taxonomy /data/ncbi_taxonomy
Building new database 'copper_custom_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 2450976 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version 2.2.1 (20220112)
database version 20200820
------------------------------------------------
sequence type mc::char_sequence
target id type unsigned int 32 bits
target limit 4294967295
------------------------------------------------
window id type unsigned int 32 bits
window limit 4294967295
window length 127
window stride 112
------------------------------------------------
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16
------------------------------------------------
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
------------------------------------------------
Reading sequence to taxon mappings from genomes/assembly_summary.txt
Reading sequence to taxon mappings from /data/wgs_pipeline/metacache_db/prok_viral/20211207/db_build_data/ncbi_taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from /data/wgs_pipeline/metacache_db/prok_viral/20211207/db_build_data/ncbi_taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
Added 45 reference sequences in 3.623 s
targets 45
ranked targets 0
taxa in tree 2450976
------------------------------------------------
buckets 1500007
bucket size max: 1 mean: 1 +/- 0 <> -nan
features 5472
dead features 0
locations 5472
------------------------------------------------
1 targets are unranked.
1 targets remain unranked.
Writing database to file ... Writing database metadata to file 'test_db.meta' ... done.
Writing database part to file 'test_db.cache0' ... done.
done.
Total build time: 4.477 s
The output suggests it is unable to identify my genomes since ranked targets
is 0 and it indicates there is 1 target that remains unranked. Do I need to give the input genomes filenames with a certain format? Or do the contig IDs need to have a specific format? I've also tried changing the assembly__summary.txt
file to just use the assembly_accession
A and B without the .fna
extension.
The two genome files and the assembly__summary.txt
file are in the genome
directory.
Any insight into what may be happening would be much appreciated.
Thanks,
Donovan
Hi Donovan!
You are correct, MetaCache expects to find an NCBI style accession number in the file name. This will be extracted and used for lookup in the assemby_summary.txt
. You could use a dummy accession number for your files or the accession number of the first sequence in each file (if applicable).
Alternatively you could use the 2nd or 3rd option from the build guide to map each sequence to a taxid.
Thanks.