Unranked targets when building custom DB
donovan-h-parks opened this issue · 5 comments
Hi,
I'm building a custom DB from a large set of genome files. I'm indicating the TaxonId of each sequence using the NCBI-style accession2taxid tab-separated files
. When I build the DB, it appears some sequences are not being given a rank. Specifically, the output of metacache build
indicates 262383 targets remain unranked
. Does this mean that MetaCache identified and placed sequenced in the DB that do not have a taxonomic assignment (i.e. associated TaxonId)? Is there any way to determine which sequences remain unranked?
Thanks,
Donovan
>metacache build gtdb_r207_db ./gtdb_reps/genomes/ -taxonomy ./gtdb_taxonomy/R207 -reset-taxa -taxpostmap accn_to_taxid.tsv
Building new database 'gtdb_r207_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 401816 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version 2.2.3 (20220708)
database version 20200820
------------------------------------------------
sequence type mc::char_sequence
target id type unsigned int 32 bits
target limit 4294967295
------------------------------------------------
window id type unsigned int 32 bits
window limit 4294967295
window length 127
window stride 112
------------------------------------------------
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16
------------------------------------------------
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
------------------------------------------------
Processing reference sequences.
Added 8302685 reference sequences in 14348.7 s
targets 8302685
ranked targets 0
taxa in tree 401816
------------------------------------------------
buckets 964032481
bucket size max: 254 mean: 46.7784 +/- 61.2667 <> 2.0026
features 518555067
dead features 0
locations 24257165919
------------------------------------------------
8302685 targets are unranked.
Try to map sequences to taxa using 'accn_to_taxid.tsv' (381 MB)
262383 targets remain unranked.
Writing database to file ... Writing database metadata to file 'gtdb_r207_db.meta' ... done.
Writing database part to file 'gtdb_r207_db.cache0' ... done.
done.
Does this mean that MetaCache identified and placed sequenced in the DB that do not have a taxonomic assignment (i.e. associated TaxonId)?
Yes, exactly.
Is there any way to determine which sequences remain unranked?
Unfortunately, there is no direct way. I think the best way would be to generate a list of all targets using
matacache info <database> lin
and check if the columns of the lineages are 0 (which means no valid TaxonId).
Hi,
Thanks for suggesting metacache info <database> lin
. I've been able to use this to identify that the issue relates to FASTA files from NCBI that have accessions of a specific form, e.g. NZ_CAJRAF010000001.1
. It appears MetaCache munges this name and records it as CAJRAF010000001.1
. This appears to be related to the length of the accession as NZ_AUDU01000044.1
is retained as NZ_AUDU01000044.1
.
Is this the expected operation of MetaCache and I should take care to remove any characters before an initial underscore from accessions >18 characters when providing a mapping file to -taxpostmap
? This seems like a brittle rule for me to implement so was hoping you could provide some details on how MetaCache modifies accessions.
Thanks,
Donovan
Hi,
accession numbers are a total mess. We use a regex to identify NCBI-style accession or accession.version sequence identifiers. For some reason that I don't remember we only allow the letter part to be 7 characters long (including the underscore).
If you want a super quick fix, go to the file "src/sequence_io.cpp" line 471 and replace the regex "(^|[^[:alnum:]])(([A-Z][_A-Z]{1,6}[0-9]{5,})(\\.[0-9]+)?)"
with "(^|[^[:alnum:]])(([A-Z][_A-Z]{1,9}[0-9]{5,})(\\.[0-9]+)?)"
(notice that the 6 got replaced with 9 which allows up to 10 letter characters at the start of the accession id) and recompile metacache.
That should solve your problem. I think we'll include such a change in the next release if it doesn't break anything.
André
Hi,
Thanks. I can look to implement the same regex expression to ensure consistency. Why is modifying accessions necessary? I'd like to add in my own genomes that don't necessarily have NCBI-style accessions. This is made a bit more complicated if I have to account for changes that might be made by MetaCache.
Thanks,
Donovan