Sets up a PostgreSQL database based by default on the latest taxonomy files provided by NCBI and the nt
reference dataset. It resolves for each accession the taxonomic ID (taxID) it is assigned to by name.
tactac provides functions that let you
- annotate fasta headers by their accessions, e.g. as needed for lowest common ancestor computations
- collect all accessions for a given taxID, e.g. for computing reference coverage rates for taxonomic branches or for subsampling reference sequences
- perform taxonomic binning, i.e. distribute the library into 1 << x bins (fasta files) such that each bin represents a clade, e.g. as a pre-processing step for YARA
- Operating Systems: Linux, MacOS
- Python3
- PostgreSQL Database Server
- Optionally a PostgreSQL database client, e.g. SQL Tabs
Start the PostgreSQL server in its default configuration. If you want to change configuration settings such as I/O paths or connection details, edit config.py. See the complete set of options with:
python tactac.py --help
tactac relies on a set of files provided by NCBI, namely nodes.dmp, names.dmp, and taxidlineage.dmp, in order to create the full taxonomy and assign names to taxIDs.
python tactac.py --download tax
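To give a feel for what the taxonomy dump contains, here is a minimal sketch of reading nodes.dmp into a child-to-parent map and walking a lineage. It is not tactac's own parser: the function names `parse_nodes` and `lineage` are hypothetical, and the sample lines are truncated to the first three columns (the real file carries more). It assumes the NCBI dump layout, where fields are separated by a tab, a pipe and a tab, and the root node lists itself as its parent.

```python
import io


def parse_nodes(handle):
    """Return {taxid: parent_taxid} from an iterable of nodes.dmp lines."""
    parents = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Each line ends with "\t|"; fields are separated by "\t|\t".
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        parents[int(fields[0])] = int(fields[1])
    return parents


def lineage(parents, taxid):
    """Walk from taxid up to the root, which is its own parent."""
    path = [taxid]
    while parents[taxid] != taxid:
        taxid = parents[taxid]
        path.append(taxid)
    return path


# Truncated sample lines (taxid, parent taxid, rank only).
sample = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "131567\t|\t1\t|\tno rank\t|\n"
    "2\t|\t131567\t|\tsuperkingdom\t|\n"
)
parents = parse_nodes(sample)
print(lineage(parents, 2))
```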
tactac resolves NCBI accession IDs to taxonomic identifiers. To do so, by default the complete nucleotide data set (nt) is downloaded in fasta format. You can change the reference data set by editing the provided configuration file.
python tactac.py --download ref
By default, the latest nucl_gb.accession2taxid.gz file from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid will be downloaded. To change this, edit config.py.
python tactac.py --download acc2tax
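The downloaded mapping is a gzip-compressed, tab-separated table with a header row. The sketch below shows one way such a file could be streamed into (accession, taxid) pairs. It assumes the column names `accession` and `taxid` used by NCBI's accession2taxid files; `read_acc2taxid` is a hypothetical helper, not part of tactac, and the rows are a tiny in-memory stand-in for the real file.

```python
import gzip
import io


def read_acc2taxid(handle):
    """Yield (accession, taxid) pairs from accession2taxid text lines."""
    header = next(handle).rstrip("\n").split("\t")
    acc_col = header.index("accession")
    tax_col = header.index("taxid")
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        yield fields[acc_col], int(fields[tax_col])


# Illustrative rows only; compressed in memory to mimic the .gz download.
raw = (
    "accession\taccession.version\ttaxid\tgi\n"
    "A00001\tA00001.1\t10641\t58418\n"
    "A00002\tA00002.1\t9913\t2\n"
)
compressed = gzip.compress(raw.encode())

# gzip.open accepts a file object; "rt" decodes the stream as text.
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    pairs = list(read_acc2taxid(handle))
print(pairs)
```

Streaming line by line matters here because the real file holds several hundred million rows.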
Alternatively, to download all files at once, use the all option.
python tactac.py --download all
Start the installed PostgreSQL server. With the build option, a new database will be created. The following default settings are assumed:
Server name | Host Name | Port | User Name |
---|---|---|---|
'local' | 'localhost' | 5432 | 'postgres' |
python tactac.py --build --password <pwd>
In case filling the Accessions table got interrupted (nucl_gb.accession2taxid currently contains 258,753,642 entries), you can continue the filling by setting the --continue flag. Optionally, you can give a hint in terms of the last successfully inserted accession. The parser will then jump to the succeeding line and omit the database queries that test for presence. Note that continuation leaves the other three tables, i.e. Node, Names, and Lineage, unmodified!
python tactac.py --build --continue [<last_ins_>] --password <pwd>
or with personalized configurator
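As a rough illustration of the --continue hint described above, the following sketch skips everything up to and including the last successfully inserted accession and yields only the remaining rows. `resume_after` is a hypothetical name and the two-column rows are made up; how tactac implements the jump internally is not shown in this document.

```python
def resume_after(lines, last_inserted=None):
    """Yield the rows that still need to be inserted.

    With a hint, every row up to and including the one whose first
    column equals `last_inserted` is skipped without any database query.
    Without a hint, all rows are yielded and the caller would have to
    test each accession for presence itself.
    """
    if last_inserted is None:
        yield from lines
        return
    found = False
    for line in lines:
        if found:
            yield line
        elif line.split("\t", 1)[0] == last_inserted:
            found = True


rows = ["A00001\t10641", "A00002\t9913", "A00003\t9913"]
remaining = list(resume_after(rows, last_inserted="A00002"))
print(remaining)
```

Note that in this sketch a hint that never occurs in the input yields nothing, so a real implementation would want to report that case.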
Give a single accession number (without the version suffix) as a command line argument, or give a file. The file has to be either in fasta format, in which case any sequence data it contains is ignored, or a list of accessions (one per row). Whenever input values are given directly in the terminal, the output is also printed to the terminal; otherwise a result file is generated.
python tactac.py --acc2tax [<file>|<acc>]
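The two accepted file formats can be told apart by the presence of fasta header lines. Below is a minimal sketch of that idea, assuming the accession is the first whitespace-delimited token of each header and that a version suffix follows a dot; `extract_accessions` is a hypothetical helper, not tactac's own code.

```python
def extract_accessions(lines):
    """Collect accessions from fasta headers or a one-per-row list."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    is_fasta = any(ln.startswith(">") for ln in lines)
    accessions = []
    for ln in lines:
        if is_fasta:
            if not ln.startswith(">"):
                continue  # sequence data is ignored
            parts = ln[1:].split()
            if not parts:
                continue  # header without an identifier
            token = parts[0]
        else:
            token = ln
        # Drop the version suffix, e.g. "A00001.1" becomes "A00001".
        accessions.append(token.split(".", 1)[0])
    return accessions


fasta_input = [">NC_000913.3 some description", "ACGT", ">A00001.1 other"]
list_input = ["A00001", "A00002.1"]
print(extract_accessions(fasta_input))
print(extract_accessions(list_input))
```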
For each taxonomic node in the subtree rooted at the taxID given via command line (the acc argument in the usage line) or listed row-wise in a file, the list of directly associated accessions is output in the row format taxid,acc1,acc2,... The output behaviour is the same as for acc2tax.
python tactac.py --tax2acc [<file>|<acc>]
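The row format can be illustrated with a small in-memory traversal. This sketch assumes a child-to-parent map and a map from taxID to directly assigned accessions, both made up here; `tax2acc_rows` is a hypothetical function, and tactac itself answers this query from the database. Whether nodes without any accession appear in tactac's output is not stated above, so the sketch simply emits them with the taxID alone.

```python
from collections import defaultdict


def tax2acc_rows(parents, direct_accessions, root):
    """Return one 'taxid,acc1,acc2,...' row per node in the subtree."""
    children = defaultdict(list)
    for taxid, parent in parents.items():
        if taxid != parent:  # the root is its own parent
            children[parent].append(taxid)

    rows = []
    stack = [root]
    while stack:
        taxid = stack.pop()
        accs = direct_accessions.get(taxid, [])
        rows.append(",".join([str(taxid), *accs]))
        # Push in reverse so smaller taxIDs are visited first.
        stack.extend(sorted(children[taxid], reverse=True))
    return rows


parents = {1: 1, 2: 1, 3: 2, 4: 2}
direct = {2: ["A1"], 3: ["B1", "B2"]}
rows = tax2acc_rows(parents, direct, root=2)
print("\n".join(rows))
```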
Splits your reference library (e.g. given by the raw nt fasta file) with respect to the taxonomy. This is an important pre-processing step for taxonomy-aware indexing with DREAM-Yara. Make sure your PostgreSQL server is running and the previous steps have finished. To use all accessions currently in your database, omit the argument; by default it is set to all. You may have to pass the database user password (--password <pwd>). The binning output directory is set in the configuration file config.py (see BINNING_DIR).
python tactac.py --binning [<src_dir>|default='all'] [--num_bins <num_bins>]
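Since the library is distributed into 1 << x bins, the number of bins is a power of two. The sketch below shows how a requested bin count could be checked and converted to the exponent x. It is only an illustration: `bin_exponent` is a hypothetical helper, and whether tactac validates --num_bins this way is an assumption.

```python
def bin_exponent(num_bins):
    """Return x such that num_bins == 1 << x, or raise ValueError."""
    # A positive power of two has exactly one bit set, so n & (n - 1) == 0.
    if num_bins < 1 or num_bins & (num_bins - 1):
        raise ValueError(f"{num_bins} is not a power of two")
    return num_bins.bit_length() - 1


print(bin_exponent(64))
```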
Create a subset of the complete taxonomy database given a root taxid. Four files will be generated:
- Taxonomy file (*.tax), i.e. the taxonomic clade rooted by taxid in csv format #taxid, parent_taxid stored in tactac/subset/<taxid>/root_<taxid>.tax
- Taxid table (*.acc) with directly assigned accessions, i.e. all accessions assigned to the clade in csv format #taxid,acc1,acc2,... stored in tactac/subset/<taxid>/root_<taxid>.accs
- Sequence file (*.fasta) containing all references assigned to the clade
- Positional map (ID2acc), i.e. a 1-based counter corresponding to an accession and its position within the above generated fasta file
python tactac.py --subtree <taxid:int> --password <pwd>
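The positional map described above can be pictured as numbering the records of the generated fasta file from 1. Here is a minimal sketch of that idea, assuming the accession is the first whitespace-delimited token of each header; `id2acc` is a hypothetical function, and the exact on-disk layout of tactac's ID2acc file is not specified in this document.

```python
def id2acc(fasta_lines):
    """Map the 1-based record position to the accession in its header."""
    mapping = {}
    position = 0
    for line in fasta_lines:
        if line.startswith(">"):
            position += 1
            parts = line[1:].split()
            mapping[position] = parts[0] if parts else ""
    return mapping


fasta = [
    ">A00001.1 first record",
    "ACGTACGT",
    ">A00002.1 second record",
    "TTGACA",
    "GGCC",
]
positions = id2acc(fasta)
print(positions)
```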
[1] Federhen, S. (2012). The NCBI Taxonomy database. Nucleic Acids Research, 40(D1), D136–D143. http://doi.org/10.1093/nar/gkr1178