/taxalogue

taxalogue - a toolkit to create comprehensive CO1 reference databases

Primary LanguageRubyGNU General Public License v3.0GPL-3.0

taxalogue

taxalogue collects DNA sequences from several online sources (BOLD, GenBank & GBOL) and combines them to a reference database. The reference database is useable in the taxonomic assignment step of metabarcoding analyses. Taxonomic incongruencies between the different data sources can be harmonized with respect to available taxonomies. Various filtering options are available regarding sequence quality or metadata information.


Table of contents


Prerequisites

You need to have ruby installed. taxalogue is tested for ruby versions 2.6.6 - 3.2.2. If you use Ubuntu 22+ you need to have at least ruby version 3.1.0. For detailed instructions go to https://www.ruby-lang.org/en/documentation/installation/

Additionally quite a bit of disk space is needed. The concrete size depends on the used taxon, but 40 Gigabytes of free disk space should be considered as a minimum requirement.

At last, taxalogue needs time. Especially the sequence download of species-rich taxa will take a while (e.g.: Arthropoda download takes more than a day). This is mainly because of download speed throttling by the source databases, but also because of occasional waiting times to not overload the providers.


Getting started

  1. Get the latest release

     wget https://github.com/nwnoll/taxalogue/archive/refs/tags/v1.1.0.tar.gz
    
  2. Unzip the file to a location of your choice

     tar xzf v1.1.0.tar.gz
    
  3. Change to the directory where you unzipped the source code

     cd taxalogue-1.1.0
    
  4. Install all the dependencies

     bundle lock --update        
     bundle install
    
     # If you do get an error message like: Could not find 'bundler' (2.4.10) required by your Gemfile.lock. (Gem::GemNotFoundException)
     
     # try the following (remove the '#'):
     # gem install bundler:2.4.10
     # bundle update --bundler
     # bundle lock --update
     # bundle install
    
  5. Download records from up to three different source databases, for BOLD manually download the datapackages.

     # Since BOLD now offers to download a snapshot of the current database, it is recommended to use it
     # The datapackages are listed at https://boldsystems.org/index.php/datapackages
     # Choose the most current snapshot and download "Data package (tar.gz compressed)"
     # For this step you need to be logged in, and therefore an account is needed.
     # After you have downloaded the datapackage, extract the .tsv file
     
     # If you want to create a database for e.g., Artrhopoda from GBOL, GenBank (here the MIDORI version) and you already downloaded the aforementioned 
     # BOLD datapackage, you can use the command below to download data from MIDORI and GBOL
     bundle exec ruby taxalogue.rb --taxon Arthropoda download --gbol --midori
    

If this is the first start of taxalogue, it starts by downloading taxonomies from NCBI and GBIF. After download, the taxonomies will be imported into a SQL database. The whole process might take a little less than 2 hours. If the setup for the taxonomies is complete, the download of the actual sequences begins. As mentioned earlier the duration of the download depends on the chosen taxon.

  1. Classify the records. Change the directory pathes according to your own BOLD release location (.tsv).

     bundle exec ruby taxalogue.rb --taxon Arthropoda classify --midori --gbol --bold_release /path/to/bold_release.tsv
    
  2. Have a look at the results:

     ├── results/
     │   ├── Arthropoda-20210317T1604/
     │   │   ├── merged_output.tsv
     │   │   ├── merged_output.fas
     │   │   ├── merged_comparison.tsv
     │   │   ├── midori_download_info.txt
     │   │   ├── gbol_download_info.txt
     │   │   ├── bold_download_info.txt
     │   │   ├── taxalogue.txt
    
  3. Done!

If this is the first start of taxalogue, it has to download and setup taxonomies. These are needed to correctly assign taxonomic lineage information in case of missing or erroneous data. Since these are several Gigabyte of data, this might take while. The download time for the actual sequence data from BOLD, GenBank and GBOL depends on the taxon, but will also take quite some time. After the data are downloaded, the taxonomic lookup, mapping and formatting starts and the output files get created.

taxalogue without any options will by default create a reference database for the taxon Arthropoda. The default marker is co1 and also the only possible marker choice at the moment. Additional marker are planned for coming updates. The default taxonomy is from NCBI.

The output files are stored in the results folder, depending on your taxon choice and time there will be a folder called something like Arthropoda-20210317T1604. The first part is the used taxon and the second part shows date and time of the pogram start.

The download/contaminants folder contains sequences of frequent unwanted HTS generated by-products. These files can be used as separate reference databases to exclude these most probably unwanted reads from further analysis. Wolbachia is an endoparasite of many insects and could lead to misclassification if not excluded.

The main output files are merged_output.fas and merged_output.tsv if dereplication is not activated;Arthropoda_derep_all_output.tsv and Arthropoda_derep_all_output.fas otherwise. These will represent the reference databases. The tsv has some additional metadata information, e.g. the location where the specimen has been found. Since the different source databases contain records that have misspelings and or are considered as synonyms, there might be changes to the taxonomic information compared to the original. These are shown in merged_comparison.tsv or Arthropoda_derep_all_comparison.tsv. Go to the taxonomy section for more information.


Additional filtering

Requires vsearch to remove possible contaminants

## replace spaces with underscores, vsearch does not allow spaces
ruby scripts/underscore_fasta.rb results/Arthropoda-20210317T1604/Arthropoda_derep_all_output.fas > Arthropoda_derep_all_output_uc.fas

## remove sequences with stop codons and correct reverse complements, --genetic code 5 for invertebrates
ruby scripts/stop_codon_filter_and_rc_correcton.rb --input Arthropoda_derep_all_output_uc.fas --output Arthropoda_derep_all_output_uc_STPf.fas --genetic_code 5 --filter_info STPf.tsv

## remove gaps from sequences
ruby scripts/degap_fasta.rb Arthropoda_derep_all_output_uc_STPf.fas > Arthropoda_derep_all_output_uc_STPf_dg.fas

## remove possible contaminants
cat results/Arthropoda-20210317T1604/Homo_sapiens_output.fas results/Arthropoda-20210317T1604/Wolbachia_output.fas > contaminants.fas
vsearch --usearch_global Arthropoda_derep_all_output_uc_STPf_dg.fas --db contaminants.fas --maxaccepts 1 --maxrejects 0 --id 0.9 --dbmask none --qmask none --threads 32 --blast6out contaminants_search.b6 --matched matched_contaminants.fas --notmatched Arthropoda_derep_all_output_uc_STPf_dg_CONTf.fas

## remove sequences with more than 3 Ns
ruby scripts/filter_Ns.rb Arthropoda_derep_all_output_uc_STPf_dg_CONTf.fas Nf3_discarded.fas 3 > Arthropoda_derep_all_output_uc_STPf_dg_CONTf_Nf3.fas

## remove sequences that have less than 400 bp and more than 1569 bp
ruby scripts/length_filter.rb Arthropoda_derep_all_output_uc_STPf_dg_CONTf_Nf3.fas 400 1569 STPf_dg_CONTf_Nf3_Lf400_1569.tsv > Arthropoda_derep_all_output_uc_STPf_dg_CONTf_Nf3_Lf400_1569.fas

## some sequences had missing information, e.g. order info was missing => sequence taxon info was removed until class level
ruby scripts/remove_lower_than_missing_taxon_info.rb Arthropoda_derep_all_output_uc_STPf_dg_CONTf_Nf3_Lf400_1569.fas > Arthropoda_derep_all_output_uc_STPf_dg_CONTf_Nf3_Lf400_1569_TaxR.fas

Citation

Noll NW, Scherber C, Schäffler L. 2023. taxalogue: a toolkit to create comprehensive CO1 reference databases. PeerJ 11:e16253 https://doi.org/10.7717/peerj.16253

If you used the --midori option, please also cite: Leray, M., Knowlton, N., & Machida, R. J. (2022). MIDORI2: A collection of quality controlled, preformatted, and regularly updated reference databases for taxonomic assignment of eukaryotic mitochondrial sequences. Environmental DNA, 4, 894–907. https://doi.org/10.1002/edn3.303

If you used VSEARCH for additional filtering, please also cite: Rognes T, Flouri T, Nichols B, Quince C, Mahé F. 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584 https://doi.org/10.7717/peerj.2584


Basics

This section explains the basic functionalities of taxalogue. Additionally, suggested pipelines and useful tips will be shown.


Basic Usage

The basic usage looks like:

bundle exec ruby taxalogue.rb [general_options] [subcommand [subcommand_options]]

To get an overview of all general options and all available subcommands use --help

bundle exec ruby taxalogue.rb --help

If you want to see the available options for a subcommand use subcommand --help:

bundle exec ruby taxalogue.rb filter --help

If you want to specify general options (--taxon, --markers, --fast_run, num_threads, version) you have to do it right after bundle exec ruby taxalogue.rb:

bundle exec ruby taxalogue.rb --taxon Orthoptera

All the subcommands and the subcommand options will then come right after it:

bundle exec ruby taxalogue.rb --taxon Orthoptera download --gbol

The subcommands associated subcommand options should be listed right after the subcommand:

bundle exec ruby taxalogue.rb --taxon Orthoptera classify --gbol filter --taxon_rank species --max_N 2 region --country "Germany"

The last command would classify downloaded sequences from GBOL for the taxon Orthoptera with sequences from GBOL. It would only contain sequences with at least species information, that have a maximum of 2 Ns and belong to specimens collected in Germany.


download

This command only downloads sequences for the specified taxon and does not do any other thing like classification or filtering etc.. Options can be used to get sequences from all available source databases (BOLD, GBOL and GenBank) or just for example from GBOL and BOLD.

The results will be written into the downloads folder:

    ├── downloads/
    │   ├── BOLD/
    │   │   ├── Arthropoda-20210317T1604/
    │   ├── GBOL/
    │   │   ├── GBOL_Dataset_Release-20210128/
    │   ├── NCBIGENBANK/
    │   │   ├── release245/
    │   │   │   ├── inv/
    │   │   │   ├── mam/

Examples:

## download co1 sequences for Orthoptera with sequences from GBOL
bundle exec ruby taxalogue.rb -t Orthoptera download --gbol

classify

If you already downloaded sequences download subcommand, you could classify these downloads without downloading it again. This is useful if you for example want another version of the database that only allow sequences that are determined until species level. Or you don't want to allow any Ns in the sequences. You could also try a different taxonomy.

If you already downloaded sequences for Arthropoda, you could also generate a subset of only Hymenoptera without having to download Hymenoptera again, since they are already available through the older download. This of course does only work if you already have downloades sequences for your specified taxon or for a higher taxon.

The results will be written into the results folder:

    ├── results/
    │   ├── Arthropoda-20210317T1604/

Examples:

## classify co1 sequences for Hymenoptera with the latest downloads from GBOL
bundle exec ruby taxalogue.rb -t Orthoptera classify --gbol

## classify co1 sequences for Hymenoptera with the latest downloads from GBOL and which are determined until the species rank
bundle exec ruby taxalogue.rb -t Hymenoptera classify --gbol filter --taxon_rank species

## classify co1 sequences for Arthropoda with the latest downloads from only GBOL and use the GBIF Bacbkbone Taxonomy to get accepted taxon names
bundle exec ruby taxalogue.rb -t Hymenoptera download --gbol taxonomy --gbif

## classify co1 sequences for Arthropoda with the download folder you specified
bundle exec ruby taxalogue.rb -t Arthropoda \
classify --gbol_dir /home/user/taxalogue/downloads/GBOL/GBOL_Dataset_Release-20210128 \
--bold_dir /home/user/taxalogue/downloads/BOLD/Arthropoda-20210902T1201 \
--genbank_dir /home/user/taxalogue/downloads/NCBIGENBANK/release245

output

    ├── results/                                         * All results will be written in this folder 
    │   ├── Orthoptera-20210903T1100/                    * This is the folder of your current request
    │   │   ├── contaminants/                            * By default possible inveterebrate contaminants are also downloaded
    │   │   │   ├── Wolbachia_output.tsv            
    │   │   │   ├── Wolbachia_output.fas
    │   │   │   ├── Homo_sapiens_output.tsv
    │   │   │   ├── Homo_sapiens_output.tsv
    │   │   ├── gbol_download_info.txt                   * Shows information about successes and failures during the GBOL download 
    │   │   ├── bold_download_info.txt                   * Shows information about successes and failures during the BOLD download
    │   │   ├── ncbi_genbank_download_info.txt           * Shows information about successes and failures during the GenBank download
    │   │   ├── taxalogue.txt                            * Consists of your specified parameters for this taxalogue run
    │   │   ├── Orthoptera_derep_all_output.tsv          * Dereplicated TSV file for Orthoptera with sequences from GBOL, BOLD and GenBank
    │   │   ├── Orthoptera_derep_all_output.fas          * Dereplicated fasta file for Orthoptera with sequences from GBOL, BOLD and GenBank
    │   │   ├── Orthoptera_derep_all_comparison.tsv      * Dereplicated TSV file for Orthoptera with sequences from GBOL, BOLD and GenBank
    │   │   ├── Orthoptera_derep_all_qiime2_taxonomy.txt
    │   │   ├── Orthoptera_derep_all_qiime2_taxonomy.fas
    │   │   ├── Orthoptera_derep_all_kraken2.fas
    │   │   ├── Orthoptera_derep_all_dada2_taxonomy.fas
    │   │   ├── Orthoptera_derep_all_dada2_species.fas
    │   │   ├── Orthoptera_derep_all_sintax.fas

Modifying the default config

If taxalogue is called without any options it will only use the default values and those that have been specified in the default_config.yaml file. The config file can be adopted to your preferences.

:taxon: Arthropoda
:taxon_rank: phylum
:markers: co1
:fast_run: false
:num_threads: 5
:num_cores: 5
:taxonomy:
  :ncbi: true
:derep:
  :last_common_ancestor: false
  :random: false
  :discard: false
  :no_derep: true
:output:
  :table: true
  :fasta: true
  :comparison: true
  :qiime2: false
  :kraken2: false
  :dada2_taxonomy: false
  :dada2_species: false
  :sintax: false

General options

--taxon

The taxa you are able to use depends on the used taxonomy. Taxa names have to be provided in their latinized form without authorship information. At the moment only taxa names for standard ranks are allowed (species, genus, family, order, class, phylum and kingdom). If the taxon is not available, check for any misspelings and if it belongs to the allowed ranks. Available taxa are listed at NCBI Taxonomy or GBIF.

--markers

Currently only co1. default: co1

--fast_run

Accellerates Taxon comparison. Turn it on with --fast_run true. default: false

--num_threads

Number of threads for downloads. default: 5

--num_cores

Number of cores for classification. default: 5

--version

Shows the used version

--help

Lists all general options and shows available subcommands. If --help is used after a subcommand than it will show all options for that subbcomand


More examples

Choose a taxon

## defaults to Arthropoda
bundle exec ruby taxalogue.rb

## choose Arthropoda explicitly
bundle exec ruby taxalogue.rb -t Arthropoda

## choose Insecta
bundle exec ruby taxalogue.rb -t Insecta

Choose a taxonomy

## print available options
bundle exec ruby taxalogue.rb taxonomy -h

## use GBIF backbone + additional datasets provided by GBIF for taxonomic mapping
bundle exec ruby taxalogue.rb -t Arthropoda classify --gbol taxonomy --gbif

## use only the GBIF Backbone taxonomy
bundle exec ruby taxalogue.rb -t Arthropoda classify --gbol taxonomy --gbif

## use NCBI Taxonomy, default
bundle exec ruby taxalogue.rb -t Arthropoda classify --gbol taxonomy --ncbi

## use GBIF Backbone and allow synonyms
bundle exec ruby taxalogue.rb -t Arthropoda classify --gbol taxonomy --gbif --allow_syonyms

## Disable taxonomic harmonization
bundle exec ruby taxalogue.rb -t Arthropoda classify --gbol taxonomy --unmapped

Filter sequences

## print available options
bundle exec ruby taxalogue.rb filter -h

## no Ns or gaps allowed
bundle exec ruby taxalogue.rb -t Arthropoda filter --max_N 0 --max_G 0

## sequences have a minimum length of 300 nucleotides
bundle exec ruby taxalogue.rb -t Arthropoda filter --min_length 300

## sequences are between 100 and 900 nucleotides long and have at maximum 9 Ns and 3 gaps
bundle exec ruby taxalogue.rb -t Arthropoda filter --min_length 100 --max_length 900 --max_N 9 --max_G 3

Choose sequences by countries or continents

If you want filter the sequences by countries or continents, then only sequences are considered that have this information present. Sequences without this information get discared

## print available options
bundle exec ruby taxalogue.rb region -h

## only sequences from Germany
bundle exec ruby taxalogue.rb -t Arthropoda region -c Germany

## only sequences from Germany, France and Austria
bundle exec ruby taxalogue.rb -t Arthropoda region -c "Germany;France;Austria"

## only sequences from North America
bundle exec ruby taxalogue.rb -t Arthropoda region -k "North America"

## only sequences from North America and Ecuador
bundle exec ruby taxalogue.rb -t Arthropoda region -k "North America" -c Ecuador

## only sequences from North America and Europe
bundle exec ruby taxalogue.rb -t Arthropoda region -k "North America;Europe"

Choose sequences by biogegraphic realms

If you want filter the sequences by biogeographic realms, then only sequences are considered that have latitude and longitude information. Sequences without this information get discarded

## show all available realms
bundle exec ruby taxalogue.rb -B

## sequences from the Palearctic
bundle exec ruby taxalogue.rb -t Arthropoda region -b Palearctic

## sequences from the Palearctic and Nearctic
bundle exec ruby taxalogue.rb -t Arthropoda region -b "Palearctic;Nearctic"

Choose sequences by terrestrial eco zones

If you want filter the sequences by terrestrial eco zones, then only sequences are considered that have latitude and longitude information. Sequences without this information get discared

## show all available terrestrial eco regions
bundle exec ruby taxalogue.rb -E

## sequences from Western European broadleaf forests 
bundle exec ruby taxalogue.rb -t Arthropoda region -e "Western European broadleaf forests"

## sequences from Zambezian coastal flooded savanna and Zambezian flooded grasslands
bundle exec ruby taxalogue.rb -t Arthropoda region -e "Zambezian coastal flooded savanna;Zambezian flooded grasslands"

Choose sequences by custom shape file

If a custom shape file should be used, you need to specify these 3 parameters:

  • --custom_shapefile => expects a ESRI shape file. It also expects files with .shx and .dbf extension to be in the same folder
  • --custom_shapefile_attribute => attribute name that should be used
  • --custom_shapefile_values => values of the attribute on which the filtering should be based

An example run could look like this:

bundle exec ruby taxalogue.rb region --custom_shapefile downloads/SHAPEFILES/fada_regions/fadaregions.shp --custom_shapefile_attribute name --custom_shapefile_values "Nearctic;Palaearctic"

Since the coordinates from the downloaded sequences are most likely based on the WGS84 coordinate reference system, the custom shapefiles are also expected to be based on WGS84