These tools were made for the Naturalis Galaxy instance with a main focus on
metabarcoding analysis.
Each tool either wraps an existing software package or uses a new script
written for the desired functionality.
Some inputs for these tools are specific to the Naturalis Galaxy instance and
depend on various other software packages used by Naturalis.
- Taxonomic accumulator
- Accepted taxonomic name
- Metadata
- Phyloseq visual reporter
- FastQC analysis
- PRINSEQ analysis
- PRINSEQ trimmer
- CutAdapt trimmer
- Read counter
- FastQ to fastA
Download and install the following software:
* Python3 (apt-get install python3)
* Python3 pip (apt-get install python3-pip)
* Python3 pandas (pip3 install pandas)
* Python3 xlrd (pip3 install xlrd)
* Python3 xlsxwriter (pip3 install xlsxwriter)
* CutAdapt (pip3 install cutadapt)
* PRINSEQ (https://sourceforge.net/projects/prinseq/files/)
* FastQC (https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc)
* R (apt-get install r-base)
* R required libraries (apt-get install libcurl4-gnutls-dev libssl-dev)
* R packages (biocLite("phyloseq") and biocLite("optparse"))
* Java (apt-get install default-jre)
* Perl JSON module (cpan JSON)
Make sure both PRINSEQ and FastQC are added to the system's PATH (CutAdapt should take care of that automatically).
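Whether the PATH is set up correctly can be checked with a short script. The sketch below is only an illustration: the executable names `prinseq-lite.pl`, `fastqc` and `cutadapt` are assumptions about a typical install and may need adjusting, and the helper `find_missing_tools` is hypothetical, not part of these tools.

```python
import shutil

# Executable names are assumptions; adjust them to match the local install.
REQUIRED_TOOLS = ["prinseq-lite.pl", "fastqc", "cutadapt"]


def find_missing_tools(tools, path=None):
    """Return the tools that cannot be found on the given PATH.

    With path=None, shutil.which searches the PATH of the current process.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it as the same user that runs Galaxy, since the PATH can differ between users.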
The galaxyXML files and the bashWrapper files should either be copied to the
Galaxy tool shed folder or be symbolically linked there.
The tool files should be reachable by the bashWrappers, either by being present
in the same folder or by adding the tool files to the system's PATH.
Edit the Galaxy tool_conf.xml file to add the tools to the Galaxy tool shed.
The read quality analysis tools need a small galaxy.yml adjustment to correctly
show their HTML output files. This adjustment concerns the "sanitize_all_html"
option, which should be set to false.
The TaxonomicAccumulator tool counts all occurrences of the identifications at every taxonomic level, for every file used as input.
The tool accepts a BLAST file, an OTU file with old BLAST output, an OTU file with new BLAST output, a zip file containing multiple BLAST files, or an OTU file with LCA processing added to it.
Sample names cannot start with a "#".
All columns in an OTU table should have a header starting with "#".
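The counting step can be sketched as follows. This is a minimal illustration, not the tool's actual code: it assumes each identification is a single string with ranks separated by "/" in the fixed order kingdom to species, which may differ from the real BLAST and OTU layouts. The names `RANKS` and `accumulate` are hypothetical.

```python
from collections import Counter

# Assumed rank order; the real tool's input layout may differ.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def accumulate(taxonomy_strings, separator="/"):
    """Count how often each name occurs at each taxonomic rank."""
    counts = {rank: Counter() for rank in RANKS}
    for taxonomy in taxonomy_strings:
        names = [name.strip() for name in taxonomy.split(separator)]
        # zip stops at the shorter list, so partial identifications
        # only contribute to the ranks they actually specify.
        for rank, name in zip(RANKS, names):
            if name:
                counts[rank][name] += 1
    return counts


identifications = [
    "Animalia / Chordata / Aves / Passeriformes / Paridae / Parus / Parus major",
    "Animalia / Chordata / Aves / Passeriformes / Paridae / Cyanistes / Cyanistes caeruleus",
    "Animalia / Arthropoda / Insecta",
]
counts = accumulate(identifications)
for rank in RANKS:
    print(rank, dict(counts[rank]))
```

To get one result per input file, as the tool does, call the function once for each file's identifications.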
The AcceptedTaxonomicName tool will utilize either the Global Names API or the Taxonomic Name Resolution Service API to collect accepted taxonomic names based on BLAST identifications.
Global Names covers every kingdom.
TNRS covers plants only.
Sample names cannot start with a "#".
All columns in an OTU table should have a header starting with "#".
The MetaData tool will utilize the Naturalis, BOLD and ALA APIs to collect metadata such as occurrence status and images based on BLAST identifications or accepted taxonomic names.
Definitions for all occurrence status codes can be found on this page.
Sample names cannot start with a "#".
All columns in an OTU table should have a header starting with "#".
The Statistical Analysis tool will utilize the Phyloseq R package to create multiple plots based on an OTU table.
Sample names cannot start with a "#".
All columns in an OTU table should have a header starting with "#".
The FastQC tool will do quality control checks on raw sequence data. These checks include summary graphs and tables.
Files in fastq format should always have a .fastq extension.
The PRINSEQ tool will do quality control checks on raw sequence data. These checks include summary graphs and tables.
Files in fasta format should always have a .fasta extension.
Files in fastq format should always have a .fastq extension.
The PRINSEQ tool will trim and discard reads and read sections based on user input and quality thresholds.
Files in fastq format should always have a .fastq extension.
The CutAdapt tool will trim and discard reads and read sections based on user input and quality thresholds.
Files in fastq format should always have a .fastq extension.
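To illustrate what trimming and discarding on a quality threshold means, here is a deliberately simplified sketch. It is not the algorithm used by PRINSEQ or CutAdapt: it removes bases from the 3' end while their quality is below a threshold, and discards the read if too little remains. Phred+33 quality encoding is assumed, and the function name and default values are hypothetical.

```python
def trim_3prime(sequence, quality, min_quality=20, min_length=1, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    Returns the trimmed (sequence, quality) pair, or None when the
    remaining read is shorter than min_length and should be discarded.
    Assumes Phred+33 encoded quality strings.
    """
    end = len(sequence)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quality[:end]


# "I" encodes quality 40 and "#" encodes quality 2 in Phred+33.
print(trim_3prime("ACGTAC", "IIII##"))
print(trim_3prime("ACGT", "####"))
```

The real tools offer many more options, such as adapter removal and sliding-window checks, which this sketch leaves out.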
The ReadCount tool will count the number of reads in a single file or in multiple zipped files and output these numbers to a text file.
Files in fasta format should always have a .fasta extension.
Files in fastq format should always have a .fastq extension.
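The read counting can be sketched in Python as below. This is an illustration under assumptions, not the tool's own code: FASTQ records are assumed to span exactly four lines (no wrapped sequences), and the functions `count_reads` and `count_reads_in_zip` are hypothetical. It also shows why the file extension matters, since the extension decides how the reads are counted.

```python
import io
import zipfile


def count_reads(lines, file_name):
    """Count reads in an iterable of text lines, based on the file extension."""
    if file_name.endswith(".fastq"):
        # FASTQ records span four lines, assuming sequences are not wrapped.
        non_empty = sum(1 for line in lines if line.strip())
        return non_empty // 4
    if file_name.endswith(".fasta"):
        # Every FASTA record starts with a ">" header line.
        return sum(1 for line in lines if line.startswith(">"))
    raise ValueError("Unsupported extension: " + file_name)


def count_reads_in_zip(zip_source):
    """Return {member name: read count} for the fasta/fastq files in a zip."""
    counts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith((".fasta", ".fastq")):
                with archive.open(name) as handle:
                    text = io.TextIOWrapper(handle, encoding="utf-8")
                    counts[name] = count_reads(text, name)
    return counts
```

`count_reads_in_zip` accepts a path or a file-like object, so it can be tried on an in-memory archive without touching the disk.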
The FastqToFasta tool will convert a single fastq file or multiple zipped fastq files to fasta files using sed.
Files in fastq format should always have a .fastq extension.
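The tool itself relies on sed; the Python sketch below only illustrates the same conversion idea and is not the tool's implementation. It assumes four-line FASTQ records and uses a hypothetical function name. Lines are selected by position rather than by their first character, because a quality line may itself start with "@".

```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines (without newlines) for four-line FASTQ records."""
    for index, line in enumerate(fastq_lines):
        position = index % 4
        if position == 0:
            # Header line: swap the leading "@" for ">".
            yield ">" + line.rstrip("\n")[1:]
        elif position == 1:
            # Sequence line: keep as-is.
            yield line.rstrip("\n")
        # Positions 2 (the "+" line) and 3 (qualities) are dropped.


fastq = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2 desc\n", "GG\n", "+\n", "@I\n"]
for fasta_line in fastq_to_fasta(fastq):
    print(fasta_line)
```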
- Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P,
Galaxy: A platform for interactive large-scale genome analysis.
Genome Research. 2005; 15(10): 1451-1455. doi: 10.1101/gr.4086505
(Galaxy)
- Python Software Foundation,
Python 3.7+. 2019.
(Python)
- Schmieder R, Edwards R,
Quality control and preprocessing of metagenomic datasets.
Bioinformatics. 2011; 27(6): 863-864. doi: 10.1093/bioinformatics/btr026
(PRINSEQ)
- Martin M,
Cutadapt Removes Adapter Sequences From High-throughput Sequencing Reads.
EMBnet.journal. 2011. doi: 10.14806/ej.17.1.200
(CutAdapt)
- McMurdie PJ, Holmes S,
Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.
PLOS One. 2013; 8(4). doi: 10.1371/journal.pone.0061217
(Phyloseq)
- Ratnasingham S, Hebert PDN,
BOLD: The Barcode of Life Data System.
Molecular Ecology Notes. 2007; 7(3). doi: 10.1111/j.1471-8286.2007.01678.x
(BOLD)
- Pyle RL,
Towards a Global Names Architecture: The future of indexing scientific names.
ZooKeys. 2016; 550: 261-281. doi: 10.3897/zookeys.550.10009
(Global Names)
- Boyle B, Hopkins N, Lu Z, Garay JAR, Mozzherin D, Rees T,
The taxonomic name resolution service: an online tool for automated standardization of plant names.
BMC Bioinformatics. 2013; 14(16). doi: 10.1186/1471-2105-14-16
(TNRS)
- Andrews S,
FastQC: A quality control tool for high throughput sequence data.
Babraham Bioinformatics. 2010.
(FastQC)
- Augspurger T, Ayd W, Bartak C, Battiston P, Cloud P, Garcia M,
Python Data Analysis Library.
(Pandas)
- Naturalis API website
- Nederlands Soortenregister website
- Atlas of Living Australia API website
- Boom J, galaxy-tools-naturalis-internship.
GitHub repository: https://github.com/JasperBoom/galaxy-tools-naturalis-internship
Copyright (C) 2018 Jasper Boom
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License version 3 as
published by the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.