Pipeline for processing Illumina sequencing data consisting of COI PCR amplicons.
- Trims adapters and bases with a quality score below 20 (BBDuk)
- Assembles trimmed reads (SPAdes)
- Detects and extracts target contigs
- Aligns sequences to a COI reference (Chrysomya putoria, NCBI accession number NC002697) to correct 5'-3' orientation (Mafft)
- Submits sequences to BOLD for identification (Bold Retriever)
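To illustrate what correcting 5'-3' orientation means, here is a minimal Python sketch. It is not the pipeline's method: the pipeline aligns against the reference with Mafft, whereas this stand-in simply counts k-mers shared with the reference in each orientation. The function names, the k-mer size and the toy sequences are all assumptions made for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def kmers(seq, k):
    """Return the set of all k-length substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_to_reference(contig, reference, k=8):
    """Return contig in whichever orientation shares more k-mers with reference.

    Simplified heuristic for illustration only; the pipeline itself relies on
    a Mafft alignment against the COI reference.
    """
    ref_kmers = kmers(reference, k)
    forward_hits = len(kmers(contig, k) & ref_kmers)
    flipped = reverse_complement(contig)
    reverse_hits = len(kmers(flipped, k) & ref_kmers)
    return flipped if reverse_hits > forward_hits else contig


# Toy sequences, not real COI data.
reference = "ATGGCGTACCTTGACGGATCCAAGTTCGA"
contig = reverse_complement(reference[5:25])  # contig assembled on the opposite strand
oriented = orient_to_reference(contig, reference)
print(oriented == reference[5:25])
```

A contig that already matches the reference strand is returned unchanged, so the function can be applied to every extracted contig without checking orientation first.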
Installing Miniconda + Snakemake
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda install -c bioconda -c conda-forge snakemake
Within a working directory:
- Copy the Snakefile and the pipeline_files directory into the working directory
- Create a folder named "fastq" that contains the raw Illumina reads in fastq.gz format
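Before invoking the pipeline, it can help to confirm that the working directory matches the layout described above. The following is a minimal sketch of such a check; the function name `check_workdir` and the `*.fastq.gz` file pattern are assumptions, since the exact file names the Snakefile expects are not stated here.

```python
from pathlib import Path
import tempfile


def check_workdir(workdir):
    """Return a list of human-readable problems with the working directory layout."""
    root = Path(workdir)
    problems = []
    if not (root / "Snakefile").is_file():
        problems.append("missing Snakefile")
    if not (root / "pipeline_files").is_dir():
        problems.append("missing pipeline_files directory")
    fastq_dir = root / "fastq"
    if not fastq_dir.is_dir():
        problems.append("missing fastq directory")
    elif not any(fastq_dir.glob("*.fastq.gz")):  # assumed naming pattern
        problems.append("fastq directory contains no .fastq.gz files")
    return problems


# Demonstration on a throwaway directory (the sample file name is made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = check_workdir(root)  # empty directory: everything is missing
    (root / "Snakefile").touch()
    (root / "pipeline_files").mkdir()
    (root / "fastq").mkdir()
    (root / "fastq" / "sample1_R1.fastq.gz").touch()
    after = check_workdir(root)  # layout now complete

print(before)
print(after)
```

Calling `check_workdir(".")` from the working directory returns an empty list when the layout is complete.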
After activating a conda environment containing Snakemake, the pipeline can be invoked from within the working directory:
snakemake --use-conda -k
A separate pipeline was created to generate consensus sequences from reads mapped to a COI reference gene:
snakemake -s barcoding_snakefile --use-conda -k
- Python - Programming language
- Conda - Package, dependency and environment management
- Snakemake - Workflow management system
- BioPython - Tools for biological computation
- Mafft - Multiple sequence alignment
- Bold Retriever - Automated BOLD Submission
- BBTools - Adapter and quality trimming
- SPAdes - De novo short-read assembler
Government of Canada, Agriculture & Agri-Food Canada
This project is licensed under the MIT License - see the LICENSE file for details
- Snakemake
  Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.
- Bold Retriever
  Vesterinen, E. J., Ruokolainen, L., Wahlberg, N., Peña, C., Roslin, T., Laine, V. N., Vasko, V., Sääksjärvi, I. E., Norrdahl, K., and Lilley, T. M. (2016) What you need is what you eat? Prey selection by the bat Myotis daubentonii. Molecular Ecology, 25(7), 1581–1594. doi:10.1111/mec.13564
- Mafft
  Nakamura, Yamada, Tomii, Katoh 2018 (Bioinformatics 34:2490–2492) Parallelization of MAFFT for large-scale multiple sequence alignments. (describes MPI parallelization of accurate progressive options)
- SPAdes
  Nurk S. et al. (2013) Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads. In: Deng M., Jiang R., Sun F., Zhang X. (eds) Research in Computational Molecular Biology. RECOMB 2013. Lecture Notes in Computer Science, vol 7821. Springer, Berlin, Heidelberg
- BBTools
  Brian-JGI (2018) BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data. https://jgi.doe.gov/data-and-tools/bbtools/
- FASTQC
  Andrews S. (2018). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
- BOLD
  Ratnasingham, S. & Hebert, P. D. N. (2007). BOLD: The Barcode of Life Data System (www.barcodinglife.org). Molecular Ecology Notes 7, 355–364. DOI: 10.1111/j.1471-8286.2006.01678.x