/snakemake-barcoding-assembly-pipeline

Snakemake Pipeline for processing COI sequencing data

Primary LanguagePythonMIT LicenseMIT

Snakemake Barcoding Assembly Pipeline

Pipeline for processing Illumina sequencing data consisting of COI PCR amplicons .

  1. Trims adapters and bases below <20 quality score BBDuk
  2. Assembles trimmed reads SPAdes
  3. Detects and extracts target contigs
  4. Alignssequences to COI reference (Chrysomya putoria (NCBI accession number NC002697) to correct 5'-3' orientation Mafft
  5. Submit Sequences to BOLD for Identification Bold Retriever

Prerequisites

Installing Miniconda + Snakemake

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda install -c bioconda -c conda-forge snakemake

Getting Started

Within a working directory:

  • Copy Snakefile and pipeline_files directory into working directory
  • Create a folder named "fastq" that contains Illumina based raw reads in fastq.gz format

After initializing a conda enviroment containing snakemake, pipeline can be invoked from within working directory

snakemake --use-conda -k

A seperate pipeline was created to generate consensus sequences from reads mapped to a COI reference gene

snakemake -s barcoding_snakefile --use-conda -k

Built With

  • Python - Programming language
  • Conda - Package, dependency and environment management
  • Snakemake - Workflow management system
  • BioPython - Tools for biological computation
  • Mafft - Multiple sequence alignment
  • Bold Retriever - Automated BOLD Submission
  • BBTools - Adaptor and quality trimming
  • SPAdes - De Novo short read assembler

Copyright

Government of Canada, Agriculture & Agri-Food Canada

License

This project is licensed under the MIT License - see the LICENSE file for details

Citations

  • Snakemake
    Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.

  • Bold Retriever
    Vesterinen, E. J., Ruokolainen, L., Wahlberg, N., Peña, C., Roslin, T., Laine, V. N., Vasko, V., Sääksjärvi, I. E., Norrdahl, K., and Lilley, T. M. (2016) What you need is what you eat? Prey selection by the bat Myotis daubentonii. Molecular Ecology, 25(7), 1581–1594. doi:10.1111/mec.13564

  • Mafft
    Nakamura, Yamada, Tomii, Katoh 2018 (Bioinformatics 34:2490–2492) Parallelization of MAFFT for large-scale multiple sequence alignments. (describes MPI parallelization of accurate progressive options)

  • SPAdes
    Nurk S. et al. (2013) Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads. In: Deng M., Jiang R., Sun F., Zhang X. (eds) Research in Computational Molecular Biology. RECOMB 2013. Lecture Notes in Computer Science, vol 7821. Springer, Berlin, Heidelberg

  • BBTools
    Brian-JGI (2018) BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data.https://jgi.doe.gov/data-and-tools/bbtools/

  • FASTQC
    Andrews S. (2018). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

  • BOLD
    Ratnasingham, S. & Hebert, P. D. N. (2007). BOLD : The Barcode of Life Data System (www.barcodinglife.org). Molecular Ecology Notes 7, 355–364. DOI: 10.1111/j.1471-8286.2006.01678.x