BCS_dada2

Old workflow for running amplicon sequencing bioinformatics on a cluster system

Primary language: R. License: GNU General Public License v3.0 (GPL-3.0).

BCS Pipeline Documentation

Introduction

This README serves as a virtual lab notebook documenting the scripts and the order in which they are run. The pipeline is built for use on a server that uses SLURM. I'm currently cleaning up this repo and improving the documentation, so expect things to be incomplete but improving.

The repo is being reorganized so that the scripts can be run in numeric order for simplicity. Expect some irrelevant bits that haven't been cleaned up yet.

Required software

  • flexbar
  • R 3.4.2 or later
    • DADA2
    • vegan
    • LULU OTU curation package
    • Plotly R package (properly configured for online use)
    • phyloseq
    • stringr
    • ggplot2
    • DESeq2
    • parallel
    • breakaway
  • RDP Classifier
    • Java
  • BLCA classifier
  • swarm2
  • VSEARCH
  • Drive5 Python scripts (place them in a folder named d5_py within this folder)
  • ...likely more. To be added soon.

Setup

Folder setup

Make a main project folder. Inside it, make a subdirectory for each run (e.g. BCS1, BCS2, BCS3, BCS4, etc). The raw reads (FASTQ or FASTQ.GZ) go into these subdirectories. Unzip any FASTQ.GZ files so that every folder holds the same file type. In the future this step shouldn't be necessary (.GZ files are preferable), but minor changes need to be made to the flexbar scripts to accommodate this.

The main project folder should also contain a tab-separated value file called key.txt that maps library number, adapter index number (sample number from the Illumina run), and primer tag number to ML ID numbers and other sample metadata.

Also in the main project folder, make another plaintext file called libraries.txt that contains the names of the run subdirectories (e.g. BCS1, BCS2, BCS3, etc), with one name on each line.
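The layout described above can be sketched as follows. This is a minimal example, not one of the repo's scripts: the project folder name (my_project) and the list of four runs are assumptions for illustration. key.txt is not generated here because its contents depend on your own sample metadata.

```shell
#!/usr/bin/env bash
# Sketch only: create the project layout described above.
# PROJECT_DIR and the run names are illustrative assumptions.
set -euo pipefail

PROJECT_DIR="${PROJECT_DIR:-my_project}"
RUNS=(BCS1 BCS2 BCS3 BCS4)

mkdir -p "$PROJECT_DIR"

# One subdirectory per sequencing run; raw reads go inside these.
for run in "${RUNS[@]}"; do
    mkdir -p "$PROJECT_DIR/$run"
done

# libraries.txt: one run subdirectory name per line.
printf '%s\n' "${RUNS[@]}" > "$PROJECT_DIR/libraries.txt"

# Decompress any gzipped reads so every run folder holds plain FASTQ files.
# If nothing matches, gunzip is never invoked.
find "$PROJECT_DIR" -type f -name '*.fastq.gz' -exec gunzip {} +
```

Copy the raw reads into the run subdirectories before the last step, or rerun the find command afterwards.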

Primer barcodes

In the main project folder, make a FASTA file (or multiple FASTA files if multiple barcoding schemes are used) containing the primer index barcodes. These scripts refer to it as ML_barcodes.fasta.

Adapter FASTA file

You will need to make a similar FASTA file containing all of the adapters that you want flexbar to trim. Here, it is called truseq_adapters.fasta.
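Because both FASTA files are written by hand, a quick sanity check before running flexbar may save a failed job. The helper below is a hypothetical sketch (check_fasta is not part of this repo). It only confirms that every header is followed by a sequence made of IUPAC nucleotide letters; it does not confirm that the barcodes or adapters are the right ones.

```shell
#!/usr/bin/env bash
# Sketch only: basic sanity check for a barcode or adapter FASTA file.
# check_fasta is a hypothetical helper, not part of this repo.
check_fasta() {
    awk '
        BEGIN { bad = 0; records = 0; seen_header = 0; has_seq = 0 }
        /^>/ {
            # The previous record ended without any sequence line.
            if (seen_header && !has_seq) { bad = 1 }
            seen_header = 1; has_seq = 0; records++
            next
        }
        /^[[:space:]]*$/ { next }   # ignore blank lines
        {
            # Sequence text before the first header is malformed.
            if (!seen_header) { bad = 1 }
            # Allow only IUPAC nucleotide codes.
            if ($0 !~ /^[ACGTUNRYKMSWBDHVacgtunrykmswbdhv]+$/) { bad = 1 }
            has_seq = 1
        }
        END {
            if (seen_header && !has_seq) { bad = 1 }
            if (records == 0) { bad = 1 }
            printf "%d records\n", records
            exit bad
        }
    ' "$1"
}

# Usage: check_fasta ML_barcodes.fasta && check_fasta truseq_adapters.fasta
```

The function prints the number of records and returns a non-zero status when the file looks malformed. Files with Windows line endings fail the check and may need converting first.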

Script setup

This repository can be cloned into its own folder inside of the main project folder (name it "scripts" or something). Inside of your scripts folder, make a subdirectory called out_err_files to contain log files and error logs.
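For context, a SLURM job script usually points its standard output and error logs at a folder like out_err_files through #SBATCH directives. The header below is a generic sketch, and whether the scripts in this repo use these exact names and patterns is an assumption. SLURM generally does not create a missing log directory, which is why out_err_files should exist before anything is submitted.

```shell
#!/usr/bin/env bash
# Sketch only: header for a job script submitted from inside the scripts folder.
# The job name and file patterns are illustrative, not copied from this repo.
#SBATCH --job-name=example_step
# %x expands to the job name and %j to the job ID.
#SBATCH --output=out_err_files/%x_%j.out
#SBATCH --error=out_err_files/%x_%j.err

# Outside SLURM these variables are unset, so fall back to placeholders.
job_label="${SLURM_JOB_NAME:-example_step}_${SLURM_JOB_ID:-local}"
echo "Starting ${job_label} in ${PWD}"
```

The log paths are relative, so by default they resolve against the directory the job is submitted from.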

Run Order

  1. make_file_root_list.sh
  2. flexbar.sh
  3. flexbar2.sh
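Since each step depends on the output of the previous one, the three scripts can be queued at once with SLURM job dependencies. The sketch below assumes that every script is submitted with sbatch and takes no arguments, which should be checked against the scripts themselves. submit_chain and the SUBMIT variable are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch only: submit steps so that each waits for the previous one.
# Assumes every script is submitted with sbatch and takes no arguments.

# SUBMIT can be overridden (for example with a stub) for dry runs.
SUBMIT="${SUBMIT:-sbatch}"

submit_chain() {
    local prev="" job script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            job=$("$SUBMIT" --parsable "$script") || return 1
        else
            # afterok: start only if the previous job finished with exit code 0.
            job=$("$SUBMIT" --parsable --dependency="afterok:$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep the job ID only.
        prev="${job%%;*}"
        echo "$script -> job $prev"
    done
}

# Usage: submit_chain make_file_root_list.sh flexbar.sh flexbar2.sh
```

With afterok, a failed step leaves the later jobs pending rather than running them on incomplete output, so check the queue and cancel them if a step fails.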

Cleanup

If any unassigned reads remain after running flexbar2.sh, move them into a new subdirectory called unassigned. The script should now do this automatically.
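If the cleanup ever needs to be done by hand, something like the following could work. It is a sketch under one loud assumption: that leftover files have "unassigned" somewhere in their names. Check the actual flexbar output names and pass a different pattern if needed. move_unassigned is a hypothetical helper, not part of this repo.

```shell
#!/usr/bin/env bash
# Sketch only: move leftover unassigned reads into an "unassigned" folder.
# The filename pattern is an assumption; adjust it to match your flexbar output.
move_unassigned() {
    local run_dir="$1"
    local pattern="${2:-*unassigned*}"
    local f
    mkdir -p "$run_dir/unassigned"
    for f in "$run_dir"/$pattern; do
        # Skip an unmatched glob and the destination folder itself.
        [ -f "$f" ] || continue
        mv -- "$f" "$run_dir/unassigned/"
    done
}

# Apply to every run listed in libraries.txt (run from the main project folder).
if [ -f libraries.txt ]; then
    while IFS= read -r run; do
        if [ -n "$run" ]; then
            move_unassigned "$run"
        fi
    done < libraries.txt
fi
```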

R Analysis with DADA2 and RDP

Description

The phyloseq.sh and BCS_phyloseq.R scripts run community analyses and visualization (based on the Plotly tool). You may need to make major changes to this section for your own analysis or configure Plotly if you want to use the existing visualizations. For some reason, the RDP implementation in DADA2 doesn't perform well for us, possibly due to memory or scaling issues. We run RDP outside of R/DADA2 to get around this.
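To illustrate what running RDP outside of R can look like, here is a minimal sketch of a standalone RDP Classifier call with an explicit Java heap size. It is not the contents of RDP_classify.sh. The jar path, heap size, file names, and the rdp_classify helper are all assumptions for this example, and the training set is left at the classifier's default.

```shell
#!/usr/bin/env bash
# Sketch only: classify a FASTA file with the standalone RDP Classifier.
# RDP_JAR, the heap size, and the file names are assumptions for illustration.
RDP_JAR="${RDP_JAR:-/path/to/classifier.jar}"
JAVA_BIN="${JAVA_BIN:-java}"

rdp_classify() {
    local input_fasta="$1" output_txt="$2" heap="${3:-4g}"
    # -Xmx sets the Java heap; -o names the per-sequence assignment file.
    "$JAVA_BIN" "-Xmx${heap}" -jar "$RDP_JAR" classify -o "$output_txt" "$input_fasta"
}

# Usage: rdp_classify seqs.fasta rdp_assignments.txt 8g
```

Setting the heap explicitly keeps the classifier's memory use within what the SLURM job requested.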

Setup

R scripts can be placed in the same folder as the other scripts (the scripts folder inside the main project folder, alongside the run subdirectories). Code for renaming files generated by flexbar2.sh is kept here as well, along with the R code for running the DADA2 analysis.

Run Order

  1. rename.sh
  2. filtertrim.sh
  3. removechim.sh
  4. RDP_classify.sh
  5. phyloseq.sh