Clean Paired Illumina Reads Workflow

A schematic of the steps in the workflow.

Requirements

Nextflow
Docker or Singularity
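
A quick preflight check can confirm that these tools are on the PATH before launching. The sketch below is an illustration, not part of the workflow; the helper names `have_cmd` and `check_requirements` are hypothetical.

```shell
# Hypothetical preflight check; not shipped with the workflow.

# Return 0 if the named command is found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Require Nextflow plus at least one container engine.
check_requirements() {
  if ! have_cmd nextflow; then
    echo "missing: nextflow" >&2
    return 1
  fi
  if ! have_cmd docker && ! have_cmd singularity; then
    echo "missing: docker or singularity" >&2
    return 1
  fi
  echo "requirements found"
}
```

Calling `check_requirements` returns a non-zero status and names the missing tool if anything is absent.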

Install

git clone https://github.com/gregorysprenger/wf-paired-illumina-read-clean.git

Run Workflow

Example data are included in the assets/test_data directory.

nextflow run \
-profile singularity main.nf \
--inpath assets/test_data \
--outpath results

Test data were generated by taking the first 1 million lines (250,000 reads) of each FASTQ file from SRA run SRR16343585. (Note: This requires the SRA Toolkit.)

fasterq-dump SRR16343585
head -n 1000000 SRR16343585_1.fastq > test_R1.fastq
head -n 1000000 SRR16343585_2.fastq > test_R2.fastq
gzip test_R*.fastq
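
Because a FASTQ record spans four lines, truncating by line count only yields whole reads when the count is a multiple of 4. The following sketch, which assumes standard four-line records with no wrapped sequence lines, checks that and reports the read count. The function name `count_fastq_reads` is hypothetical.

```shell
# Hypothetical helper: count reads in a gzipped FASTQ file.
# Assumes standard four-line FASTQ records (no wrapped sequence lines).
count_fastq_reads() {
  n_lines=$(gzip -dc "$1" | wc -l)
  if [ $((n_lines % 4)) -ne 0 ]; then
    echo "error: $1 has $n_lines lines, not a multiple of 4" >&2
    return 1
  fi
  echo $((n_lines / 4))
}
```

For example, `count_fastq_reads test_R1.fastq.gz` should report the same number for R1 and R2 if the pair was truncated consistently.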

For Aspen Cluster - Set up Singularity PATH

# Add to $HOME/.bashrc
SINGULARITY_BASE=/scicomp/scratch/$USER
export SINGULARITY_TMPDIR=$SINGULARITY_BASE/singularity.tmp
export SINGULARITY_CACHEDIR=$SINGULARITY_BASE/singularity.cache
export NXF_SINGULARITY_CACHEDIR=$SINGULARITY_BASE/singularity.cache
mkdir -pv "$SINGULARITY_TMPDIR" "$SINGULARITY_CACHEDIR"

Reload .bashrc

source ~/.bashrc

Load Nextflow

module load nextflow

Steps in the workflow

Identify paired FASTQ files in a given path
- Recognized extensions are: fastq.gz, fq.gz
Remove PhiX from reads using BBDuk
- Output:
  - Total reads <*_raw.tsv>
  - PhiX reads <*_phix.tsv>
Adapter clipping and quality trimming using Trimmomatic
- Output:
  - Discarded reads and singletons <*_trimmo.tsv>
Merge overlapping sister reads into singleton reads using FLASH
- Output:
  - Paired and single reads <*{R1,R2}.paired.fq.gz>, <*single.fq.gz>
  - Number of overlapping reads <*overlap.tsv>
  - Number of cleaned reads <*clean-reads.tsv>
Binning of paired reads with Kraken 1 and Kraken 2
- Output:
  - Summary output <taxonomy{1,2}-reads.tab>
  - Full kraken output <kraken{1,2}.tab.gz>
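
To preview which files the first step might treat as pairs, the sketch below lists R1/R2 mates in a directory using the two recognized extensions. It assumes mates are named `<sample>_R1.<ext>` and `<sample>_R2.<ext>`; the workflow's actual matching rules may accept other patterns, and the function name `list_fastq_pairs` is hypothetical.

```shell
# Hypothetical sketch: list R1/R2 FASTQ pairs found in a directory.
# Assumes mates are named <sample>_R1.<ext> and <sample>_R2.<ext>;
# the workflow's real matching rules may differ.
list_fastq_pairs() {
  dir=$1
  for ext in fastq.gz fq.gz; do
    for r1 in "$dir"/*_R1."$ext"; do
      [ -e "$r1" ] || continue        # glob matched nothing
      r2="${r1%_R1.$ext}_R2.$ext"     # swap the R1 suffix for R2
      if [ -e "$r2" ]; then
        printf '%s\t%s\n' "$r1" "$r2"
      else
        echo "warning: no R2 mate for $r1" >&2
      fi
    done
  done
}
```

Running `list_fastq_pairs assets/test_data` prints one tab-separated pair per line and warns on stderr about any R1 file without a mate.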