qcflow-rnaseq

QC and trimming pipeline for RNA-seq reads.
The pipeline is based on the qcflow but includes newer software and is also specifically focused on RNAseq.

The pipeline runs the following:

FastQC for reads quality control
MultiQC for a prettier visualization of reads and alignments
Fastp for general reads statistics and trimming
Star or Hisat for alignment
RSeQC for alignment quality control
how_are_we_stranded_here for library strandedness
Biobloomtools for contamination screening

The pipeline is meant to be executed on the Pawsey-Setonix supercomputer and Nimbus VM. Please get in touch if you want more configurations.

Prepare bloom filters

Contamination screening relies on separate fasta files as input. The screening is executed using cds/cdna sequences from protein coding genes. You need to create a separte cds/fasta file for the sequences you want to check. Please check docs/cds_fast_download.md to find out how efficiently ypu can download fasta files by taxon. See runme_create-bloom.sh for example run.

Workflow options

A script with an example run can be found in runme.sh for local machines (i.e. Nimbus); the batch_scripts/run_main.sbatch contains a batch script for the Pawsey HPC.
The main script involves the following workflow options:

Option	Description	Software
genome-index	Generate genomic index	star,hisat
read-trimming	Trim input reads and output QC	fastp,fastqc,multiqc
reads-qc	Perform QC on input reads	fastqc,multiqc
reads-qc-cont	Perform QC and contamination screening on reads	fastqc,multiqc,biobloomtools
align	Align RNAseq short reads and output counts	star or hisat,multiqc,rnaseqc,featureCount
infer-strandedness	Infer RNAseq library preparation strandedness	how_are_we_stranded_here

Aligner	Description	Software
star	2 pass star aligner default	star
star-plants	2 pass star aligner tuned for plant genomes	star
star-snps	Star consensus aligner	star
hisat	Hisat aligner default	hisat
hisat-highmem	Hisat high-memory aligner	hisat

In order to see the full input options, please use the following

nextflow run -r main ccdmb/qcflow-rnaseq --help
...
Typical local pipeline command:

  nextflow run ./main.nf ...

Input/output options
  --workflow        [string]  Preferred pipeline to run (accepted: genome-index, reads-qc, reads-qc-cont, trim, align, infer-strandedness) [default:align]
  --aligner         [string]  Preferred aligner (accepted: star, star-snps, star-plants, hisat, hisat_highmem) [default: star]
  --output_dir      [string]  Output directory [default: results]
  --genome          [string]  Reference genome
  --genes           [string]  Reference annotation
  --fastq_dir       [string]  Input directory with fastq files

Fastp parameters
  --adapters        [string]  File containing adaters to trim [default: ./data/truseq_adapters.fasta]
  --min_read_length [number]  Minimum read length (-l in fastp) [default: 50.0]
  --qual_phred      [number]  Phred score used for trimming (-q in fastp) [default: 30.0]

STAR index
  --sjOverhang      [integer] sjDbOverhang value to use

Other parameters
  --snps            [string]  Vcf file (uncompressed) with snps to include to indexing. Must be the same regerence as genome
  --library_name    [string]  String intervals to extract library name (ex. LIB1_name - 1,4 extracts LIB1)
  --index_dir       [string]  Path to index (star or hisat)
  --hisat_prefix    [string]  Prefix of hisat index [default: hisat_index]
  --cdna            [string]  Fasta file with cDNA sequences [default: None]
  --strandedness    [string]  RNAseq library strandedness (accepted: RF, FR, unstranded) [default: RF]
  --bbt_filters     [string]  null [default: None]

genome-index

This command can be run with any of the aligner types. star and star-plants will generate a standard STAR index, star-snps generates consensus index. All the hisat aligner options generate the same type of index. Please consider your genome size and chose accordingly between hisat and hisat-highmem. Large genomes such as human or larger require the adequate memory so please use hisat-highmem.
Example run command

nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
  --workflow genome-index \
  --aligner [type aligner] \
  --genome "$PWD/genome/Morex_pseudomolecules_v2.fasta" \
  --genes "$PWD/genome/Morex.gtf" \
  --output_dir results

Output: returns the index directory in the same directory of the input genome.

reads-qc

Example run command

nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
  --workflow reads-qc \
  --output_dir results \
  --fastq_dir "$PWD/test/*R{1,2}.fastq.gz"

Output

results
   |-reads-qc
   |---multiqc_data

reads-qc-cont

Example run command

nextflow run -resume -profile local -r main ccdmb/qcflow-rnaseq \
  --workflow reads-qc-cont \
  --output_dir results \
  --fastq_dir "$PWD/test/*_R{1,2}.fastq.gz" \
  --bbt_filters "$PWD/biobloom-filters/*"

Output

results
   |-bbt-contamination
   |-reads-qc
   |---multiqc_data

trim

Example run command

nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
  --workflow trim \
  --output_dir results \
  --fastq_dir "$PWD/test/*R{1,2}.fastq.gz"

Output

results
   |-multi_qc-reads
   |---multiqc_data
   |-trimmed_reads

infer-strandedness

Example run command

nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
  --workflow infer-strandedness \
  --genome "$PWD/genome/Morex_pseudomolecules_v2.fasta" \
  --genes "$PWD/genome/Morex.gtf" \
  --output_dir results \
  --fastq_dir "$PWD/test/*R{1,2}.fastq.gz" \
  --cdna "$PWD/genome/Morex.cds.fasta"

Output

 results
   |-check_strandedness

align

Example run command

nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
  --workflow align \
  --aligner star \
  --genome "$PWD/genome/Morex_pseudomolecules_v2.fasta" \
  --genes "$PWD/genome/Morex.gtf" \
  --output_dir results \
  --library_name 1,6 \
  --index_dir "$PWD/index/Morex/" \
  --fastq_dir "$PWD/test/*R{1,2}.fastq.gz" \
  --strandedness RF

In case you are using hisat for alignment, please add the parameter --hisat_prefix [prefix_used].

Output

results
   |-align_rseqc
   |---bam_stat
   |---junction_annotation
   |-----sample
   |-alignements
   |---[aligner]_aligned
   |-----sample
   |-counts
   |-multi_qc-[aligner]
   |---multiqc_data

Multi library alignment

The aligner allows running multiple library alignments. Is samples are provided as NAMELIB1_file.fastq.gz, NAMELIB2_file.fastq.gz.... it's possible to align the reads by grouping them by sample name, provided by the user through the ---library_name option.
Example: If you want to run the samples as LIB1_LIB001.fastq.gz and LIB1_LIB002.fastq.gz together in the alignemnt, you can use ---library_name 1,4. The pipeline will use the characters from 1 to 4 (inclusive) to determine the samples that will be grouped together. If the user leaves the default option, each pair of reads will be processed separately, even if they belong to the same sample.

AAGI-AUS/qcflow-rnaseq

qcflow-rnaseq

Prepare bloom filters

Workflow options

genome-index

reads-qc

reads-qc-cont

trim

infer-strandedness

align

Multi library alignment