QC and trimming pipeline for RNA-seq reads.
The pipeline is based on the qcflow but includes newer software and is also specifically focused on RNAseq.
The pipeline runs the following:
- FastQC for reads quality control
- MultiQC for a prettier visualization of reads and alignments
- Fastp for general reads statistics and trimming
- Star or Hisat for alignment
- RSeQC for alignment quality control
- how_are_we_stranded_here for library strandedness
- Biobloomtools for contamination screening
The pipeline is meant to be executed on the Pawsey-Setonix supercomputer and Nimbus VM. Please get in touch if you want more configurations.
Contamination screening relies on separate fasta files as input. The screening is executed using cds/cdna sequences from protein coding genes. You need to create a separte cds/fasta file for the sequences you want to check. Please check docs/cds_fast_download.md
to find out how efficiently ypu can download fasta files by taxon. See runme_create-bloom.sh
for example run.
A script with an example run can be found in runme.sh
for local machines (i.e. Nimbus); the batch_scripts/run_main.sbatch
contains a batch script for the Pawsey HPC.
The main script involves the following workflow options:
Option | Description | Software |
---|---|---|
genome-index | Generate genomic index | star,hisat |
read-trimming | Trim input reads and output QC | fastp,fastqc,multiqc |
reads-qc | Perform QC on input reads | fastqc,multiqc |
reads-qc-cont | Perform QC and contamination screening on reads | fastqc,multiqc,biobloomtools |
align | Align RNAseq short reads and output counts | star or hisat,multiqc,rnaseqc,featureCount |
infer-strandedness | Infer RNAseq library preparation strandedness | how_are_we_stranded_here |
Aligner | Description | Software |
---|---|---|
star | 2 pass star aligner default | star |
star-plants | 2 pass star aligner tuned for plant genomes | star |
star-snps | Star consensus aligner | star |
hisat | Hisat aligner default | hisat |
hisat-highmem | Hisat high-memory aligner | hisat |
In order to see the full input options, please use the following
nextflow run -r main ccdmb/qcflow-rnaseq --help
...
Typical local pipeline command:
nextflow run ./main.nf ...
Input/output options
--workflow [string] Preferred pipeline to run (accepted: genome-index, reads-qc, reads-qc-cont, trim, align, infer-strandedness) [default:align]
--aligner [string] Preferred aligner (accepted: star, star-snps, star-plants, hisat, hisat_highmem) [default: star]
--output_dir [string] Output directory [default: results]
--genome [string] Reference genome
--genes [string] Reference annotation
--fastq_dir [string] Input directory with fastq files
Fastp parameters
--adapters [string] File containing adaters to trim [default: ./data/truseq_adapters.fasta]
--min_read_length [number] Minimum read length (-l in fastp) [default: 50.0]
--qual_phred [number] Phred score used for trimming (-q in fastp) [default: 30.0]
STAR index
--sjOverhang [integer] sjDbOverhang value to use
Other parameters
--snps [string] Vcf file (uncompressed) with snps to include to indexing. Must be the same regerence as genome
--library_name [string] String intervals to extract library name (ex. LIB1_name - 1,4 extracts LIB1)
--index_dir [string] Path to index (star or hisat)
--hisat_prefix [string] Prefix of hisat index [default: hisat_index]
--cdna [string] Fasta file with cDNA sequences [default: None]
--strandedness [string] RNAseq library strandedness (accepted: RF, FR, unstranded) [default: RF]
--bbt_filters [string] null [default: None]
This command can be run with any of the aligner types. star
and star-plants
will generate a standard STAR index, star-snps
generates consensus index. All the hisat aligner options generate the same type of index. Please consider your genome size and chose accordingly between hisat
and hisat-highmem
. Large genomes such as human or larger require the adequate memory so please use hisat-highmem
.
Example run command
nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
--workflow genome-index \
--aligner [type aligner] \
--genome "$PWD/genome/Morex_pseudomolecules_v2.fasta" \
--genes "$PWD/genome/Morex.gtf" \
--output_dir results
Output: returns the index directory in the same directory of the input genome.
Example run command
nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
--workflow reads-qc \
--output_dir results \
--fastq_dir "$PWD/test/*R{1,2}.fastq.gz"
Output
results
|-reads-qc
|---multiqc_data
Example run command
nextflow run -resume -profile local -r main ccdmb/qcflow-rnaseq \
--workflow reads-qc-cont \
--output_dir results \
--fastq_dir "$PWD/test/*_R{1,2}.fastq.gz" \
--bbt_filters "$PWD/biobloom-filters/*"
Output
results
|-bbt-contamination
|-reads-qc
|---multiqc_data
Example run command
nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
--workflow trim \
--output_dir results \
--fastq_dir "$PWD/test/*R{1,2}.fastq.gz"
Output
results
|-multi_qc-reads
|---multiqc_data
|-trimmed_reads
Example run command
nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
--workflow infer-strandedness \
--genome "$PWD/genome/Morex_pseudomolecules_v2.fasta" \
--genes "$PWD/genome/Morex.gtf" \
--output_dir results \
--fastq_dir "$PWD/test/*R{1,2}.fastq.gz" \
--cdna "$PWD/genome/Morex.cds.fasta"
Output
results
|-check_strandedness
Example run command
nextflow run -resume -profile singularity,pawsey_setonix -r main ccdmb/qcflow-rnaseq \
--workflow align \
--aligner star \
--genome "$PWD/genome/Morex_pseudomolecules_v2.fasta" \
--genes "$PWD/genome/Morex.gtf" \
--output_dir results \
--library_name 1,6 \
--index_dir "$PWD/index/Morex/" \
--fastq_dir "$PWD/test/*R{1,2}.fastq.gz" \
--strandedness RF
In case you are using hisat for alignment, please add the parameter --hisat_prefix [prefix_used]
.
Output
results
|-align_rseqc
|---bam_stat
|---junction_annotation
|-----sample
|-alignements
|---[aligner]_aligned
|-----sample
|-counts
|-multi_qc-[aligner]
|---multiqc_data
The aligner allows running multiple library alignments. Is samples are provided as NAMELIB1_file.fastq.gz
, NAMELIB2_file.fastq.gz
.... it's possible to align the reads by grouping them by sample name, provided by the user through the ---library_name
option.
Example: If you want to run the samples as LIB1_LIB001.fastq.gz
and LIB1_LIB002.fastq.gz
together in the alignemnt, you can use ---library_name 1,4
. The pipeline will use the characters from 1 to 4 (inclusive) to determine the samples that will be grouped together. If the user leaves the default option, each pair of reads will be processed separately, even if they belong to the same sample.