This is the main repository for the South Australian Genomics Consortium's RNA
sequencing quantification pipeline, implemented in Nextflow.
The pipeline involves the following processes:
- BCL2fastq: Convert Illumina BCL files to FastQ
- STAR: Align RNA-seq reads to a reference genome using the STAR aligner
- Deduplication: Remove PCR duplicates from aligned BAMs
- FeatureCounts: Quantify reads against genomic features (genes, exons, promoters, etc.)
As this is a Nextflow pipeline, Nextflow needs to be installed on your system.
Installation instructions are available in the Nextflow documentation. The basic
steps are as follows:
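Check the Java version. It needs to be Java 8 or later (up to Java 15). On Java 8 the flag is -version with a single dash, because --version was only added in Java 9.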
$ java --version
java 13.0.1 2019-10-15
Java(TM) SE Runtime Environment (build 13.0.1+9)
Java HotSpot(TM) 64-Bit Server VM (build 13.0.1+9, mixed mode, sharing)
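The version check above can also be scripted. The following is a minimal sketch, not part of the pipeline: the function names `java_major` and `java_supported` are hypothetical, and the 8 to 15 range is simply the one stated in this README. It handles both the legacy `1.8.x` numbering used by Java 8 and the modern `13.0.1` style.

```shell
#!/usr/bin/env bash
# Sketch only: java_major and java_supported are illustrative helper names.

# Print the major Java version for a version string such as "1.8.0_292" or "13.0.1".
java_major() {
    local version="$1"
    local major="${version%%.*}"          # text before the first dot
    if [ "$major" = "1" ]; then           # legacy scheme: 1.8.x means Java 8
        major="${version#1.}"
        major="${major%%.*}"
    fi
    printf '%s\n' "${major%%[!0-9]*}"     # drop any non-numeric suffix such as "-ea"
}

# Succeed when the major version lies in the range this README states (8 to 15).
java_supported() {
    local major
    major="$(java_major "$1")"
    [ -n "$major" ] && [ "$major" -ge 8 ] && [ "$major" -le 15 ]
}

# Example using the version string from the output shown above.
if java_supported "13.0.1"; then
    echo "Java 13.0.1 is in the supported range"
fi
```

To check the installed Java, the version string would need to be pulled out of the `java -version` output first (it is printed in double quotes on the first line, on standard error); that step is left out here so the sketch runs without Java present.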
Install the Nextflow executable. It can be installed wherever you like; just ensure that the binary is moved
to somewhere in your $PATH:
$ wget -qO- https://get.nextflow.io | bash
$ mv nextflow /usr/bin # For example
All software in the pipeline is configured using conda, meaning you simply need
to clone the repository to your system.
$ cd <path>/software
$ git clone --recurse-submodules https://github.com/sagc-bioinformatics/nf-rnaseq-sagc.git
The pipeline is invoked using the following command
$ nextflow run <path>/nf-rnaseq-sagc/main.nf <arguments>
To obtain a help message, simply run the following
$ nextflow run <path>/nf-rnaseq-sagc/main.nf --help
The help page is shown at the bottom of this README.
Below is a simple script that invokes the pipeline for a single-end dataset generated by the SAGC:
#!/usr/bin/env bash
PIPE="<path>/nf-rnaseq-sagc"
nextflow run ${PIPE}/main.nf \
-profile conda,slurm \
--library_type single \
--path_bcl '<path/to/bcl/dir>' \
--index single \
--genome HG38 \
--outdir output-single-end \
--email user.name@sahmri.com \
--partition sahmri_prod_hpc \
--node_list edp-prd-lin-hpc05,edp-prd-lin-hpc06 \
-resume
This script can then be run in the background on the HPC (e.g. using screen) and will handle:
- Installing the pipeline software using Conda (list of software used: conda.yml)
- SLURM job submission
This pipeline was written with sequence data from the SAGC Illumina machines in mind, but it also accepts custom user data in FastQ format. For custom datasets, pass a simple CSV file that lists the directory path and the file basename of each sample. An example CSV is below:
path,name
/path/to/dir,sampleA
/path/to/dir,sampleB
/different/path,sampleC
/different/path/again,sampleD
where samples A and B would have the corresponding FastQ files:
sampleA_R1.fastq.gz
sampleA_R2.fastq.gz
sampleB_R1.fastq.gz
sampleB_R2.fastq.gz
The pipeline builds a file-matching pattern from the columns in the CSV, together with the other arguments specific to user data, and uses it to locate the FastQ files for analysis. See the help page for the arguments relating to non-SAGC sequence files.
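To make the naming convention concrete, here is a minimal sketch that lists the FastQ files a samplesheet implies. It is an illustration only, not the pipeline's own matching logic: the function name `expected_fastqs` is hypothetical, and it assumes paired-end files named `<name>_R1.fastq.gz` and `<name>_R2.fastq.gz`, as in the example above.

```shell
#!/usr/bin/env bash
# Sketch only: expected_fastqs is an illustrative helper, not part of the pipeline.

# Print the paired-end FastQ paths implied by a two-column samplesheet (path,name).
expected_fastqs() {
    local sheet="$1"
    local dir name read_no
    # Skip the header row, then split each record on the comma.
    # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
    tail -n +2 "$sheet" | while IFS=, read -r dir name || [ -n "$dir" ]; do
        [ -z "$dir" ] && continue            # ignore blank lines
        for read_no in 1 2; do
            printf '%s/%s_R%s.fastq.gz\n' "$dir" "$name" "$read_no"
        done
    done
}

# Example: a temporary samplesheet holding two samples from the same directory.
sheet="$(mktemp)"
printf 'path,name\n/path/to/dir,sampleA\n/path/to/dir,sampleB\n' > "$sheet"
expected_fastqs "$sheet"
rm -f "$sheet"
```

Comparing this listing against the files actually on disk is a quick way to catch basename typos or case mismatches before launching a run.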
Below is an example script for custom FastQ data:
#!/usr/bin/env bash
PIPE="<path>/nf-rnaseq-sagc"
nextflow run ${PIPE}/main.nf \
-profile conda,slurm \
--library_type paired \
--library_ext '*_R{1,2}.fastq.gz' \
--samplesheet '<path/to/samplesheet.csv>' \
--index dual \
--genome HG38 \
--outdir output-paired-end \
--email user.name@sahmri.com \
--partition sahmri_prod_hpc \
--node_list edp-prd-lin-hpc05,edp-prd-lin-hpc06 \
-resume
Here, samplesheet.csv is structured like the example above.
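For a directory that already holds FastQ files, the samplesheet can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the function name `make_samplesheet` is hypothetical, the files are assumed to end in `_R1.fastq.gz`, and the `path,name` header follows the example CSV earlier in this README (the help page below labels the second column `basename`, so check which header your version of the pipeline expects).

```shell
#!/usr/bin/env bash
# Sketch only: make_samplesheet is an illustrative helper, not part of the pipeline.

# Write a two-column samplesheet (path,name) with one row per *_R1.fastq.gz file.
make_samplesheet() {
    local dir="$1"
    local r1 name
    echo "path,name"
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue             # no matches: the glob stays literal
        name="$(basename "$r1" _R1.fastq.gz)" # strip the directory and the read suffix
        printf '%s,%s\n' "$dir" "$name"
    done
}
```

A typical call would redirect the output to a file, for example `make_samplesheet /path/to/dir > samplesheet.csv`, and then pass that file to `--samplesheet`.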
==============================================================
SAGC RNA-SEQ NEXTFLOW PIPELINE 0.0.1
==============================================================
A pipeline for RNA-seq quantification via alignment methods.
* STAR: Splice aware alignment
* Umi-tools/Sambamba: BAM deduplication
* Subread FeatureCounts: Quantification
Nextflow Arguments:
-profile <str> Which Nextflow profile to use: SHOULD always be 'conda,slurm'
Arguments: These are mandatory for SAGC datasets
--library_type <str> String indicating 'paired' or 'single' end data (valid: 'paired', 'single')
--index <str> Type of indexing (valid: 'single', 'dual', 'umi')
--path_bcl <str> Directory path to BCL file for sequencing run of interest
--genome <str> Which genome release to use. (Valid: HG38, GRCH37)
--outdir <str> Directory path to output directory. Will be created if it doesn't exist already
--email <str> Your SAHMRI email
--partition <str> SAHMRI HPC partition to use (valid: 'sahmri_prod_hpc', 'sahmri_cancer_hpc')
--node_list <str> SAHMRI HPC nodes to use (valid: sahmri_prod_hpc=edp-prd-lin-hpc0{1,6}, sahmri_cancer_hpc=edp-prd-lin-hpc0{7,8})
Optional Arguments (user provided FastQ data):
--library_ext <str> Regular expression string to match sequence files (E.g. '/path/to/files/*_R{1,2}.fastq.gz')
--samplesheet <str> Path to custom sample sheet in Illumina format (if data is from BCL) or two column CSV; path and file base-name
--umi_ext <str> Regular expression string to match UMI files (E.g. '/path/to/files/*_I1.fastq.gz')
Optional Software Arguments (general):
--fastp_optional_args <str> Quoted string of optional arguments to pass to FastP
--umitools_optional_args <str> Quoted string of optional arguments to pass to Umi-tools
--star_optional_args <str> Quoted string of optional arguments to pass to STAR aligner
--featurecounts_optional_args <str> Quoted string of optional arguments to pass to FeatureCounts
The 'Optional Arguments' should only be provided if the sequence data hasn't been generated by the SAGC.
If the data is custom (i.e. not being created from the BCL files), the '--library_ext' argument is used
to pattern match all FastQ files at the specified location. Which files to match are obtained from the
second column in the custom sample sheet. If the custom data has UMIs, then they can be captured too by
providing the '--umi_ext' argument.
Example custom sample sheet (csv):
path,basename
/path/to/reads,sampleA
/path/to/reads,sampleB
/path/to/reads,sampleC