wf-transcriptomes

This repository contains a nextflow workflow for the assembly and annotation of transcripts from Oxford Nanopore cDNA or direct RNA reads. It has been adapted from two existing Snakemake pipelines.

Introduction

This workflow identifies RNA isoforms using either cDNA or direct RNA (dRNA) Oxford Nanopore reads.

Preprocessing

cDNA reads are initially preprocessed by pychopper to identify full-length reads, trim them, and correct their orientation. This step is omitted for direct RNA reads.

Transcript assembly

Reference-aided transcript assembly approach

Full-length reads are mapped to a supplied reference genome using minimap2.
Transcripts are assembled by stringtie in long-read mode (with or without a guide reference annotation) to generate the GFF annotation.
The annotation generated by the pipeline is compared to the reference annotation using gffcompare.
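
The three steps above can be pictured as the following dry run, which only prints one plausible command per step rather than executing anything. The flags shown (`-ax splice`, `-L`, `-G`, `-r`) are standard options of minimap2, stringtie and gffcompare, but the exact options the workflow passes are not stated here. The `samtools sort` step and all file names are assumptions made for this illustration.

```shell
# Dry-run sketch: print (do not execute) one plausible command per step.
# File names are placeholders; the workflow's real invocations may differ.
plan_reference_assembly() {
    ref=$1; reads=$2; annotation=$3
    # 1. Spliced alignment of full-length reads to the reference genome
    #    (sorting with samtools is an assumption, not taken from this README).
    echo "minimap2 -ax splice $ref $reads | samtools sort -o aln.bam -"
    # 2. Assembly in long-read mode (-L), guided by the annotation (-G).
    echo "stringtie -L -G $annotation -o assembled.gtf aln.bam"
    # 3. Comparison of the assembled annotation with the reference annotation.
    echo "gffcompare -r $annotation -o compare assembled.gtf"
}

plan_reference_assembly genome.fa full_length_reads.fq annotation.gtf
```

Dropping `-G $annotation` from the second command corresponds to assembly without a guide annotation.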

de novo-based transcript assembly (experimental!)

Sequence clusters are generated using isONclust2.
- If a reference genome is supplied, cluster quality metrics are determined by comparing
  with clusters generated from a minimap2 alignment.
A consensus sequence for each cluster is generated using spoa.
Three rounds of polishing using racon and minimap2 give a final polished CDS for each gene.
Full-length reads are then mapped to these polished CDS.
Transcripts are assembled by stringtie as for the reference-based approach.
Note: This approach is currently not supported with direct RNA reads.

Fusion gene detection

Fusion gene detection is performed using JAFFA, with the JAFFAL extension for use with ONT long reads.

Differential expression analysis

Differential expression analysis is performed on the transcripts output by the workflow.
A non-redundant transcriptome is generated using the merge function in stringtie.
The reads are then aligned to the transcriptome using minimap2 in a splice-aware manner.
salmon is used for transcript quantification.
R packages edgeR and stageR are used for differential expression analysis.
DEXSeq is then used for differential transcript usage analysis.

Workflow inputs

A directory containing cDNA/direct RNA reads, or a directory containing subdirectories that each hold reads from a different sample (in fastq/fastq.gz format).
Reference genome in fasta format (required for reference-based assembly).
Optional reference annotation in GFF2/3 format (required for differential expression analysis --de_analysis).
For fusion detection, JAFFAL reference files (see Quickstart)
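
As a rough illustration of the multi-sample input layout described above, the sketch below builds a directory with one subdirectory per sample. The names `reads`, `sample_a` and `sample_b` are placeholders invented for this example, and the empty files only stand in for real data.

```shell
# Hypothetical layout: one subdirectory per sample under a top-level directory.
# All directory and file names are placeholders, not names the workflow requires.
fastq_dir="$(mktemp -d)/reads"
mkdir -p "$fastq_dir/sample_a" "$fastq_dir/sample_b"

# Empty files standing in for real reads (fastq and fastq.gz are both accepted).
: > "$fastq_dir/sample_a/reads.fastq"
: > "$fastq_dir/sample_b/reads.fastq.gz"

# List the samples that the top-level directory would provide.
for sample in "$fastq_dir"/*/; do
    basename "$sample"
done
```

The top-level directory is what would then be given to `--fastq`.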

Quickstart

The workflow uses nextflow to manage compute and software resources, so nextflow must be installed before attempting to run the workflow.

The workflow can currently be run using Docker, Singularity or conda to provide isolation of the required software. Each method is automated out-of-the-box, provided the corresponding tool is installed.

It is not necessary to clone or download the git repository in order to run the workflow. For more information on running EPI2ME Labs workflows, visit our website.

Workflow options

To obtain the workflow, having installed nextflow, users can run:

nextflow run epi2me-labs/wf-transcriptomes --help

to see the options for the workflow.

Download demonstration data

A small test dataset is provided for the purposes of testing the workflow software. It consists of reads, reference, and annotations from human chromosome 20 only. It can be downloaded using:

wget -O test_data.tar.gz https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-isoforms/wf-isoforms_test_data.tar.gz 
tar -xzvf  test_data.tar.gz

Example execution of a workflow for reference-based transcript assembly and fusion detection

OUTPUT=~/output;
nextflow run epi2me-labs/wf-transcriptomes --fastq ERR6053095_chr20.fastq --ref_genome chr20/hg38_chr20.fa --ref_annotation chr20/gencode.v22.annotation.chr20.gtf \
      --jaffal_refBase chr20/ --jaffal_genome hg38_chr20 --jaffal_annotation genCode22 --out_dir outdir -w workspace_dir -profile conda -resume

Example workflow for de novo transcript assembly

OUTPUT=~/output
nextflow run . --fastq test_data/fastq --denovo --ref_genome test_data/SIRV_150601a.fasta  -profile local --out_dir ${OUTPUT} -w ${OUTPUT}/workspace \
--sample sample_id -resume

A full list of options can be seen in nextflow_schema.json. Below are some commonly used ones.

Threshold for including isoforms in the interactive table: transcript_table_cov_thresh = 50
Run the de novo pipeline: denovo = true (default false)
Run the workflow with direct RNA reads: --direct_rna (this only skips the pychopper step).

Pychopper and minimap2 can take options via minimap2_opts and pychopper_opts, for example:

When using the SIRV synthetic test data
- minimap2_opts = '-uf --splice-flank=no'
pychopper needs to know which cDNA synthesis kit was used
- SQK-PCS109: use pychopper_opts = '-k PCS109' (default)
- SQK-PCS110: use pychopper_opts = '-k PCS110'
- SQK-PCS111: use pychopper_opts = '-k PCS111'
pychopper can use one of two available backends for identifying primers in the raw reads
- nhmmscan: pychopper_opts = '-m phmm'
- edlib: pychopper_opts = '-m edlib'

Note: edlib is set by default in the config as it's quite a lot faster. However, it may be less sensitive than nhmmscan.
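
One way to keep these options together is a nextflow params file passed with `-params-file`. The sketch below writes such a file. The file name `params.json` and the particular combination of values are only examples, and the `nextflow run` line is left as a comment so that the snippet runs without nextflow installed.

```shell
# Write an example params file; the values mirror the options described above.
cat > params.json <<'EOF'
{
  "transcript_table_cov_thresh": 50,
  "denovo": false,
  "minimap2_opts": "-uf --splice-flank=no",
  "pychopper_opts": "-k PCS110 -m edlib"
}
EOF

# Show what would be passed to the workflow.
cat params.json

# Then, with nextflow installed (not executed here; <reads> is a placeholder):
#   nextflow run epi2me-labs/wf-transcriptomes --fastq <reads> -params-file params.json
```

A params file also avoids shell-quoting problems with option strings that themselves begin with a dash.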

Fusion detection

JAFFAL from the JAFFA package is used to identify potential fusion transcripts. To get this working, a couple of things need to be done first.

Install JAFFA

To install JAFFA and its dependencies, run the following:

cd wf-transcriptomes/
./subworkflows/JAFFAL/install_jaffa.sh

Prepare JAFFAL reference data

To use pre-processed reference files for the hg38 genome and GENCODE v22 annotation (as used in the JAFFAL paper), do:

mkdir jaffal_data_dir
cd jaffal_data_dir/
wf-transcriptomes/subworkflows/JAFFAL/load_jaffal_references.sh

To use alternative genome and annotation files, they should be prepared as described here

Specifying the location of the JAFFA code and reference directories

--jaffal_dir Full path to the directory made by running install_jaffa.sh as shown above, e.g. /home/wf-transcriptomes/JAFFA

--jaffal_refBase The directory containing the reference data prepared for use with JAFFAL

JAFFAL annotation and genome files

The prepared JAFFAL reference files will look something like hg38_chr20_genCode22.fa. To enable JAFFAL to find these files, --jaffal_genome should be set to hg38_chr20 and --jaffal_annotation to genCode22.
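
The naming convention described above can be checked before launching the workflow. This sketch assumes the reference FASTA is named `<genome>_<annotation>.fa` inside the `--jaffal_refBase` directory, as the example file name suggests. The helper `jaffal_ref_fasta` is made up for this illustration and is not part of JAFFAL or the workflow.

```shell
# Hypothetical helper: build the FASTA path implied by the three JAFFAL options,
# assuming the pattern <refBase>/<genome>_<annotation>.fa.
jaffal_ref_fasta() {
    ref_base=$1; genome=$2; annotation=$3
    # ${ref_base%/} drops a trailing slash so the path has a single separator.
    printf '%s/%s_%s.fa\n' "${ref_base%/}" "$genome" "$annotation"
}

expected=$(jaffal_ref_fasta chr20/ hg38_chr20 genCode22)
echo "$expected"

# Warn early if the file is missing instead of failing inside the workflow.
[ -f "$expected" ] || echo "warning: $expected not found" >&2
```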

JAFFAL Notes: g++ must be installed. JAFFAL does not currently work on Mac M1 (osx-arm64 architecture). If no fusion transcripts are detected, the workflow will terminate with an error at the JAFFAL stage. If this happens, skip the JAFFAL stage by omitting --jaffal_refBase.
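
The first two notes lend themselves to a quick pre-flight check. The function below is a hypothetical helper, not something shipped with the workflow. It takes the operating system, architecture and compiler name as arguments so that it can be exercised without changing machines.

```shell
# Hypothetical pre-flight check based on the notes above.
# Returns 0 if JAFFAL can be expected to run, 1 otherwise.
jaffal_preflight() {
    os=$1; arch=$2; compiler=$3
    # Apple Silicon (reported by uname as Darwin/arm64) is not supported.
    if [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]; then
        echo "JAFFAL is not supported on osx-arm64" >&2
        return 1
    fi
    # The compiler (g++) must be available on the PATH.
    if ! command -v "$compiler" >/dev/null 2>&1; then
        echo "$compiler not found; install it before running JAFFAL" >&2
        return 1
    fi
    return 0
}

if jaffal_preflight "$(uname -s)" "$(uname -m)" g++; then
    echo "JAFFAL pre-flight passed"
else
    echo "JAFFAL pre-flight failed; consider omitting --jaffal_refBase"
fi
```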

Differential Expression

Differential Expression requires at least 2 replicates of each sample to compare. You can see an example condition_sheet.tsv in test_data.
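
The replicate requirement can be checked before starting a run. The sketch below assumes a tab-separated sheet with a header row, the sample name in column 1 and the condition in column 2. That layout and the column names are assumptions made for illustration, so compare them with the example condition_sheet.tsv before relying on this check.

```shell
# Assumed layout (not confirmed by this README): a header row, then one line per
# sample with the sample name in column 1 and its condition in column 2.
sheet=$(mktemp)
printf 'sample_id\tcondition\n' > "$sheet"
printf 'rep1\tcontrol\nrep2\tcontrol\nrep3\ttreated\n' >> "$sheet"

# Print every condition that has fewer than two replicates.
under_replicated() {
    awk -F '\t' 'NR > 1 { n[$2]++ } END { for (c in n) if (n[c] < 2) print c }' "$1"
}

under_replicated "$sheet"
```

With the made-up sheet above, only the `treated` condition is reported, because it has a single replicate.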

Example workflow for differential expression transcript assembly

Download differential expression data set

wget -O differential_expression.tar.gz https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-isoforms/wf-isoforms_differential_expression.tar.gz && tar -xzvf differential_expression.tar.gz

Run the command

OUTPUT=~/output;
nextflow run epi2me-labs/wf-transcriptomes --fastq differential_expression_dataset/fastq --de_analysis \
--ref_genome differential_expression_dataset/hg38_chr20.fa \
--ref_annotation differential_expression_dataset/gencode.v22.annotation.chr20.gtf \
--direct_rna

Workflow outputs

An HTML report document detailing the primary findings of the workflow.
For each sample:
- gffcompare output directories
- read_aln_stats.tsv - alignment summary statistics
- transcriptome.fas - the assembled transcriptome
- merged_transcriptome.fas - annotated, assembled transcriptome
- jaffal output directories

Fusion detection outputs

In ${out_dir}/jaffal_output_${sample_id} you will find:

jaffa_results.csv - the csv results summary file
jaffa_results.fasta - fusion transcript sequences
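
To get a quick per-sample overview of these results, something like the following could be used. It assumes each jaffa_results.csv has one header row followed by one row per reported fusion, which is not stated above. The function name `count_fusions` and the demonstration data are invented for this sketch.

```shell
# Count reported fusions per sample, assuming each jaffa_results.csv has one
# header row followed by one row per fusion (an assumption, not stated above).
count_fusions() {
    out_dir=$1
    for csv in "$out_dir"/jaffal_output_*/jaffa_results.csv; do
        [ -f "$csv" ] || continue           # no JAFFAL output in this directory
        sample=${csv%/jaffa_results.csv}    # strip the file name
        sample=${sample##*/jaffal_output_}  # strip the directory prefix
        rows=$(awk 'NR > 1 && NF' "$csv" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$sample" "$rows"
    done
}

# Demonstration on a throwaway directory with made-up content.
demo=$(mktemp -d)
mkdir -p "$demo/jaffal_output_sampleA"
printf 'header\nrow1\nrow2\n' > "$demo/jaffal_output_sampleA/jaffa_results.csv"
count_fusions "$demo"
```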

Differential Expression outputs

dtu_plots.pdf - a PDF with differential transcript usage plots
