RNAsum

RNA-seq reporting workflow designed to post-process, summarise and visualise an output from bcbio-nextgen RNA-seq or Dragen RNA pipelines. Its main application is to complement genome-based findings from umccrise pipeline and to provide additional evidence for detected alterations.

Installation
Workflow
Reference data
- External reference cohorts
- Internal reference cohort
Input data
- WTS
  - bcbio-nextgen
  - Dragen RNA
- WGS
Usage
Docker

Installation

Run the following to create a directory "rnasum" and install into it

mkdir rnasum
cd rnasum
source <(curl -s https://raw.githubusercontent.com/umccr/RNAseq-Analysis-Report/master/install.sh)

It will generate load_rnasum.sh script that can be sourced to load the rnasum environment:

source load_rnasum.sh

Workflow

The pipeline consist of five main components illustrated and breifly described below. See the workflow.md for the complete description of the data processing workflow.

Collect patient sample WTS data from bcbio-nextgen RNA-seq or Dragen RNA pipeline including per-gene read counts and gene fusions.
Add expression data from reference cohorts to get an idea about expression levels of genes of interest in other cancer patient cohorts. The read counts are normalised, transformed and converted into a scale that allows to present the sample's expression measurements in the context of the reference cohorts.
Feed in genome-based findings from whole-genome sequencing (WGS) data to focus on genes of interest and provide additional evidence for dysregulation of mutated genes, or genes located within detected structural variants (SVs) or copy-number (CN) altered regions. The RNAsum pipeline is designed to be compatible with WGS patient report based on umccrise pipeline output.
Collate results with knowledge derived from in-house resources and public databases to provide additional source of evidence for clinical significance of altered genes, e.g. to flag variants with clinical significance or potential druggable targets.
The final product is a html-based interactive report with searchable tables and plots presenting expression levels of the genes of interest genes. The report consist of several sections described in report_structure.md.

Reference data

The reference expression data is availbale for 33 cancer types and were derived from external (TCGA) and internal (UMCCR) resources.

External reference cohorts

In order to explore expression changes in queried sample we have built a high-quality pancreatic cancer reference cohort.

Depending on the tissue from which the patient's sample was taken, one of 33 cancer datasets from TCGA can be used as a reference cohort for comparing expression changes in genes of interest in investigated sample. Additionally, 10 samples from each of the 33 datasets were combined to create Pan-Cancer dataset, and for some cohorts extended sets are also available. All available datasets are listed in TCGA projects summary table. These datasets have been processed using methods described in TCGA-data-harmonization repository. The dataset of interest can be specified by using one of the TCGA project IDs (Project column) for the --dataset argument in RNAseq_report.R script (see Arguments section).

Note

Each dataset was cleaned based on the quality metrics provided in the Merged Sample Quality Annotations file merged_sample_quality_annotations.tsv from TCGA PanCanAtlas initiative webpage (see TCGA-data-harmonization repository for more details, including sample inclusion criteria).

Internal reference cohort

The publically available TCGA datasets are expected to demonstrate prominent batch effects when compared to the in-house WTS data due to differences in applied experimental procedures and analytical pipelines. Moreover, TCGA data may include samples from tissue material of lower quality and cellularity compared to samples processed using local protocols. To address these issues, we have built a high-quality internal reference cohort processed using the same pipelines as input data (see Data pre-processing section on the workflow page).

This internal reference set of 40 pancreatic cancer samples is based on WTS data generated at UMCCR and processed with bcbio-nextgen RNA-seq pipeline to minimise potential batch effects between investigated samples and the reference cohort and to make sure the data are comparable. The internal reference cohort assembly is summarised in Pancreatic-data-harmonization repository.

Note

The are two rationales for using the internal reference cohort:

In case of pancreatic cancer samples this cohort is used (I) in batch effects correction, as well as (II) as a reference point for comparing per-gene expression levels observed in the investigated single-subject data and data from other pancreatic cancer patients.
In case of samples from any cancer type the data from the internal reference cohort is used in batch effects correction procedure performed to minimise technical-related variation in the data.

Input data

The pipeline accepts WTS data processed by bcbio-nextgen RNA-seq or Dragen RNA pipeline. Additionally, the WTS data can be integrated with WGS-based data processed using umccrise pipeline. In the latter case, the genome-based findings from corresponding sample are incorporated into the report and are used as a primary source for expression profiles prioritisation.

WTS

The only required WTS input data are read counts provided in quantification file from either bcbio-nextgen RNA-seq or Dragen RNA pipeline.

bcbio-nextgen

The read counts are provided within quantification file from kallisto (see example abundance.tsv file and its description). The per-transcript abundances are reported in estimated counts (est_counts) and in Transcripts Per Million (tpm), which are then converted to per-gene estimates. Additionally, a list of fusion genes detected by arriba and pizzly can be provided (see example fusions.tsv and test_sample_WTS-flat.tsv).

Table below lists all input data accepted in the pipeline:

Input file	Tool	Example	Required
Quantified abundances of transcripts	kallisto	abundance.tsv	Yes
List of detected fusion genes	arriba pizzly	fusions.tsv test_sample_WTS-flat.tsv	No
Plots of detected fusion genes using arriba	arriba	fusions.pdf	No

These files are expected to be organised following the folder structure below

|
|____<SampleName>
  |____kallisto
  | |____abundance.tsv
  |____pizzly
  | |____<SampleName>-flat.tsv
  |____arriba
    |____fusions.pdf
    |____fusions.tsv

Note

Fusion genes detected by pizzly are expected to be listed in the flat table. By default two output tables are provided: (1) <sample_name>-flat.tsv listing all gene fusion candidates and (2) <sample_name>-flat-filtered.tsv listing only gene fusions remaining after filtering step. However, this workflow makes use of gene fusions listed in the unfiltered pizzly output file (see example test_sample_WTS-flat.tsv) since it was noted that some genuine fusions (based on WGS data and curation efforts) are excluded in the filtered pizzly output file.

Dragen RNA

The read counts are provided within quantification file from salmon (see example TEST.quant.sf file and its description). The per-transcript abundances are reported in estimated counts (NumReads ) and in Transcripts Per Million (TPM ), which are then converted to per-gene estimates. Additionally, a list of fusion genes can be provided (see example TEST.fusion_candidates.final).

Table below lists all input data accepted in the pipeline:

Input file	Tool	Example	Required
Quantified abundances of transcripts	salmon	TEST.quant.sf	Yes
List of detected fusion genes	Dragen RNA	TEST.fusion_candidates.final	No

These files are expected to be organised following the folder structure below

|
|____<SampleName>
  |____<SampleName>quant.sf
  |____<SampleName>.fusion_candidates.final

WGS

The following umccrise output files are accepted as input data in the pipeline:

Input file	Tool	Example	Required
List of detected and annotated single-nucleotide variants (SNVs) and indels	PCGR	test_subject__test_sample_WGS-somatic.pcgr.snvs_indels.tiers.tsv	No
List of genes involved in CN altered regions	PURPLE	test_subject__test_sample_WGS.purple.gene.cnv	No
List of genes involved in SV regions	Manta	test_subject__test_sample_WGS-sv-prioritize-manta-pass.tsv	No

These files are expected to be organised following the folder structure below

|
|____umccrised
  |____<SampleName>
    |____pcgr
    | |____<SampleName>-somatic.pcgr.snvs_indels.tiers.tsv
    |____purple
    | |____<SampleName>.purple.gene.cnv
    |____structural
      |____<SampleName>-manta.tsv

Usage

To run the pipeline execure the RNAseq_report.R script. This script catches the arguments from the command line and passes them to the RNAseq_report.Rmd script to produce the interactive HTML report.

Arguments

Argument	Description	Required
--sample_name	The name of the sample to be analysed and reported	Yes
--bcbio_rnaseq	Location of the results folder from bcbio-nextgen RNA-seq pipeline	Yes*
--dragen_rnaseq	Location of the results folder from Dragen RNA pipeline	Yes*
--report_dir	Desired location for the report	Yes
--dataset	Dataset to be used as external reference cohort. Available options are TCGA project IDs listed in TCGA projects summary table `Project` column (default is `PANCAN`)	No
--transform	Transformation method for converting read counts. Available options are: `CPM` (default) and `TPM`	No
--norm	Normalisation method. Available options are: `TMM` (default), `TMMwzp`, `RLE`, `upperquartile` or `none` for CPM-transformed data, and `quantile` (default) or `none` for TPM-transformed data	No
--batch_rm	Remove batch-associated effects between datasets. Available options are: `TRUE` (default) and `FALSE`	No
--filter	Filtering out low expressed genes. Available options are: `TRUE` (default) and `FALSE`	No
--log	Log (base 2) transform data before normalisation. Available options are: `TRUE` (default) and `FALSE`	No
--scaling	Apply `gene-wise` (default) or `group-wise` data scaling. Available options are: `TRUE` (default) and `FALSE`	No
--drugs	Include drug matching section in the report. Available options are: `TRUE` and `FALSE` (default)	No
--immunogram	Include immunogram in the report. Available options are: `TRUE` and `FALSE` (default)	No
--umccrise	Location of the corresponding umccrise output (including PCGR (see example), Manta (see example) and PURPLE (see example) output files) from genome-based data	No
--pcgr_tier	Tier threshold for reporting variants reported in PCGR (if PCGR results are available, default is `4`)	No
--pcgr_splice_vars	Include non-coding `splice_region_variant`s reported in PCGR (if PCGR results are available). Available options are: `TRUE` (default) and `FALSE`	No
--cn_loss	CN threshold value to classify genes within lost regions (if CN results from PURPLE are available, default is `5th percentile` of all CN values)	No
--cn_gain	CN threshold value to classify genes within gained regions (if CN results from PURPLE are available, default is `95th percentile` of all CN values)	No
--clinical_info	Location of xslx file with clinical information (see example )	No
--clinical_id	ID required to match sample with the subject clinical information (if available)	No
--subject_id	Subject ID. Note, if `umccrise` is specified (flag `--umccrise`) then subject ID is extracted from `umccrise` output files and used to overwrite this argument	No
--sample_source	Source of investigated sample (e.g. fresh frozen tissue, organoid; for annotation purposes only)	No
--sample_name_mysql	Desired sample name for MySQL insert command. By default value in `--sample_name` is used	No
--project	Project name (for annotation purposes only)	No
--top_genes	The number of top ranked genes to be presented (default is `5`)	No
--dataset_name_incl	Include dataset in the report name. Available options are: `TRUE` and `FALSE` (default)	No
--save_tables	Save interactive summary tables as HTML files. Available options are: `TRUE` (default) and `FALSE`	No
--hide_code_btn	Hide the Code button allowing to show/hide code chunks in the final HTML report. Available options are: `TRUE` (default) and `FALSE`	No
--grch_version	Human reference genome version used for genes annotation. Available options: `37` and `38` (default)	No

* Location of the results folder from either bcbio-nextgen RNA-seq or Dragen RNA pipeline is required.

Packages: required packages are listed in environment.yaml file.

Note

Human reference genome GRCh38 (Ensembl based annotation version 86) is used for genes annotation as default. Alternatively, human reference genome GRCh37 (Ensembl based annotation version 75) is used when argument grch_version is set to 37.

Examples

Below are command line use examples for generating Patient Transcriptome Summary report using:

WTS data only
WTS and WGS data
WTS WGS and clinical data

Note

make sure that the created conda environment (see Installation section) is activated

conda activate rnasum

RNAseq_report.R script (see the beginning of Usage section) should be executed from rmd_files folder

cd rmd_files

example data is provided in data/test_data folder
Usually the data processing and report generation would take less than 20 minutes using 16GB RAM memory and 1 CPU

1. WTS data only

In this scenario, only WTS data will be used and only expression levels of key Cancer genes, Fusion genes, Immune markers and homologous recombination deficiency genes (HRD genes) will be reported. Moreover, gene fusions reported in Fusion genes section will not contain inforamation about evidence from genome-based data. A subset of the TCGA pancreatic adenocarcinoma dataset is used as reference cohort (--dataset TEST ).

The input files are expected to be organised following the folder structure described in Input data:WTS section.

bcbio-nextgen

Rscript RNAseq_report.R  --sample_name test_sample_WTS  --dataset TEST  --bcbio_rnaseq $(pwd)/../data/test_data/final/test_sample_WTS  --report_dir $(pwd)/../data/test_data/final/test_sample_WTS/RNAsum  --save_tables FALSE

The interactive HTML report named test_sample_WTS.RNAsum.html will be created in data/test_data/final/test_sample_WTS/RNAsum folder.

Dragen RNA

Rscript RNAseq_report.R  --sample_name test_sample_WTS  --dataset TEST  --dragen_rnaseq $(pwd)/../data/test_data/stratus/test_sample_WTS  --report_dir $(pwd)/../data/test_data/stratus/test_sample_WTS/RNAsum  --save_tables FALSE

The interactive HTML report named test_sample_WTS.RNAsum.html will be created in data/test_data/stratus/test_sample_WTS/RNAsum folder.

2. WTS and WGS data

This is the most frequent and preferred case, in which the WGS-based findings will be used as a primary source for expression profiles prioritisation. The genome-based results can be incorporated into the report by specifying location of the corresponding umccrise output files (including results from PCGR, PURPLE and Manta) using --umccrise argument. The Mutated genes, Structural variants and CN altered genes sections will contain information about expression levels of the mutated genes, genes located within detected structural variants (SVs) and copy-number (CN) altered regions, respectively. The results in the Fusion genes section will be ordered based on the evidence from genome-based data. A subset of the TCGA pancreatic adenocarcinoma dataset is used as reference cohort (--dataset TEST ).

The umccrise output files are expected to be organised following the folder structure described in Input data:WGS section.

bcbio-nextgen

Rscript RNAseq_report.R  --sample_name test_sample_WTS  --dataset TEST  --bcbio_rnaseq $(pwd)/../data/test_data/final/test_sample_WTS  --report_dir $(pwd)/../data/test_data/final/test_sample_WTS/RNAsum  --umccrise $(pwd)/../data/test_data/umccrised/test_subject__test_sample_WGS  --save_tables FALSE

The interactive HTML report named test_sample_WTS.RNAsum.html will be created in data/test_data/final/test_sample_WTS/RNAsum folder.

Dragen RNA

Rscript RNAseq_report.R  --sample_name test_sample_WTS  --dataset TEST  --dragen_rnaseq $(pwd)/../data/test_data/stratus/test_sample_WTS  --report_dir $(pwd)/../data/test_data/stratus/test_sample_WTS/RNAsum  --umccrise $(pwd)/../data/test_data/umccrised/test_subject__test_sample_WGS  --save_tables FALSE

The interactive HTML report named test_sample_WTS.RNAsum.html will be created in data/test_data/stratus/test_sample_WTS/RNAsum folder.

3. WTS WGS and clinical data

For samples derived from subjects, for which clinical information is available, a treatment regimen timeline can be added to the Patient Transcriptome Summary report. This can be added by specifying location of a relevant excel spreadsheet (see example test_clinical_data.xlsx) using the --clinical_info argument. In this spreadsheet, at least one of the following columns is expected: NEOADJUVANT REGIMEN, ADJUVANT REGIMEN, FIRST LINE REGIMEN, SECOND LINE REGIMEN or THIRD LINE REGIMEN, along with START and STOP dates of corresponding treatments. A subset of the TCGA pancreatic adenocarcinoma dataset is used as reference cohort (--dataset TEST ).

bcbio-nextgen

Rscript RNAseq_report.R  --sample_name test_sample_WTS  --dataset TEST  --bcbio_rnaseq $(pwd)/../data/test_data/final/test_sample_WTS  --report_dir $(pwd)/../data/test_data/final/test_sample_WTS/RNAsum  --umccrise $(pwd)/../data/test_data/umccrised/test_subject__test_sample_WGS  --clinical_info $(pwd)/../data/test_data/test_clinical_data.xlsx --save_tables FALSE

The interactive HTML report named test_sample_WTS.RNAsum.html will be created in data/test_data/final/test_sample_WTS/RNAsum folder.

Dragen RNA

Rscript RNAseq_report.R  --sample_name test_sample_WTS  --dataset TEST  --dragen_rnaseq $(pwd)/../data/test_data/stratus/test_sample_WTS  --report_dir $(pwd)/../data/test_data/stratus/test_sample_WTS/RNAsum  --umccrise $(pwd)/../data/test_data/umccrised/test_subject__test_sample_WGS  --clinical_info $(pwd)/../data/test_data/test_clinical_data.xlsx  --save_tables FALSE

The interactive HTML report named test_sample_WTS.RNAsum.html will be created in data/test_data/stratus/test_sample_WTS/RNAsum folder.

Output

The pipeline generates html-based Patient Transcriptome Summary report and results folder within user-defined output folder:

|
|____<output>
  |____<SampleName>.<output>.html
  |____results
    |____exprTables
    |____glanceExprPlots
    |____...

Report

The generated html-based Patient Transcriptome Summary report includes searchable tables and interactive plots presenting expression levels of altered genes, as well as links to public resources describing the genes of interest. The report consist of several sections, including:

* if clinical information is available; see --clinical_info argument
** if genome-based results are available; see --umccrise argument

Detailed description of the report structure, including results prioritisation and visualisation is available in report_structure.md.

Results

The results folder contains intermediate files, including plots and tables that are presented in the report.

Docker

Pull ready to run docker image from DockerHub

docker pull umccr/rnasum:0.3.2

An example command to use this pulled docker container is:

docker run --rm -v /path/to/RNAseq-report/RNAseq-Analysis-Report/envm/wts-report-wrapper.sh:/work/test.sh -v /path/to/RNAseq-report/RNAseq-Analysis-Report/data:/work c18db89d3093 /work/test.sh

Assumptions
- You are running the RNAsum container against the RNAsum code and test/reference data

mil2041/RNAsum

RNAsum

Table of contents

Installation

Workflow

Reference data

External reference cohorts

Note

Internal reference cohort

Note

Input data

WTS

bcbio-nextgen

Note

Dragen RNA

WGS

Usage

Arguments

Note

Examples

Note

1. WTS data only

bcbio-nextgen

Dragen RNA

2. WTS and WGS data

bcbio-nextgen

Dragen RNA

3. WTS WGS and clinical data

bcbio-nextgen

Dragen RNA

Output

Report

Results

Docker