/bcbio.rnaseq

Automatic RNA-seq DE caller benchmarking

Primary LanguageClojure

<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"> </script>

Overview

Build Status DOI

There are a large number of algorithms designed to call differential events from RNA-seq data at the gene, isoform and splicing-event level. It would be great if when a new caller comes out, there was a way to automatically place it in the context of the other callers. Every now and then a review article comes out which compares a subset of the tools against each other. This project aims to do provide a tool to do that automatically.

Quickstart

  1. Install leiningen
  2. Install pandoc (at least v1.12.3)
  3. Download the bcbio.repository
git clone https://github.com/roryk/bcbio.rnaseq
  1. Install the necessary libraries
cd bcbio.rnaseq
Rscript resources/scripts/install_libraries.R
  1. Make an executable.
make

This will make an executable in bin/bcbio-rnaseq that you can move to wherever you want and run.

  1. Run a simulation of sample size 5
bcbio-rnaseq simulate -s 5 -d output_dir

In the output_dir you will find plots comparing the output from the callers against the set of simulated true positives. For example here are concordance plots vs. the set of true simulated RNA-seq genes for an experiment with 3 replicates:

vs 30 replicates:

The simulator simulates a range of fold changes and the two concordance plots are faceted on the log2 fold change cutoff listed at the top. The simulated set has 200 additional genes that are DE between each cutoff, for a total of 1000 DE genes. For three samples you can see almost all DE genes are called at a 4-fold change. At a 2-fold change only around half of the correct DE genes are called and almost none for fold changes lower than that. For 30 samples genes with a lower fold change are called much more accurately.

Other useful plots are the Jaccard index:

and the area under the ROC curve for each fold change cutoff:

Modes of operation

Simulation

bcbio.rnaseq can be used to simulate a RNA-seq experiment of a given sample size. Running bcbio.rnaseq with:

bcbio-rnaseq simulate -s 3 -d sample_size_3

Will output the results of simulating an experiment of size 3 to the directory sample_size_3. The simulator simulates an experiment with 20 million reads with 10,000 total genes, 1000 of which are differentially expressed, 200 at five different fold change levels.

bcbio-nextgen comparison

This mode is invoked with

bcbio-rnaseq compare path-to-project-summary.yaml metadata-to-test-on

and takes the output from a bcbio-nextgen RNA-seq analysis and runs several different DE callers and compares the results to each other. If your run includes ERCC spike-in data bcbio.rnaseq will detect this automatically and run a concordance analysis on the ERCC data. Passing the --seqc flag turns on a a diagnostic mode for determining how well DE callers work against a reference set of ~ 1000 qPCR assayed genes from the SEQC project. The true set of DE calls were made on the qPCR data via the hacky method of performing a t-test and BH correcting the p-values, and calling anything with a FDR < 0.05 as differentially expressed. To run in this mode you would first download and prepare the SEQC data, align and quantitate it with bcbio-nextgen and run bcbio.rnaseq on the results.

bcbio-nextgen report

This mode produces a Rmd file with a report of a bcbio-nextgen run. If you run it without passing a formula, it will just generate a quality-control report. If you pass a formula like this:

bcbio-rnaseq summarize path-to-project-summary-yaml -f "~batch+panel"

It will run a DESeq2 analysis with the specified model and prepare reports of all pairwise comparisons for the last factor in the model. An example report is available here.

Status

If you have cufflinks and R installed and in your path, this should work for you. If it doesn't please open an issue and we'll fix it.

Adding new R-based DE callers

If you make a R-file template called name-of-caller.template and stick it in the resources directory, it will automatically be added to the list of callers to run. The templates have to handle three things: transforming the counts to something the caller can use, resolving the groups to be compared from a list indicating which column in the table goes with which group and outputting a table with the DE calls in the right format. Each file takes three inputs:

  1. count-file: a file of counts of reads mapping to genes, with the rows the IDs and the columns the sample names. This will be provided to your script.
  2. class: this is a list which pairs each column with each condition to run. Right now this only does simple two-factor analyses.
  3. out-file: In your R script the results should be transformed to have these columns::
    id, expr, logFC, pval, padj, algorithm

where id is the gene ID, expr is the normalized average counts for all of the conditions for that gene, logFC is the log2 fold change, pval the p-value that condition 1 is different than condition 2, padj the Benjamini-Hochberg adjusted p-value and algorithm the name for this caller that you want to show up on the graphs.

An example is clearer: a proper template, deseq and the output, sample-output.