This code validates a Differential Gene Expression Analysis of RNA-Seq data that uses TPMCalculator as the quantification tool.

RNASeqWF

This project was designed to evaluate a Differential Gene Expression Analysis workflow following the same approach published in "Empirical assessment of analysis workflows for differential expression analysis of human samples using RNA-Seq" by Williams et al.

We tested five workflows for differential gene expression analysis (DGA). All workflows used the same aligner (STAR) and the same quantification tool (TPMCalculator) but different DGA software: EdgeR, Deseq2, SAMSeq, and the union and the intersection of the genes identified by all DGA tools. Recall and precision values were calculated with the same approach described in the paper, using the expressions:

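recall = |reference ∩ identified| / |reference|

precision = |reference ∩ identified| / |identified|

where reference is the set of genes in the reference and identified is the set of genes identified by the workflow.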
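The five gene sets and the two metrics can be sketched in Python with plain set operations. This is a minimal illustration rather than the project's notebook code: the `precision_recall` helper, the gene identifiers and the toy sets are hypothetical, and the metrics are the standard set-based definitions of precision and recall.

```python
def precision_recall(identified, reference):
    """Return (precision, recall) for identified genes against a reference set."""
    identified, reference = set(identified), set(reference)
    true_positives = identified & reference
    # Guard against empty sets to avoid division by zero.
    precision = len(true_positives) / len(identified) if identified else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data: genes reported as differentially expressed by each tool.
tool_calls = {
    "EdgeR": {"G1", "G2", "G3", "G4"},
    "Deseq2": {"G1", "G2", "G3"},
    "SAMSeq": {"G2", "G3", "G5"},
}
# Hypothetical reference set of truly differentially expressed genes.
reference = {"G1", "G2", "G3", "G6"}

# The two combined workflows: union and intersection of all tool calls.
workflows = dict(tool_calls)
workflows["union"] = set.union(*tool_calls.values())
workflows["intersection"] = set.intersection(*tool_calls.values())

results = {name: precision_recall(genes, reference)
           for name, genes in workflows.items()}

for name, (precision, recall) in results.items():
    print("{:<12} precision={:.2f} recall={:.2f}".format(name, precision, recall))
```

On this toy data the union has higher recall and lower precision than the intersection, which is the trade-off one would generally expect between the two combined workflows.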

We plotted our results on figure 5a of the paper to compare them with the published results. Although our recall values fall below the fitted line, our precision values lie above it, which indicates better precision than that published in the paper; see final plots here.

This analysis tests the whole workflow: STAR, TPMCalculator and a DGA tool. We use the same tools as in the paper for alignment (STAR) and DGA (EdgeR, Deseq2 and SAMSeq). The only difference is the quantification step, which uses TPMCalculator. Additionally, we used the paper's scripts and parameters for the DGA tools, all of which are R packages.

We see an increase in precision despite using the same first and last steps published in the paper for the STAR-quantification-DGA based workflows. We concluded that the increase in precision is due to the introduction of the TPMCalculator tool.

Our workflow is based on a set of Jupyter Notebooks and CWL workflows. The workflows execute the analysis using the following tools:

  • FastQC, for pre-processing quality control
  • Trimmomatic, for reads trimming
  • STAR, for reads alignment
  • RSeQC, for alignment quality control
  • TPMCalculator, for mRNA abundance quantification
  • Deseq2, for DGA
  • EdgeR, for DGA
  • SAMseq, for DGA
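As background for the quantification step, TPM (transcripts per million) normalizes read counts by feature length and then rescales them so that the values in a sample sum to one million. The sketch below shows only this textbook formula on hypothetical counts and lengths. It is not TPMCalculator's implementation, and the `tpm` function and the gene names are assumptions made for illustration.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to TPM given feature lengths in base pairs."""
    # Reads per kilobase: normalize each count by the feature length.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rpk.values())
    if total == 0:
        # No reads at all: every feature gets a TPM of zero.
        return {gene: 0.0 for gene in rpk}
    # Rescale so that the values of the sample sum to one million.
    return {gene: value / total * 1e6 for gene, value in rpk.items()}


# Hypothetical counts and lengths for three genes in one sample.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

tpm_values = tpm(counts, lengths_bp)
for gene in sorted(tpm_values):
    print("{}: {:.1f}".format(gene, tpm_values[gene]))
```

In this example geneA and geneB receive the same TPM even though geneB has three times as many reads, because geneB is also three times as long.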

Workflow steps

  1. Sample retrieval from SRA database
  2. Pre-processing QC
  3. Trimming
  4. Alignment
  5. Alignment Quality Control
  6. Quantification
  7. Differential Gene Expression Analysis
  8. Correlation with published results
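The order of the steps and the tool used at each one can be written down as a small Python structure, which is handy for logging or for checking that no step was skipped. The sketch is illustrative only: the `WORKFLOW_STEPS` name and the `describe` helper are hypothetical, the tool assignments come from the tool list in this README, and steps for which no tool is named here are marked `None`.

```python
# Ordered (step, tools) pairs; None means this README names no tool for the step.
WORKFLOW_STEPS = [
    ("Sample retrieval from SRA database", None),
    ("Pre-processing QC", ["FastQC"]),
    ("Trimming", ["Trimmomatic"]),
    ("Alignment", ["STAR"]),
    ("Alignment Quality Control", ["RSeQC"]),
    ("Quantification", ["TPMCalculator"]),
    ("Differential Gene Expression Analysis", ["Deseq2", "EdgeR", "SAMseq"]),
    ("Correlation with published results", None),
]


def describe(steps):
    """Return one human-readable line per workflow step, numbered from 1."""
    lines = []
    for number, (name, tools) in enumerate(steps, start=1):
        tool_text = ", ".join(tools) if tools else "tool not named here"
        lines.append("{}. {} ({})".format(number, name, tool_text))
    return lines


for line in describe(WORKFLOW_STEPS):
    print(line)
```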

Requirements

  1. Python 3.6
  2. CWL Tools definition files: cwl-workflow

Public Domain notice

National Center for Biotechnology Information.

This software is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite NCBI in any work or product based on this material.