The pipeline of transcriptome analysis

Pipeline of transcriptome data analysis

Overview

This workshop records the complete processing steps for transcriptome data analysis used in the CC-LY Lab, written by Shawn (Xiangyu) Pan and Xuelan Chen. The page aims to be easy to read and follow, especially for newcomers to bioinformatics. We will try to keep Pipeline-of-transcriptome updated. The pipeline is flexible: you can extend it with additional analysis steps and tools, such as GSEA, TF enrichment, and bulk RNA-seq data deconvolution. Comments and requests for improving this page are welcome. We hope it helps you get a good grip on basic transcriptome data analysis quickly and smoothly.

The analysis pipeline includes

  • Alignment
  • Transcript quantification
    • GenomicFeatures and Rsamtools
    • StringTie
    • RSEM
  • DEG identification
    • The summary of the methods to calculate the p-value
  • GO/KEGG enrichment
  • GSEA
  • Alternative splicing
  • Motif/TF identification
  • RNA editing
  • Mutation
  • etc.

1. The pre-processing steps

On this page, GenomicAlignments and Rsamtools are used to count reads from transcriptome data. In earlier versions, we used FPKM and TPM for heatmap visualization and gene set enrichment analysis. In the latest version, we instead use DESeq2-normalized data to describe the expression pattern of each gene, because this normalization accounts for differences in library size and RNA composition between samples. Pathway enrichment, especially GSEA, is also based on the DESeq2-normalized data.
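To make the idea of size-factor normalization concrete, here is a minimal Python sketch of the median-of-ratios approach that DESeq2 is known for. It is an illustration only, not DESeq2's own code: the function names `size_factors` and `normalize` and the toy counts are assumptions made for this example, and in the pipeline itself the normalization is done by DESeq2 in R.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch).

    counts: dict mapping sample name -> list of raw counts,
            with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples.
    # Genes with a zero count in any sample are skipped.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    if all(m is None for m in log_geo_means):
        raise ValueError("every gene has a zero count in at least one sample")

    # Size factor of a sample = median ratio of its counts to the gene-wise
    # geometric means (computed on the log scale).
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


def normalize(counts):
    """Divide each sample's raw counts by its size factor."""
    factors = size_factors(counts)
    return {s: [c / factors[s] for c in counts[s]] for s in counts}


# Toy data (hypothetical): sample_b was sequenced twice as deeply as sample_a.
toy_counts = {"sample_a": [10, 20, 30, 0], "sample_b": [20, 40, 60, 5]}
toy_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts)
```

In the toy data the size factor of `sample_b` is twice that of `sample_a`, so the first three genes end up with equal normalized values in both samples. Note that this scaling is per sample; it does not adjust for gene length the way FPKM or TPM do.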

The DESeq2 pipeline is also used here to identify differentially expressed genes (DEGs). Several essential parameters set the cutoffs for DEG detection; they are explained in detail on the following pages. To visualize the functions of the DEGs directly, clusterProfiler is included in this pipeline and runs GO/KEGG enrichment on the DEGs with default parameters. GSEA is also integrated and is described on a following page.
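As a rough illustration of how DEG cutoffs are typically applied, the sketch below filters a table of results by adjusted p-value and absolute log2 fold change. The column names `padj` and `log2FoldChange` follow the DESeq2 results table; the thresholds (0.05 and 1.0), the function name `filter_degs`, and the example rows are assumptions for this example and are not the cutoffs chosen by this pipeline, which are described on the following pages.

```python
def filter_degs(rows, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes passing both cutoffs and label their direction.

    rows: list of dicts with keys "gene", "log2FoldChange", "padj".
    Cutoff defaults are hypothetical, common choices.
    """
    degs = []
    for row in rows:
        padj = row.get("padj")
        lfc = row.get("log2FoldChange")
        # DESeq2 can report NA for padj; such genes are skipped here.
        if padj is None or lfc is None:
            continue
        if padj < padj_cutoff and abs(lfc) >= lfc_cutoff:
            direction = "up" if lfc > 0 else "down"
            degs.append({**row, "direction": direction})
    return degs


# Hypothetical results rows, for illustration only.
example_rows = [
    {"gene": "geneA", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "geneB", "log2FoldChange": -1.7, "padj": 0.02},
    {"gene": "geneC", "log2FoldChange": 0.4, "padj": 0.0001},  # effect too small
    {"gene": "geneD", "log2FoldChange": 3.0, "padj": 0.2},     # not significant
    {"gene": "geneE", "log2FoldChange": 1.5, "padj": None},    # padj is NA
]
example_degs = filter_degs(example_rows)
```

With these example rows, only `geneA` (up) and `geneB` (down) pass both cutoffs. Tightening or loosening either parameter changes the size of the DEG list that is passed on to enrichment analysis.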

  • Before using this pipeline, install the following software:
#STAR
STAR_2.6.0a

#Rscript
R scripting front-end version 3.5.1 (2018-07-02)
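One way to confirm that the required tools are available before starting is a small check like the sketch below. It assumes the executables are named `STAR` and `Rscript` and are on the `PATH`, and that both accept `--version`; the helper names `tool_version` and `check_tools` are made up for this example.

```python
import shutil
import subprocess


def tool_version(executable, args=("--version",)):
    """Return the version text printed by a tool, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, *args], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr instead of stdout,
    # so both streams are combined.
    output = (result.stdout + result.stderr).strip()
    return output or "unknown"


def check_tools(executables=("STAR", "Rscript")):
    """Map each executable name to its reported version (None if missing)."""
    return {name: tool_version(name) for name in executables}
```

Calling `check_tools()` returns a dictionary with one entry per tool; a value of `None` means the tool was not found. The reported strings can then be compared by eye with the versions listed above.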

2. The post-processing steps

After you finish the pre-processing steps, you can move directly to gene quantification and DEG identification. You can visit that page by clicking here.

3. Optional methods for transcript quantification and p-value calculation

3.1 Summary of transcript quantification methods

3.2 Summary of statistical methods

  • When we compare the expression level of a candidate gene between biological groups, statistical power largely determines how much confidence we can place in the result. To better support a hypothesis about candidate genes, especially when analyzing multiple clinical cohorts, it helps to consider more than one method of p-value calculation.
  • I have written a summary of the methods for calculating the p-value in DEG identification. You can visit it by clicking here.
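As one simple example of a p-value calculation for comparing a candidate gene between two groups, the sketch below runs an exact two-sided permutation test on the difference in group means. This is a generic method chosen for illustration; it is not necessarily one of the methods covered in the linked summary, and the function name and expression values are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes, so it is only practical for small sample sizes.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        n_splits += 1
    return extreme / n_splits


# Hypothetical normalized expression values of one gene in two groups.
control = [1.0, 2.0, 3.0]
treated = [10.0, 11.0, 12.0]
p_value = permutation_p_value(control, treated)
```

With three samples per group there are only 20 possible splits, so the smallest achievable two-sided p-value is 2/20 = 0.1, which is what the example returns. This illustrates why sample size limits statistical power regardless of how large the expression difference is.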

4. The identification of alternative splicing events

5. Keep updating