This is the generic pipeline to perform pairwise DE analysis from the count matrix. It is designed to work with pairwise design i.e. typical Treatment Vs Control experiment type for various biological conditions. This pipeline performs following operations:
- Read the input count matrix (pairwise design only).
- Generate visualizations for exploratory analysis (PCA, Dendrogram, distance boxplot, sample-distance heatmap).
- Perform DE analysis using DESeq2 and edgeR packages.
- Determine DE genes at two significance cutoffs (FDR <= 0.01 and FDR <= 0.05) and generate output tables with annotation.
- Additional visualizations for DE results (MA plots, Venn diagrams).
- Generate GSEA Pre-ranked list (DESeq2 based) for GSEA analysis.
Pipeline assume working R environment is available with following packages installed.
library(DESeq2)
library(ggplot2)
library(gplots)
library(tidyverse)
library(RColorBrewer)
library(pheatmap)
library(edgeR)
library(VennDiagram)
Example: The first column should have strict name Gene_ID, followed by control and treatment replicates denoted with numbers.
Gene_ID | control1 | control2 | control3 | treated1 | treated2 | treated3 |
---|---|---|---|---|---|---|
ENSG00000000003 | 1013 | 664 | 978 | 1090 | 1074 | 1025 |
ENSG00000000005 | 0 | 0 | 0 | 0 | 1 | 0 |
ENSG00000000419 | 1322 | 860 | 1145 | 1264 | 1301 | 1163 |
ENSG00000000457 | 83 | 87 | 67 | 94 | 165 | 105 |
ENSG00000000460 | 305 | 262 | 258 | 261 | 376 | 237 |
ENSG00000000938 | 2 | 0 | 0 | 0 | 0 | 1 |
Example:
Gene_ID | Gene type | Gene name | Gene description |
---|---|---|---|
ENSG00000229147 | unprocessed_pseudogene | SMPD4P2 | sphingomyelin phosphodiesterase 4 pseudogene 2 [Source:HGNC Symbol;Acc:HGNC:39674] |
ENSG00000256453 | protein_coding | DND1 | DND microRNA-mediated repression inhibitor 1 [Source:HGNC Symbol;Acc:HGNC:23799] |
ENSG00000185813 | protein_coding | PCYT2 | phosphate cytidylyltransferase 2, ethanolamine [Source:HGNC Symbol;Acc:HGNC:8756] |
ENSG00000268861 | protein_coding | AC008878.3 | Rho/Rac guanine nucleotide exchange factor 18 [Source:NCBI gene;Acc:23370] |
ENSG00000281782 | unprocessed_pseudogene | AC093642.6 | F-box protein 25 (FBXO25) pseudogene |
ENSG00000176749 | protein_coding | CDK5R1 | cyclin dependent kinase 5 regulatory subunit 1 [Source:HGNC Symbol;Acc:HGNC:1775] |
Clone of download the repository:
git clone https://github.com/sagarutturkar/RNASeq_DE_Pipeline.git
Syntax (for Rstudio)
system("RScript DE_pairwise_pipeline.R <Count_Matrix> <Control_Name> <Treatment_Name> <Number_of_Control_replicates> <Number_of_Treatment_replicates> <Annotation.TXT>")
system("RScript Make_venn.R DESeq2_FDR005_filtered.tsv edgeR_FDR005_filtered.tsv DESeq2 edgeR <custom_TAG> <Annotation.TXT>")
Working Example (for Rstudio)
system("RScript DE_pairwise_pipeline.R C1.TXT control treated 3 3 Annotation.TXT")
system("RScript Make_venn.R DESeq2_FDR005_filtered.tsv edgeR_FDR005_filtered.tsv DESeq2 edgeR FDR005 Annotation.TXT")
On Linux (command line):
cd test_data
Rscript ../DE_pairwise_pipeline.R C1.txt control treated 3 3 Annotation.txt
Rscript ../Make_Venn.R DESeq2_FDR005_filtered.tsv edgeR_FDR005_filtered.tsv DESeq2 edgeR FDR005 Annotation.txt
Result tables are TAB delimited TEXT files and best viewed when opened in Excel.
File Type | File Name | File Description |
---|---|---|
Log Files | sessioninfo.txt | Session Info for the current run |
edgeR_log.txt | Log from the edgeR package | |
DESeq2_log.txt | Log from the DESeq2 package | |
Rdata Files | edgeR_results.RData | edgeR results stored as Rdata Object |
DESeq2_results.RData | DESeq2 results stored as Rdata Object | |
Exploratory data visualizations | DESeq2_Cluster_Dendrogram.png | Cluster dendrogram for control and treatment samples |
DESeq2_Cooks_distance_boxplot.png | Boxplot showing the distribution of Cook’s distances for each library | |
Sample_distance_matrix.png | Heatmap showing the Euclidian distances between samples | |
DESeq2_PCA_with_Labels.png | PCA plot with lables | |
edgeR_mds.png | PCA plot from edgeR | |
DESeq2 Result Tables | DESeq2_All.tsv | Complete Table from DESeq2 (no filter) |
DESeq2_FDR001_filtered.tsv | DESeq2 significant DE genes (FDR <= 0.01) | |
DESeq2_FDR005_filtered.tsv | DESeq2 significant DE genes (FDR <= 0.05) | |
edgeR Result Tables | edgeR_All.tsv | Complete Table from edgeR (no filter) |
edgeR_FDR001_filtered.tsv | edgeR significant DE genes (FDR <= 0.01) | |
edgeR_FDR005_filtered.tsv | edgeR significant DE genes (FDR <= 0.05) | |
DE Result visualizations | edgeR_smear.png | MA plot from edgeR |
DESeq2_MA.png | MA plot from DESeq2 | |
Comaprison of DESeq2 and edgeR | FDR005_Overlap.tsv | Table for the overlapping DE genes (FDR <= 0.05) |
FDR005_Venn.png | Venn diagram for overlap between DESeq2 and edgeR DE genes at FDR <= 0.05 | |
GSEA Pre-ranked file | GSEA.rnk | GSEA pre-ranked file |
Eample data is provided in the directory test_data. It contains a random counts matrix and annotations (Human).
Steps:
- Download the two R scripts (DE_pairwise_pipeline.R and Make_Venn.R).
- Download the test_data directory.
- Run the R scripts as below:
system("RScript DE_pairwise_pipeline.R C1.TXT control treated 3 3 Annotation.TXT")
system("RScript Make_venn.R DESeq2_FDR005_filtered.tsv edgeR_FDR005_filtered.tsv DESeq2 edgeR FDR005 Annotation.TXT")
system("RScript Make_venn.R DESeq2_FDR001_filtered.tsv edgeR_FDR001_filtered.tsv DESeq2 edgeR FDR001 Annotation.TXT")
Runtime: On standrad laptop with 1 processor, the run with test data should complete under 10 minutes.