/coGSEA

coGSEA : Comparative Gene Set Enrichment Analysis

Primary LanguageHTML

version : 0.99

Table of Content

Quick start

coGSEA(ElistObject = elist, contrastMatrix = contrast, ENTREZGenesId = elist$genes$ENTREZ, geneSetCollection = "H", specie = "Mus musculus", directoryPath = "/path/to/existing/dir")

Introduction

coGSEA (comparative Gene Set Enrichment Analysis) is a R package developped to help you compare and combine up to 12 different methods of Gene Set Enrichment Analsysis. In the current version, those 12 methods include :

Disclaimer : This tool is largely inspired by the eGSEA R package (and contains some of its code)

How to install

devtools::install_github("vguillemot/coGSEA")

Inputs

  • ElistObject : a Elist object. Your Elist object must be either a limma-voom object in case of RNAseq data, or have a estimated dispersion in case of MicroArray data. Dispersion estimates can be performed using the estimateDisp() function from the edgeR package.

  • contrastMatrix : the contrast matrix object for your experiment. For some help to make a contrast matrix, you can check the vignette of the limma package.

  • ENTREZGenesIds : the Elist object containing the ENTREZ gene identifiers. For example elist$genes$ENTREZ

Parameters

  • geneSetCollection : A geneset collection from the MSigDB database. Currently only three are supported : H (Hallmark) and two subset of C2_KEGG and C2_REACTOME(Kegg and Reactome). You can also use your own geneset collection. In this case, you need to provide an list object with genesets and genes. Such an example file can be found here

  • Specie : The organism from which the data were extracted. Currently only Homo sapiens and Mus musculus are supported. Default : Mus musculus

  • directoryPath : a path to an existing directory where coGSEA results and plots will be saved. Default : "./"
    Example : "~/coGSEA_results"

  • alpha : Alpha p value threshold. Default : 0.05
    example : 0.05

  • pvalAdjMethod : p value adjustment method for multiple testing. Default : BH.
    To select among the following methods :

    • holm
    • hochberg
    • hommel
    • bonferroni
    • BH (Benjamini Hochberg)
    • BY (Benjamini Yekutieli)
    • fdr (False Discovery rate)
    • none
  • pvalCombMethod : the method to combine all the p.values of one geneset accros the different GSEA methods. Default : sumlog
    To select among the following methods :

    • sumz (sum z method)
    • votep (vote counting method)
    • minimump (Wilkinson's method)
    • sumlog (Fisher's method) - selected by default
    • sump (sum p method)
    • logitp (logit method)
    • meanp (mean p method)
    • maximump (Wilkinson's method)
  • min.intersection.size : Graphical Parameter to select the minimum size of an intersection of genesets between different methods. Default : 1
    Example : 2

  • GSEA.Methods : between 1 and 12 methods to select from the methods listed in the introduction. Default : c("camera", "gage","globaltest", "gsva", "ssgsea", "zscore", "plage", "ora", "padog", "roast","safe")
    Example : c("camera", "gage","globaltest")

  • num.workers : number of thread for multithreading to decrease execution time. Default : 4
    Example : 4

  • shinyMode : Boolean value. Shouldn't be changed. Only used when running coGSEA is running in the background of a shiny application. Default : FALSE

Data preparation for running coGSEA

MicroArray data

This example dataset is from a article about Astrocymotas tumors published in 2011 by Liu et al.

You can download it on the Gene Expression Omnibus database with the accession number GSE19728

Loading the necessary packages for this analysis

library(affy)
library(hgu133plus2.db)
library(edgeR)
library(gage)
library(coGSEA)

Loading the CEL FILES and Normalizing them usin rma

setwd("~/GitLab/GSE19728/data/")
celfiles = ReadAffy()
celfiles = rma(celfiles)

Getting the expression data, and the Probe Names

intensity = exprs(celfiles)
intensity = cbind(rownames(intensity), intensity)
colnames(intensity)[1] = "PROBEID"
intensity = as.data.frame(intensity)
intensity$PROBEID = as.character(intensity$PROBEID)

Getting the annotations of Probes IDS to ENTREZ accession numbers

annots = select(hgu133plus2.db, intensity$PROBEID, "ENTREZID", "PROBEID")

Merge expression matrix and annotation

res = merge(intensity, annots, by= "PROBEID")

Getting rid of PROBE ids and type casting

resmin = res[,2:ncol(res)]
cname = colnames(resmin)
resmin = apply(resmin, 2, as.numeric)
colnames(resmin)= cname
resmin = as.data.frame(resmin)

Aggregating the PROBES matching the same ENTREZ accession number by averaging them

result = aggregate(. ~ ENTREZID, resmin, mean)
result$ENTREZID = levels(as.factor(as.character(resmin$ENTREZID)))
rownames(result) = result$ENTREZID
result = result[,-1]

Visualizing in boxplot to check normalization

boxplot(result, las = 2)

Changing column names to remove the .CEL filename extension

colnames(result) = gsub(".CEL","", colnames(result))
print(colnames(result))

Selecting only the samples we are interested in

result2 = cbind(result$GSM492649_Astrocytomas_N, result$GSM525014, result$GSM525015, result$GSM525016, result$`GSM492662_Astrocytomas_T4-1`, result$`GSM492663_Astrocytomas_T4-2` , result$`GSM492664_Astrocytomas_T4-3`, result$`GSM492665_Astrocytomas_T4-4`, result$`GSM492666_Astrocytomas_T4-5`)
colnames(result2) = c("GSM492649_Astrocytomas_N", "GSM525014", "GSM525015", "GSM525016","GSM492662_Astrocytomas_T4-1", "GSM492663_Astrocytomas_T4-2", "GSM492664_Astrocytomas_T4-3", "GSM492665_Astrocytomas_T4-4", "GSM492666_Astrocytomas_T4-5" )
rownames(result2) = rownames(result)

Preparing the design matrix

Normal = c(rep(1,4),rep(0,5))
Tumor = c(rep(0,4),rep(1,5))
design = cbind(Normal, Tumor)
rownames(design) = colnames(result2)

Preparing the contrast matrix

contr.matrix = makeContrasts(NormalVSTumor = Normal - Tumor, levels = design)

Preparing expression list object

temp = new("EList")
temp$design = design
temp$E = as.matrix(result2)
rownames(temp$E) = as.numeric(rownames(temp$E))
temp$genes$ENTREZ = rownames(result2)
temp$common.dispersion = estimateDisp(temp$E, design = temp$design)$common.dispersion
temp$samples = colnames(result2)

Preparing gene set collection

gs = gage::kegg.gsets(species = "hsa", id.type = "entrez")
geneset = gs$kg.sets

Some GSEA methods do not work properly with exotic gene names, so we need to simplify them

Function for simplifying gene sets names

nameshorter = function(names){
  namemod = c()
  for (i in seq(1,length(names))){
    namemod[i] = paste(strsplit(names[i], " ")[[1]][-1], sep = "", collapse = " ")
    namemod[i] = gsub("/","", names[i])
    namemod[i] = gsub(" ","_", names[i])
  }
  return(namemod)
}

Simplifying gene sets names

names(geneset) = nameshorter(names(geneset))
names(geneset) = gsub("/","_",names(geneset))

Saving necessary objects to RDS files

saveRDS(contr.matrix, "~/GitLab/GSE19728/contrast.rds")
saveRDS(temp, "~/GitLab/GSE19728/elist.rds")
saveRDS(geneset, "~/GitLab/GSE19728/geneset.rds")

Reading necessary objects (generated above) from RDS files

elist = readRDS("~/GitLab/GSE19728/elist.rds")
contrast = readRDS("~/GitLab/GSE19728/contrast.rds")
geneset = readRDS("~/GitLab/GSE19728/geneset.rds")

Running coGSEA analysis

coGSEA(ElistObject = elist, contrastMatrix = contrast, ENTREZGenesIds = elist$genes$ENTREZ, geneSetCollection = geneset,specie = "Homo sapiens", directoryPath = "~/GitLab/GSE19728/results", alpha = 0.05, pvalAdjMethod = "BH", pvalCombMethod = "sumlog",min.intersection.size = 1, GSEA.Methods = c("camera", "gage","globaltest", "gsva", "ssgsea", "zscore", "ora", "padog", "roast","safe"), num.workers = 4, shinyMode = FALSE)

RNAseq data

You can download this example dataset on the Gene Expression Omnibus database with the accession number GSE63310.
This dataset was analyzed in a very detailed article on how to do differential expression analysis that we strongly advise you to read.
The file you're looking for is : GSE63310_RAW.tar

Loading necessary packages

library(edgeR)
library(limma)
library(Mus.musculus)
library(coGSEA)

Reading files

setwd("~/GitLab/GSE63310/data/")
files <- c(
"GSM1545535_10_6_5_11.txt",
"GSM1545536_9_6_5_11.txt",   
"GSM1545538_purep53.txt",
"GSM1545539_JMS8-2.txt",
"GSM1545540_JMS8-3.txt",
"GSM1545541_JMS8-4.txt",
"GSM1545542_JMS8-5.txt",
"GSM1545544_JMS9-P7c.txt",
"GSM1545545_JMS9-P8c.txt")
x <- readDGE(files, columns=c(1,3))

Simplifying file names

colnames(x) = substring(colnames(x), 12, nchar(colnames(x)))

Grouping by sample condition

group <- as.factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"))
x$samples$group <- group

Grouping by lane

lane <- as.factor(rep(c("L004","L006","L008"), c(3,4,2)))
x$samples$lane <- lane
x$samples

Annotation

geneid <- rownames(x)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"),keytype="ENTREZID")
dim(genes)
head(genes)

Getting rid of duplicated annoations by keeping only the first one

genes <- genes[!duplicated(genes$ENTREZID),]

Count per million of reads

cpm <- cpm(x)

Removing genes lowly expressed (genes with 0 expression across all samples)

#Removing genes lowly expressed (genes with 0 expression across all samples)
table(rowSums(x$counts==0)==9)

keep.exprs <- rowSums(cpm>1)>=3
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
dim(x)

Normlization with edgeR

x <- calcNormFactors(x, method = "TMM")
x$samples$norm.factors

Making the design matrix

design <- model.matrix( ~ 0 + group + lane)
colnames(design) <- gsub("group", "", colnames(design))
rownames(design) = colnames(x)

Making the contrast matrix

contr.matrix <- makeContrasts(
  BasalvsLP = Basal-LP,
  BasalvsML = Basal - ML,
  LPvsML = LP - ML,
  levels = colnames(design))

contr.matrix

Applying voom transformation

v <- voom(x, design, plot=F)
v$genes$ENTREZ = rownames(v$E)

Saving objects

saveRDS(v, "~/GitLab/GSE63310/elist.rds")
saveRDS(contr.matrix, "~/GitLab/GSE63310/contrast.rds")

Reading RDS objects (previously genereated above)

elist = readRDS("~/GitLab/GSE63310/elist.rds")
contrast = readRDS("~/GitLab/GSE63310/contrast.rds")

Running coGSEA

coGSEA(ElistObject = elist, contrastMatrix = contrast, ENTREZGenesIds = elist$genes$ENTREZ, geneSetCollection = "C2_KEGG",specie = "Mus musculus", directoryPath = "~/GitLab/GSE63310/results", alpha = 0.05, pvalAdjMethod = "BH", pvalCombMethod = "sumlog",min.intersection.size = 1, GSEA.Methods = c("camera", "gage","globaltest", "gsva", "ssgsea", "zscore", "ora", "padog", "roast","safe"), num.workers = 4, shinyMode = FALSE)

Output

Files

  • abbreviations_[condition_name].csv: gene-sets abbreviations mapping to gene-sets accession numbers and names.

  • geneset_collection_genes.csv: genes of the dataset present in the gene-sets

  • result_[condition_name].csv: result table with indivudual methods ranking, p.value, adjusted p.value. Average ranks, Average logFC, and combined p.value

Plots

  • coGSEA_methods_runtime.pdf: barplot of the runtime of the different GSE methods (in seconds)

  • [condition_name]_clustering.pdf: Clustering of the GSE methods on the gene-sets ranks (bases on raw p.values).

  • [condition_name]_correlation.pdf: Correlation Plot of the GSEA methods on the gene-sets ranks (bases on raw p.values).

  • [condition_name]_pca.pdf: PCA of the GSEA methods on the gene-sets ranks (bases on raw p.values).

  • [condition_name]_eigen_fall.pdf: Fall of the eigen values of the PCA.

  • [condition_name]_snailplot.pdf: Mset Ensemblist plot showing the intersection of the different gene-sets found significantly enriched (p.value < alpha) by each GSEA method. Methods retrieving 0 significantly enriched gene-sets are not included.

  • [condition_name]_upsetr.pdf: UpsetR Ensemblist plot showing the intersection of the different gene-sets found significantly enriched (p.value < alpha) by each GSEA method. Methods retrieving 0 significantly enriched gene-sets are not included. Only exclusive intersections are taken into account.

  • [condition_name]_heatmap.pdf: Binary heatmap showing whether a gene-set is found differentially (red) expressed (p.value < alpha) by a GSEA method or not (blue). Only the gene-sets belonging the biggest intersections size (n), and the second biggest intersections size (n-1) are included.

  • [condition_name]_advanced_sumplot_rank.pdf: Summary plot combining both the logFC on the y-axis and the -log10(p.value) on the x-axis. The size of the bubbles indicates the size of the gene-set (number of genes) while the color indicates the average rank of the gene-set.

  • [condition_name]_advanced_sumplot_direction.pdf: Summary plot combining both the logFC on the y-axis and the -log10(p.value) on the x-axis. The size of the bubbles indicates the significance (combination of p.value and logFC) while the color indicates the direction of logFC.

  • [condition_name]_rank_pval_corplot.pdf: Scatterplot of combined p.values and average ranks with -log10(combined adjusted p.value) on x-axis, and average rank on the y-axis. Sperman correlation coefficient shown in the title.