AbHAC R package

AbHAC: Aberration Hub analysis of Cancer

AbHAC is an R package that implements a simple approach for analysing cancer genomics datasets in the context of protein interaction networks. AbHAC treats each protein in the interaction network as the centre of its own subnetwork. For each protein, it compares the abundance of molecules with genomic or transcriptomic aberrations in that protein's neighbourhood against their abundance in the whole interactome, and calculates a Fisher's exact test p-value. These p-values are then corrected for multiple testing by permuting the protein interaction network. Details are available in the paper/thesis manuscript.
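The neighbourhood test described above can be illustrated with base R alone. The following is a minimal sketch, not the package's own code: the protein names, the aberrant set and the neighbour list are all hypothetical, and the one-sided alternative is an assumption made for the illustration.

```r
# Toy interactome; every protein name here is hypothetical.
interactome <- paste0("P", 1:20)
aberrant    <- c("P1", "P2", "P3", "P4", "P5", "P6")  # proteins carrying an aberration
neighbours  <- c("P1", "P2", "P3", "P4", "P10")       # interaction partners of one protein

in_nbhd  <- interactome %in% neighbours
is_aberr <- interactome %in% aberrant

# 2x2 contingency table: neighbourhood membership vs aberration status
tab <- table(neighbourhood = factor(in_nbhd,  levels = c(TRUE, FALSE)),
             aberrant      = factor(is_aberr, levels = c(TRUE, FALSE)))

# One-sided test: are aberrant proteins over-represented among the neighbours?
res <- fisher.test(tab, alternative = "greater")
res$p.value
```

Repeating this for every protein in the network yields one p-value per protein, which is why a multiple testing correction is needed afterwards.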

Usage: Required objects

snv: a matrix/dataframe whose column names are sample names and whose row names are gene names. Each cell is either NA or a character string (e.g. "Mutated").

rna: a numeric matrix/dataframe whose first columns are the tumour samples, named as in snv and in the same order, with T appended to each name. For example, if the sample names in snv are a | b | c ..., in rna they should be aT | bT | cT ... . The tumour columns must be followed by the nontumour samples, whose names end with N. Nontumour samples may share the tumour sample names (aN | bN ...) or use different ones (a2313N | a321bchN).
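As a rough illustration of the layout described above, the sketch below builds toy snv and rna objects and checks the naming convention. The object names snv.toy and rna.toy, the sample names and the expression values are made up for the example.

```r
# Hypothetical sample names and toy data, used only for illustration.
samples <- c("a", "b", "c")
genes   <- c("TP53", "KRAS", "YWHAH")

# snv: genes in rows, samples in columns; NA means no mutation recorded
snv.toy <- matrix(NA_character_, nrow = length(genes), ncol = length(samples),
                  dimnames = list(genes, samples))
snv.toy["TP53", "a"] <- "Mutated"
snv.toy["KRAS", "c"] <- "Mutated"

# rna: tumour columns (suffix T) in the same order as snv,
# followed by nontumour columns (suffix N)
set.seed(1)
rna.toy <- matrix(rnorm(length(genes) * 6), nrow = length(genes),
                  dimnames = list(genes,
                                  c(paste0(samples, "T"), "aN", "bN", "x17N")))

# Check the naming convention
tumour.cols <- grep("T$", colnames(rna.toy), value = TRUE)
normal.cols <- grep("N$", colnames(rna.toy), value = TRUE)
stopifnot(identical(sub("T$", "", tumour.cols), colnames(snv.toy)))
```

A check like the final line is a cheap way to catch misordered or misnamed columns before running an analysis.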

Usage: Optional objects

clinical: a dataframe whose first column holds the same sample names as snv and whose second column describes each sample. The descriptions can be Metastasis/Primary, HighGrade/LowGrade or any other set of strings describing patient subtypes.
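A clinical object of this shape might look like the sketch below. The column names, sample names and subtype labels are assumptions chosen for the example; the description above only fixes the order of the two columns.

```r
# Hypothetical samples and subtype labels.
snv.samples  <- c("a", "b", "c")
clinical.toy <- data.frame(sample  = snv.samples,
                           subtype = c("Primary", "Metastasis", "Primary"),
                           stringsAsFactors = FALSE)

# Every sample in the first column should also be a column of snv
stopifnot(all(clinical.toy[, 1] %in% snv.samples))

# Number of samples per subtype
subtype.counts <- table(clinical.toy[, 2])
subtype.counts
```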

ppi.database: By default, the package uses a protein interaction network built with PSICQUIC by querying for UniProt accession IDs obtained through the UniProt.ws package. The databases used to generate this dataframe are:

DIP, InnateDB, IntAct, MatrixDB, MINT, I2D-IMEx, InnateDB-IMEx, MolCon and BindingDB

The AbHAC functions use only the first two columns of this dataframe. The IDs must be UniProt accessions.
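To make the two-column format concrete, here is a small sketch of how such an interaction table can be queried in base R. The identifiers ACC1 to ACC4 are placeholders standing in for UniProt accessions, and neighbours.of is a hypothetical helper, not an AbHAC function.

```r
# Toy interaction table; identifiers are placeholders, not real accessions.
ppi.toy <- data.frame(A = c("ACC1", "ACC1", "ACC2", "ACC3"),
                      B = c("ACC2", "ACC3", "ACC3", "ACC4"),
                      stringsAsFactors = FALSE)

# All proteins present in the network
proteins <- unique(c(ppi.toy[, 1], ppi.toy[, 2]))

# Neighbours of one protein, treating interactions as undirected
neighbours.of <- function(ppi, protein) {
  unique(c(ppi[ppi[, 1] == protein, 2], ppi[ppi[, 2] == protein, 1]))
}
nb <- neighbours.of(ppi.toy, "ACC3")

# Edge degree of every protein
degree <- table(c(ppi.toy[, 1], ppi.toy[, 2]))
```

Treating the edges as undirected is an assumption of this sketch; it means the order of the two columns does not matter.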

id.conversion.set: A dataframe with the following column names:

ENTREZ_GENE	UNIPROTKB	GENES	ENSEMBL	REFSEQ_PROTEIN
7533	Q04917	YWHAH	ENSG00000128245	NP_003396
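As an illustration of how a table with these columns can translate identifiers, the sketch below maps gene symbols to UniProt accessions with match. It reuses the single example row above; id.toy and the unmapped symbol NOT_A_GENE are made up for the example.

```r
# One-row conversion table built from the example row above.
id.toy <- data.frame(ENTREZ_GENE    = "7533",
                     UNIPROTKB      = "Q04917",
                     GENES          = "YWHAH",
                     ENSEMBL        = "ENSG00000128245",
                     REFSEQ_PROTEIN = "NP_003396",
                     stringsAsFactors = FALSE)

# Translate gene symbols to UniProt accessions; unmapped symbols become NA
gene.symbols <- c("YWHAH", "NOT_A_GENE")
uniprot <- id.toy$UNIPROTKB[match(gene.symbols, id.toy$GENES)]
uniprot
```

Genes that come back as NA have no entry in the conversion table and cannot be placed in the interaction network.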

Usage: Examples

Installing the package and all of its dependencies:

#CRAN dependencies
install.packages(c("devtools", "foreach", "doMC", "iterators", "plyr", "BiocManager"))
#Bioconductor dependency (the package name is case-sensitive: edgeR)
BiocManager::install("edgeR")
#Installing AbHAC from GitHub and loading it
library(devtools)
install_github("mehrankr/AbHAC")
library("AbHAC")

abhac.brief is intended for cases where a particular set of genes is of interest and we want to find proteins that interact with a significant number of those genes. The genes of interest may be mutated (snv), upregulated (de.up) or downregulated (de.down).

Running abhac.brief with vector of mutated/upregulated/downregulated genes:

#Loading matrix of mutated genes and matrix of mRNA expression
data(snv)
data(rna)

#Randomly selecting 10 mutated genes, and 500 genes each to stand in for the upregulated and downregulated sets
snv = sample(rownames(snv), 10)
de.up = sample(rownames(rna)[1:1000], 500)
de.down = sample(rownames(rna)[1001:2000], 500)

#Loading the default protein interaction data
data(ppi.database) 

#Loading dataframe used for converting IDs
data(id.conversion.set)

#Loading fac, a vector of all proteins in ppi.database
data(fac)

abhac.brief.result = abhac.brief(de.up,de.down,fac=fac,snv=snv,
	enrichment.categories=c("snv.de","de.up"),
	ppi.database=ppi.database[,1:2],
	id.conversion.set=id.conversion.set)

If, instead of a preselected set of differentially expressed genes, we have an RPKM matrix from RNA-seq or normalized mRNA expression values, AbHAC can identify differentially expressed genes using edgeR/limma. This is done through the set.abhac function, which accepts the snv and rna matrices as input. Another important feature of this function is that subtype/phenodata of patients can be provided in a two-column object called clinical.

#Loading example and default objects from the package
data(snv)
data(rna)
data(ppi.database) #two-column human protein interaction database
data(id.conversion.set)
data(fac) #vector of all proteins in ppi.database


set.abhac.result = set.abhac(snv=snv,rna=rna,fac=fac,
   expression.method="Microarray",rna.paired=FALSE,
   fdr.cutoff=0.05,correction.method="BH",enrichment.categories=c("snv.de","de.up"),
   ppi.database=ppi.database[,1:2],id.conversion.set=id.conversion.set)

Important parameters

fisher.fdr: Method used to correct the Fisher's exact test p-values. It defaults to the permutation method described in the paper, and can also be set to any method accepted by p.adjust. The permutation-based methods are Permutation.FDR and Permutation.FWER. If either of these is selected, the parameters described below become important.
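For the non-permutation options, the correction is the one provided by base R's p.adjust. The sketch below shows it on a made-up vector of p-values; the values and the 0.05 cutoff are illustrative only.

```r
# Hypothetical Fisher's exact test p-values for six proteins.
p.values <- c(0.001, 0.008, 0.02, 0.04, 0.3, 0.7)

# Method names accepted by p.adjust
p.adjust.methods

# Benjamini-Hochberg and Bonferroni corrections
adj.bh   <- p.adjust(p.values, method = "BH")
adj.bonf <- p.adjust(p.values, method = "bonferroni")

# Proteins passing an illustrative cutoff of 0.05 after BH correction
significant <- which(adj.bh <= 0.05)
significant
```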

fisher.fdr.cutoff: Set to 0.05 by default.

num.permuted.ppi: Number of permuted protein interaction networks to generate for multiple testing correction.

method.permuted.ppi: There are three options: AsPaper, ByDegree or equal.

AsPaper: Each edge degree shared by more than 4 proteins forms its own bin, and proteins are permuted within that bin. Proteins whose edge degree is shared by fewer than 4 proteins are pooled and grouped into k bins, where k is defined by bins.permuted.ppi.
ByDegree: Proteins are grouped into k categories, determined by bins.permuted.ppi, without splitting proteins of the same edge degree across categories. The resulting bins vary in size: some are very small, some very large.
equal: Creates bins containing equal numbers of proteins. When thousands of proteins share the same edge degree, they are randomly distributed among the closest bins. This is done by ranking the proteins by edge degree with ties.method = "random" in R's rank function.

bins.permuted.ppi: Number of bins into which the proteins in the network are grouped before being permuted within each bin. See the description of method.permuted.ppi for how the bins are formed.
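The idea behind the equal option can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: the edge degrees are invented, k stands in for bins.permuted.ppi, and the way ranks are cut into bins is one plausible choice.

```r
set.seed(42)

# Hypothetical edge degrees for 12 proteins.
degree <- c(P1 = 1, P2 = 1, P3 = 1, P4 = 1, P5 = 2, P6 = 2,
            P7 = 3, P8 = 5, P9 = 8, P10 = 13, P11 = 21, P12 = 40)
k <- 3  # plays the role of bins.permuted.ppi

# Rank by degree, breaking ties at random, then cut ranks into k equal-sized bins
r   <- rank(degree, ties.method = "random")
bin <- ceiling(r * k / length(degree))

# Shuffle protein labels within each bin
permuted <- names(degree)
for (b in seq_len(k)) {
  idx <- which(bin == b)
  permuted[idx] <- names(degree)[idx[sample.int(length(idx))]]
}

# Lookup from original label to the label that replaces it
mapping <- setNames(permuted, names(degree))
```

Applying mapping to both columns of an interaction table gives a permuted network in which every protein is swapped only with proteins of similar edge degree.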

Nomenclature:

In Old Irish, abhac means a dwarf star.

Maintainer: mehran dot karimzadeh at uhnresearch dot ca or mehran dot karimzadehreghbati at mail dot mcgill dot ca