The following scripts illustrate how to fit some of the models presented in Vazquez et al., Genetics, 2016, the scripts are also provided at: https://github.com/anainesvs/VAZQUEZ_etal_GENETICS_2016, please refer to that webpage for updates.
Contact: avazquez@msu.edu
The code below illustrates how to install and load the necessary package from CRAN using install.packages()
.
install.packages(pkg='BGLR') # install BGLR
library(BGLR);
Data. The code assumes that the user has saved in the file OMIC_DATA.rda
the objects that contain the phenotypic information, clinical covariates, and omic data. The code assumes that the file OMIC_DATA.rda
contain the following objects:
XF
: an incidence matrix for clinical covariates.Xge
: an incidence matrix for gene expression.Xmt
: an incidence matrix for methylation values at various sites.y
: a vector with the response, in this case a 0/1 where 0 denotes alive. The code below assumes that all the predictors were edited by removing outliers and predictors that did not vary in the sample, transformed if needed, and missing values were imputed.
Some of the models fitted in the study use similarity matrices of the form G=XX' computed from omics. The following code illustrates how to compute this matrix for gene expression. A similar code could be use to compute a G-matrix for methylation or other omics (see (6)).
load('OMIC_DATA.rda')
#Computing a similarity matrix for gene-expression data
Xge<- scale(Xge, scale=true, center=TRUE) #centering and scaling
Gge<-tcrossprod(Xge) #computing crossproductcts
Gge<-Gge/mean(diag(Gge) #scales to an average diagonal value of 1.
NOTE: for larger data sets it may be more convenient to use the getG()
function of the BGData R-package. This function allows computing G without loading all the data in RAM and offers methods for multi-core computing.
The following code illustrates how to use BGLR to fit a fixed effects model. The matrix XF is an incidence matrix for clinical covariates. There is no column for intercept in XF because BGLR adds the intercept automatically. The response variable y
is assumed to be coded with two labels (e.g., 0/1), the argument response_type
is used to indicate to BGLR that the response is ordinal (the binary case is a special case with only two levels). Predictors are given to BGLR in the form a two-level list. The argument save_at
can be used to provide a path and a prefix to be added to the files saved by BGLR. For further details see Pérez-Rodriguez and de los Campos, Genetics, 2014. The code also shows how to retrieve estimates of effects and of success probabilities. In the examples below we fit the model using the default number of iterations (1,500) and burn-in (500). In practice longer chains are needed; the user can increase the number of iterations or the burn-in using the arguments nIter
and burnIn
of BGLR
.
# Inputs
XF<- scale(XF, scale=FALSE, center=TRUE) # centering and scaling the incidence matrix for fixed effects.
ETA.COV<-list( COV=list(X=XF, model='FIXED') )
# Fitting the model
fm=BGLR(y=y, ETA=ETA.COV, saveAt='cov_', response_type='ordinal')
# Retrieving estimates
fm$ETA$COV$b # posterior means of fixed effects
fm$ETA$COV$SD.b # posteriro SD of fixed effects
head(fm$probs) # estimated probabilities for the 0/1 outcomes.
(5) Fitting a binary model for fixed effects and whole genome gene expression (GE) using BGLR (COV+GE)
The following code illustrates how to use BGLR to fit a mixed effects model that accomodates both clinical covariates and whole-genome-gene expression.
# Setting the linear predictor
ETA.COV.GE<-list( COV=list(X=XF, model='FIXED'), GE=list(K=Gge, model='RKHS'))
# Fitting the model
fm.COV.GE<- BGLR(y=y, ETA=ETA.COV.GE, response_type='ordinal',saveAt='cov_ge_')
# Retrieving predictors
fm.COV.GE$mu # intercept
fm.COV.GE$ETA$COV$b # effects of covariates
fm$COV.GE$ETA$GE$varU # variance associated to GE SD.varU gives posterior SD
fm.COV.GE$ETA$GE$u # random effects associated to gene expression
plot(scan('cov_ge_ETA_GE_varU.dat'),type='o',col=4) # trace plot of variance of GE.
NOTE: to fit a similar model for COV+METH one just needs to change the inputs in the definition of the linear predictor by providing Gmt instead of Gge.
The following code shows how to extend the the model COV+GE
with addition of methylation data.
#Computing a similarity matrix for methylation data
Xmt<- scale(Xmt, scale=TRUE, center=TRUE) #centering and scaling
Gmt<-tcrossprod(Xmt) #computing crossproductcts
Gmt<-Gmt/mean(diag(Gmt) #scales to an average diagonal value of 1.
ETA.COV.GE.MT<-list( COV=list(X=XF, model='FIXED'),
GE=list(K=Gge, model='RKHS'),
METH=list(K=Gmt, model='RKHS'))
# Fitting models
fm.COV.GE.MT<- BGLR(y=y, ETA=ETA.COV.GE.MT,
response_type='ordinal',saveAt='cov_ge_mt_')
(7) Fitting a binary model for fixed effects covariates and 2 omics and their interactions (COV+GE+METH+GExMETH)
The following code shows how to extend the the model COV+GE+METH
with addition of interactions between gene expression and methylation profiles.
G.mg=Gmt*Gge
G.mg=G.mg/mean(diag(G.mg))
ETA.COV.GE.MT.GExMT<-list( COV=list(X=XF, model='FIXED'),
GE=list(K=Gge, model='RKHS'),
METH=list(K=Gmt, model='RKHS'),
GExMETH=list(K=G.mg, model='RKHS'))
# Fitting models
fm.COV.GE.MT.GExMT<- BGLR(y=y, ETA=ETA.COV.GE.MT.GExMT,
response_type='ordinal',saveAt='cov_ge_mt_gexmt')
The following illustrates how to select a validation set using the model COV
as example.
#Installing and loading library pROC to compute Area Under the ROC Curve.
install.packages(pkg='pROC') # install pROC
library(pROC);
n <- length(y)
# Randomly select a 20% of the data to be the testing set
tst<- runif(n) <0.2
yNA = y; yNA[tst] <-NA
# Fit the model only in the training set
fm.COVtr<- BGLR(y=yNA, ETA=ETA.COV, response_type='ordinal')
# Find probability of survival for the testing set
pred <-fm.COV2$probs[tst,2]
# Estimate AUC
AUC_train<-auc(y[-tst],fm.COV2$yHat[-tst])
AUC_test<-auc(y[tst], pred)
#For the first individual, area under the standard normal curve (CDF) of estimated y from full model:
pnorm(fm.COV$yHat[1])
NOTE: if sample size is small (like TCGA data) and uneven in the number of 1s and 0s it will be wise to randomize 1s and 0s to be part of the testing sets, and repeate the validation multiple times. In Vazquez et al., 2016 (Genetics) we implement 200 cross-validations.