Nowadays, the way to create a co-expression network is to use Hierarchical Clustering. This package allows you to create a co-expression network using the glmnet algorithm. In addition, the network created will be relative to a covariate of the sample to be studied, so we are creating a supervised coexpression network.
CovCoExpNets can be installed with the following command:
devtools::install_github("guillermo1996/CovCoExpNets")
Suppose we have an expression matrix, data
, with blood samples as
columns and genes as rows. Each sample is identified by the numerical
values taken by the genes. The particular covariate we are going to
study is the age of the samples’ donors, given in a numerical vector
denoted as age
.
The hub gene detection algorithm is based on Lasso repetitions, where
the random effects of highly correlated variables are reduced by
applying a certain number of repetitions t
. The genes selected as
relevant will constitute the final dataset to execute a Lasso
regularization again. From the final model generated, we extract the
relevant genes and their coefficients as their relevance to the given
covariate.
The process starts by preprocessing the data. The specific details of
this process are shown in its respective tutorial (Data
preparation). As a summary, the dataset is first
transformed to a logarithmic scale. Then, we require a minimum
activation of 0.1 in at least 80% of the samples, followed by the
removal of all non-protein coding genes and those with a low variation
across the samples. Lastly, we centralize and normalize the expression
matrix. All of these steps are encompassed in the dataPreprocess()
function, where every step can be individually modified or executed as
requested.
library(CovCoExpNets)
library(dplyr)
raw_data <- readRDS("raw_data.rds")
data <- dataPreprocess(raw_data, includeSexChromosomes = T)
As seen in the example, we can also filter by genes located in autosomal
chromosomes. The specific covariate to study, the age, also needs to be
normalized with the normalize()
function:
raw_age <- readRDS("raw_age.rds")
age <- normalize(raw_age)
m <- age$mean
d <- age$standard.deviation
age <- age$covariate
The normalize()
function also returns the mean and the standard
deviation of the vector. This data will be later used to restore the
predictions to an age scale.
The next step in the hub gene detection is to execute the Lasso
repetitions. The chosen Lasso implementation is the glmnet
algorithm.
With the function geneFrequency()
, we set the number of repetitions
(use t=10
as a baseline). We can also set the train/test split and the
initial seed to ensure reproducible results. We will then use the
function reduceGenes()
to specify the minimum relative frequency of
appearance that a gene must have to be selected for the final model
execution. In this example, we will set this hyperparameter to mrfa = 0.9
.
genes_freq <- geneFrequency(data, age, t = 10, train.split = 0.8, seed = 1796)
genes_subset <- reduceGenes(genes_freq, mrfa = 0.9, relative = T)
Once we have a first selection of relevant genes, we execute the
function glmnetGenesSubset()
to calculate the final model and extract
the relevant genes from
it.
glmnet_model <- glmnetGenesSubset(data, age, genes_subset, train.split = 0.8, seed = 1796, evaluate.model = T, m = m, d = d)
cvfit <- glmnet_model$cvfit
evaluation <- glmnet_model$evaluation
We set the same train.split
and seed
as for the glmnet
repetitions
to ensure that the same training dataset was used to generate the model.
Since we set the argument evaluate.model
to TRUE
, the returned
predictor will contain both the generated model and the evaluation
(i.e. RMSE). With information about the mean and standard deviation
provided, the results will be in
years.
evaluation %>% mutate(across(where(is.numeric), round, 3)) %>% knitr::kable()
Condition | rmse.test | rmse.train | rmse.bootstrap | rmse.null | Samples.train | Samples.test | Initial.predictors | Returned.predictors | r2.train | r2.test | r2_adj.train | r2_adj.test |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Condition | 5.676 | 0.322 | 3.706 | 9.004 | 204 | 51 | 15516 | 173 | 0.999 | 0.598 | 0.993 | 1.163 |
To extract the relevant genes from the model, we use the function
extractModelGenes
:
relevant_genes <- extractModelGenes(cvfit)
knitr::kable(relevant_genes %>% head(5))
Genes | Coefficients |
---|---|
EDA2R | 0.2877469 |
BAIAP2L2 | 0.1573385 |
ADRA2B | -0.1251515 |
GPR26 | -0.1232131 |
GFAP | 0.1187436 |
Other tutorials available to better understand how this package works are found in the docs directory:
This package is based on the package glmnet, available at the following URL: https://cran.r-project.org/web/packages/glmnet/glmnet.pdf. It is also a fork of the package SuCoNets, developed by Carmen María Hernandez. Both CovCoExpNets and SuCoNets have been supervised by both Juan A. Botía (Universidad de Murcia) (https://github.com/juanbot) and Alicia Gómez Pascual (Universidad de Murcia), who also contributed to its design.