The conditional dependence structure (CDS) of a data matrix with $n$ observations of $p$ variables can be modeled as an undirected graph with $p$ vertices, where two variables are connected if, and only if, they are dependent given the remaining variables. Gaussian graphical modeling (GGM) seeks to capture the CDS under the assumption that the data are normally distributed; this assumption is convenient for inference, as the CDS is then given by the sparsity structure of the precision matrix, defined as the inverse covariance matrix of the data.

There is extensive GGM literature and many R packages for GGM; however, all make the restrictive assumption that the precision matrix is homogeneous throughout the data, or that there exists a partition of homogeneous subgroups. `covdepGE` avoids this strong assumption by utilizing information sharing to model the CDS as varying continuously with an extraneous covariate. Intuitively, this implies that observations having similar extraneous covariate values will have similar precision matrices.

To facilitate information sharing while managing complexity, `covdepGE` uses an efficient variational approximation conducted under the novel weighted pseudo-likelihood framework proposed in (1). `covdepGE` further accelerates inference by employing parallelism and executing expensive iterative computations in C++. Additionally, `covdepGE` offers a principled, data-driven approach for hyperparameter specification that only requires the user to input data and extraneous covariates to perform inference. Finally, `covdepGE` offers several wrappers around `ggplot2` for seamless visualization of resulting estimates, such as `matViz`, `inclusionCurve`, and the S3 method `plot.covdepGE`.

Basic usage of `covdepGE` is demonstrated below in the Example section using simulated data, while <https://github.com/JacobHelwig/covdepGE/blob/master/examples/TCGA_analysis.md> demonstrates usage on real-world data.

You can install the released version of `covdepGE` from CRAN with:

``` r
install.packages("covdepGE")
```

And the development version from GitHub with:

``` r
# install.packages("devtools")
devtools::install_github("JacobHelwig/covdepGE")
```

If you have an idea to improve `covdepGE`, consider forking the repo and creating a pull request, or opening an issue.

``` r
library(covdepGE)
#> Warning: package 'covdepGE' was built under R version 4.2.2
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.2.3
```

``` r
# get the data
set.seed(12)
data <- generateData()
X <- data$X
Z <- data$Z
interval <- data$interval
prec <- data$true_precision

# get overall and within interval sample sizes
p <- ncol(X)
n <- nrow(X)
n1 <- sum(interval == 1)
n2 <- sum(interval == 2)
n3 <- sum(interval == 3)

# visualize the distribution of the extraneous covariate
ggplot(data.frame(Z = Z, interval = as.factor(interval))) +
  geom_histogram(aes(Z, fill = interval), color = "black", bins = n %/% 5)
```

``` r
# visualize the true precision matrices in each of the intervals

# interval 1
matViz(prec[[1]], incl_val = TRUE) +
  ggtitle(paste0("True precision matrix, interval 1, observations 1,...,", n1))
```

``` r
# interval 2 (varies continuously with Z)
cat("\nInterval 2, observations ", n1 + 1, ",...,", n1 + n2, sep = "")
#> 
#> Interval 2, observations 61,...,120
int2_mats <- prec[interval == 2]
int2_inds <- c(5, n2 %/% 2, n2 - 5)
lapply(int2_inds, function(j) matViz(int2_mats[[j]], incl_val = TRUE) +
         ggtitle(paste("True precision matrix, interval 2, observation", j + n1)))
#> [[1]]
#> 
#> [[2]]
#> 
#> [[3]]
```

``` r
# interval 3
matViz(prec[[length(prec)]], incl_val = TRUE) +
  ggtitle(paste0("True precision matrix, interval 3, observations ",
                 n1 + n2 + 1, ",...,", n1 + n2 + n3))
```

``` r
# fit the model and visualize the estimated graphs
(out <- covdepGE(X, Z, parallel = T, num_workers = p))
#> Warning in covdepGE(X, Z, parallel = T, num_workers = p): No registered workers
#> detected; registering doParallel with 5 workers
#>                 Covariate Dependent Graphical Model
#> 
#> ELBO: -171501.68                           # Unique Graphs: 3
#> n: 180, variables: 5         Hyperparameter grid size: 125 points
#> Model fit completed in 2.35 secs
```

``` r
plot(out)
#> [[1]]
#> 
#> [[2]]
#> 
#> [[3]]
```

``` r
# visualize the posterior inclusion probabilities for variables (1, 2) and (1, 3)
inclusionCurve(out, 1, 2)
inclusionCurve(out, 1, 3)
```

Math is correctly rendered in the version of this document available on CRAN.
Suppose that $X$ is an $n \times p$ data matrix of independent observations and that $Z$ is an $n \times q$ matrix of extraneous covariates, where the precision matrix of the $l$-th observation $x_l$ varies continuously with $z_l$, the $l$-th row of $Z$. Given data satisfying these assumptions, the `covdepGE` function employs the algorithm described in (1) to estimate a graphical representation of the structure of $X$ as a continuous function of $Z$.

Graphs are constructed using a pseudo-likelihood approach by fixing each of the columns $X_j$ of $X$ as the response in turn and regressing it on the remaining columns. The posterior inclusion probability (PIP) of variable $k$ in the regression with $X_j$ fixed as the response and the PIP of variable $j$ in the regression with $X_k$ fixed as the response are symmetrized according to `sym_method` (e.g., by taking the mean of the two PIPs). If the symmetrized PIP exceeds `edge_threshold`, an edge will be included between variables $j$ and $k$.

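
As a brief sketch of how these arguments fit together (the argument names are from the package; the particular values are illustrative):

``` r
# average the two PIPs and include an edge when the result exceeds 0.5
out <- covdepGE(X, Z, sym_method = "mean", edge_threshold = 0.5)
```
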
To model the CDS of observation $l$ as a function of $z_l$, each of these regressions is fit $n$ times, once per observation, with the data weighted in the $l$-th fit according to the similarity of the other observations' extraneous covariates to $z_l$.

Spike-and-slab posterior quantities are estimated using a block mean-field variational approximation. Coordinate Ascent Variational Inference (CAVI) is performed for each of the weighted regressions to select the variational parameters that maximize the ELBO. The parameters for each of the regression coefficients are the mean and variance of the slab, along with the probability that the coefficient is non-zero (the alpha parameters). CAVI is run until the maximum absolute change in the alpha parameters is within `alpha_tol`, or after `max_iter` iterations of CAVI have been performed.

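
Convergence behavior can thus be tuned directly (real arguments; the values shown are illustrative rather than the defaults):

``` r
# stop CAVI once the alpha parameters change by less than 1e-6 between
# iterations, or after 200 iterations, whichever comes first
out <- covdepGE(X, Z, alpha_tol = 1e-6, max_iter = 200)
```
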
Note that since the regressions performed with different variables fixed as the response are independent of one another, they may be performed in parallel by setting `parallel = T`. Registering a parallel backend with more than $p$ workers offers no benefit, since at most $p$ regressions run simultaneously.

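
If no backend is registered, `covdepGE` registers one itself (see the warning in the Example above). To control the backend explicitly, register it before fitting; a minimal sketch assuming the `doParallel` package:

``` r
# register a doParallel backend with one worker per variable
library(doParallel)
registerDoParallel(cores = p)
out <- covdepGE(X, Z, parallel = TRUE, num_workers = p)
```
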
Each regression requires the specification of three hyperparameters: $\pi$ (the prior probability of inclusion), $\sigma^2$ (the prior residual variance), and $\sigma^2_\beta$ (the prior variance of the slab). `covdepGE` offers three methods for hyperparameter specification via the `hp_method` argument: `grid_search`, `model_average`, and `hybrid`.

Empirically, `grid_search` offers the best sensitivity and `model_average` offers the best specificity, while `hybrid` sits between the other two methods in both metrics.

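
The method is chosen per fit; for example:

``` r
# compare the three hyperparameter specification methods on the same data
out_grid <- covdepGE(X, Z, hp_method = "grid_search")
out_avg  <- covdepGE(X, Z, hp_method = "model_average")
out_hyb  <- covdepGE(X, Z, hp_method = "hybrid")
```
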
The hyperparameter candidate grid is generated by taking the Cartesian product between `ssq`, `sbsq`, and `pip` (candidate values for $\sigma^2$, $\sigma^2_\beta$, and $\pi$, respectively).

In `grid_search`, the point from the grid that produces the model that has the greatest total ELBO is selected, where the total ELBO is calculated by summing the ELBO for each of the $n$ weighted regressions with a given variable fixed as the response.

Instead of selecting only one model as in `grid_search`, models are averaged over in `model_average`. With a given variable fixed as the response, the unnormalized weights used to average over the grid points are calculated by exponentiating the ELBO for each of the $n$ regressions, so each of the $n$ sets of averaged results may use different weights.

Finally, `hybrid` combines `grid_search` and `model_average`. Fixing `pip`, the point in the grid defined by the Cartesian product of `ssq` and `sbsq` is selected by maximizing the total ELBO summed across the $n$ regressions. Then, fixing $\sigma^2$ and $\sigma^2_\beta$ at the selected values, models are averaged over the candidate values in `pip` as in `model_average`.

Note that in the search step of `grid_search` and `hybrid`, CAVI for each of the grid points is performed for at most `max_iter_grid` iterations. A second CAVI is then performed for `max_iter` iterations using the hyperparameters selected in the search step. Setting `max_iter_grid` to be less than `max_iter` (as is the default) will result in a more efficient search.

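
For instance (real arguments; the values are illustrative):

``` r
# short CAVI runs while searching the grid, then a longer final CAVI
out <- covdepGE(X, Z, hp_method = "grid_search",
                max_iter_grid = 10, max_iter = 100)
```
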
The candidate grids (`ssq`, `sbsq`, and `pip`) may be passed as arguments; however, by default, these grids are generated automatically. Each of the grids is spaced uniformly between an upper endpoint and a lower endpoint. The number of points in each grid is controlled by the arguments `nssq`, `nsbsq`, and `npip`. The lower endpoints (`ssq_lower`, `sbsq_lower`, and `pip_lower`) are all `1e-5` by default. The upper endpoints are calculated dependent on $X_j$, the variable fixed as the response.

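
Custom grids can also be supplied directly (real arguments; the particular values are illustrative):

``` r
# override the automatically generated grids with user-specified candidates
out <- covdepGE(X, Z,
                ssq = c(0.1, 0.5, 1),     # candidates for sigma^2
                sbsq = c(0.5, 1, 2),      # candidates for sigma^2_beta
                pip = c(0.05, 0.1, 0.2))  # candidates for pi
```
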
`ssq_upper` is simply the variance of $X_j$ multiplied by `ssq_mult`. By default, `ssq_mult` is `1.5`.

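
In code, this amounts to a one-liner (a sketch, with `j` indexing the variable fixed as the response):

``` r
# default upper endpoint of the ssq grid for the j-th regression
ssq_upper <- ssq_mult * var(X[, j])
```
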
`pip_upper` is calculated by regressing the remaining variables on $X_j$ using LASSO, with the shrinkage parameter chosen as `lambda.1se`. The number of non-zero coefficients estimated by LASSO is then divided by `p - 1` to calculate `pip_upper`. Note that if the LASSO estimate of the number of non-zero coefficients is `0` or `p - 1`, the estimate is adjusted to ensure that `pip_upper` is greater than `0` and less than `1`.

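
A minimal sketch of this calculation, assuming the LASSO fit is obtained with `glmnet` (the package's internal implementation may differ in its details):

``` r
# estimate pip_upper for the j-th regression via cross-validated LASSO
library(glmnet)
lasso_fit <- cv.glmnet(X[, -j], X[, j])
n_nonzero <- sum(coef(lasso_fit, s = "lambda.1se")[-1] != 0) # drop the intercept
n_nonzero <- min(max(n_nonzero, 1), p - 2)                   # keep pip_upper in (0, 1)
pip_upper <- n_nonzero / (p - 1)
```
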
Finally, an upper bound on the signal-to-noise ratio is derived as a function of `pip_upper` and `sbsq_upper`. Setting this bound equal to `snr_upper` gives an expression for `sbsq_upper`.

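
A hypothetical sketch of this solve, assuming the bound takes the form `pip_upper * sbsq_upper * sum(apply(X, 2, var))` (this functional form is an assumption for illustration, not taken from the package documentation):

``` r
# solve the assumed SNR bound snr_upper = pip_upper * sbsq_upper * total_var
total_var <- sum(apply(X, 2, var))
sbsq_upper <- snr_upper / (pip_upper * total_var)
```
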
The similarity weight of observation $k$ in the regression for observation $l$ is computed by applying a normal kernel with bandwidth $\tau$ to the distance between $z_k$ and $z_l$, where the distance is measured according to the `norm` argument. `tau` may be passed as an argument; however, by default, it is estimated using the methodology given in (2). (2) describes a two-step approach for density estimation, where, in the first step, an initial estimate is calculated using Silverman's rule of thumb for initializing bandwidth values, and, in the second step, the density is refined by updating the bandwidth values. This methodology is used here to estimate the density of $Z$, and the resulting bandwidths are used for `tau`.

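
A minimal sketch of such weights, assuming a single shared bandwidth and the Euclidean norm (`covdepGE` supports per-observation bandwidths and other norms via `tau` and `norm`):

``` r
# unnormalized normal-kernel similarity weights for the l-th regression
similarity_weights <- function(Z, l, tau) {
  Z <- as.matrix(Z)
  d2 <- rowSums((Z - matrix(Z[l, ], nrow(Z), ncol(Z), byrow = TRUE))^2)
  exp(-d2 / (2 * tau^2))
}
```
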
1.  Dasgupta, Sutanoy, Peng Zhao, Jacob Helwig, Prasenjit Ghosh, Debdeep Pati, and Bani K. Mallick. "An Approximate Bayesian Approach to Covariate-dependent Graphical Modeling." arXiv preprint arXiv:2303.08979 (2023).

2.  Dasgupta, Sutanoy, Debdeep Pati, and Anuj Srivastava. "A Two-Step Geometric Framework for Density Modeling." Statistica Sinica 30, no. 4 (2020): 2155-2177.