We have made extensions to the GpGp
R package, which have not been pulled into the main GpGp repo. To install our version of GpGp
package with the necessary extensions, please clone and install the squared_relevance
branch of this repository.
To reproduce the figures and results in Cao, Guinness, Genton, Katzfuss (2022), please enter the corresponding directory and run the .R
or .Rmd
file.
For GP_approx_cmp
, first run Vecchia_FIC_PIC_compare.Rmd
until the simulation
Section, then run gpflow_SVGP.ipynb
, and finally run the simulation
and plot
Section of Vecchia_FIC_PIC_compare.Rmd
because the FITC method is implemented in the GPytorch
python package.
For real_dataset
, the raw data should be first downloaded (e.g., from UCI) to the data
folder. Then, run the corresponding section of dataset_prep.Rmd
to pre-process the data. Finally, change the datasetName
variable in the main.Rmd
to run the real-data application study.
For the temperature_data
, leading spaces should first be removed with a text editor.
For the sparse additive model implemented in SAM
R package, the input is jittered slightly. Otherwise, some datasets, e.g., the slicing dataset, reported error.
Clean the environment.
rm(list = ls())
Install a modified version of the ''GpGp'' R package.
devtools::install_github("https://github.com/katzfuss-group/GpGp.git",
ref = "squared_relevance")
Load needed libraries and R files,
source("func/misc.R", chdir = T)
source("func/FBselect.R", chdir = T)
source("func/stop_cond.R", chdir = T)
source("func/locs_gen.R", chdir = T)
source("func/y_gen.R", chdir = T)
source("func/Vecc_path.R", chdir = T)
source("func/penalty.R", chdir = T)
Generate locs
. locs
should be a matrix whose rows are covariate (feature) vectors of the responses. n
is the number of reponses and d
is the number of covariates.
set.seed(123)
n <- 6400
d <- 1e3
locs <- locs_gen_idp(n = n, d = d)
Generate y
. y
should be a vector representing the responses. sigmaSq0
, tauSq0
, and rSq0
are the true parameter values in the Matern kernel with smoothness 2.5.
set.seed(123)
sigmaSq0 <- 1
tauSq0 <- 0.05^2
rSq0 <- c(10, 5, 2, 1, 0.5, rep(0, d - 5))^2
covFn <- "matern25_scaledim_sqrelevance"
y <- MVN_gen(locs = locs, parms = c(sigmaSq0, rSq0, tauSq0), covFn = covFn)
y <- y - mean(y)
Define the ratio of in-sample data to total data.
pIn <- 0.75
spltObj <- splt_locs_y(locs, y, pIn)
locsTrn <- spltObj$locs1
locsTst <- spltObj$locs2
yTrn <- spltObj$y1
yTst <- spltObj$y2
Define penalty function, the iterative adapative bridge penalty. dpen_fun
and ddpen_fun
are first and second order information.
pen_fun <- pen_avg_brdg
dpen_fun <- dpen_avg_brdg
ddpen_fun <- ddpen_avg_brdg
Optimization parameters. minPosi
is a small positive number to prevent var
, nugget
and the sum of sr
reaching zero. k
is the number of covariates selected each time. miniQCCD
and miniGrad
are batch sizes used in QCCD and variable selection. nAvg
is the kappa in the penalty function. m
is the conditioning set size in Vecchia approximation. mini
is whether to use mini-batch subsampling or not. taper
is whether to use the iterative adaptive bridge penalty or the bridge penalty. lambVec
stores the lambda values for model estimation on the regularization path. stop_con_path
is the stopping conditions based on 1% improvement of the OOS score and the inclusion of new covariate.
var0 <- 1
tau0 <- 1e-4
sr0 <- 1e-2
minPosi <- 1e-8
m <- 100
k <- 3
miniQCCD <- 128
miniGrad <- 128
nAvg <- 2
lambVec <- 2^(rev(-6 : floor(log2(n))))
convQCCD <- 1e-4
convCCD <- 1e-4
cAmij <- 1e-4
maxIterQCCD <- 200
maxIterCCD <- 40
covFn <- "matern25_scaledim_sqrelevance"
mini <- T
taper <- T
silent <- F
stop_con_path <- function(scrVec, idxSet, thetaSet){
min(stop_con_NEW_path(scrVec, idxSet, thetaSet),
stop_con_OOS1_path(scrVec, idxSet, thetaSet))
}
Call the function Vecc_path
for the ''VGPR'' algorithm. pIn
here specifies the ratio of the validation dataset to total training dataset, specifically, 1 - pIn
. OOS_rmse
is the function for computing the OOS RMSE score. stop_con_OOS1_fb
is the stopping conditions based on 1% improvement of the OOS score.
bgnTime <- Sys.time()
VeccObj <-
Vecc_path(
var0 = var0,
tau0 = tau0,
sr0 = sr0,
locs = locsTrn,
y = yTrn,
m = m,
k = k,
lambVec = lambVec,
pen_fun = pen_fun,
dpen_fun = dpen_fun,
ddpen_fun = ddpen_fun,
OOS_score = OOS_rmse,
stop_con_path = stop_con_path,
stop_con_fb = stop_con_OOS1_fb,
miniQCCD = miniQCCD,
miniGrad = miniGrad,
pIn = pIn,
spltSeed = 123,
convQCCD = convQCCD,
convCCD = convCCD,
cAmij = cAmij,
maxIterQCCD = maxIterQCCD,
maxIterCCD = maxIterCCD,
covFn = covFn,
minPosi = minPosi,
mini = mini,
taper = taper,
silent = silent,
nAvg = nAvg
)
endTime <- Sys.time()
timeUsed <- as.numeric(endTime - bgnTime, units = "secs")
cat("Vecchia solution path used", timeUsed, "seconds\n")
Evaluate posterior prediction performance.
VeccScr <- OOS_rmse(VeccObj$theta, locsTrn, locsTst, yTrn, yTst, m, covFn)
cat("Vecchia solution path selected: ", VeccObj$idx, "covariates\n")
cat("Vecchia solution path estimated theta:",
VeccObj$theta[c(1, VeccObj$idx + 1, d + 2)], "\n")
cat("Vecchia solution path RMSE score:", VeccScr, "\n")
cat("Vecchia solution path FPR:",
length(setdiff(VeccObj$idx, 1 : 5)) / length(VeccObj$idx), "\n")
Cao, Guinness, Genton, and Katzfuss (2022). Scalable Gaussian-process regression and variable selection using Vecchia approximations.