snbvbs

Code for "Scalable Bayesian Variable Selection Regression Models for Count Data" by Miao et al. (2019), in Flexible Bayesian Regression Modelling, Fan, Y. et al. (Eds.), Elsevier, 187-219.


Scalable Bayesian Variable Selection for Negative Binomial Regression Models

Please cite the following paper if you find the package useful.

Miao, Y., Kook, J.H., Lu, Y., Guindani, M. and Vannucci, M. (2019). Scalable Bayesian Variable Selection Regression Models for Count Data. In Fan, Y., Smith, M., Nott, D. and Dortet-Bernadet, J.-L. (Eds.), Flexible Bayesian Regression Modelling, Elsevier, 187-219.

Introduction

We focus on Bayesian variable selection methods for regression models for count data, specifically on the negative binomial linear regression model. We first formulate a Bayesian hierarchical model with a variable selection spike-and-slab prior. For posterior inference, we review standard MCMC methods and investigate a computationally more efficient approach using variational inference. We also compare the performance of the spike-and-slab prior with that of an adaptive shrinkage prior, the horseshoe prior.

The negative binomial regression model is specified as follows:

$$
\begin{align}
\begin{split}
y_{i} \mid r, \psi_{i} &\sim\text{NB}\left(r,\frac{\exp (\psi_{i})}{1+\exp (\psi_{i})}\right), \\
\psi_{i}  & =\beta_{0}+{\boldsymbol{x}}_{i}^{T}{\boldsymbol{\beta}}. 
\end{split}
\end{align}
$$
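For intuition, here is a minimal numpy sketch of sampling counts from this likelihood; the parameter values are illustrative only. Note that numpy parameterizes the negative binomial by the complementary success probability, so we pass 1 - prob below.

```python
# A minimal, illustrative sketch of the NB regression likelihood (not package code).
import numpy as np

rng = np.random.default_rng(42)
n, p = 5, 3
r = 2.0                                        # NB dispersion parameter
beta0, beta = 0.5, np.array([1.0, 0.0, -1.0])  # illustrative coefficients
X = rng.normal(size=(n, p))

psi = beta0 + X @ beta                    # linear predictor
prob = np.exp(psi) / (1.0 + np.exp(psi))  # success probability in the paper's notation
# numpy's negative_binomial(n, p) counts failures before n successes with success
# probability p, so we pass 1 - prob to match the model above; E[y_i] = r * exp(psi_i).
y = rng.negative_binomial(r, 1.0 - prob)
```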

We implement both the sparsity-inducing spike-and-slab prior and the adaptive shrinkage horseshoe prior.

Under the spike-and-slab setting, the hierarchical priors are specified as:

$$
\begin{align}
\beta_{k}\mid\gamma_{k} & \sim\gamma_{k}\underbrace{\text{Normal}\left(0,\sigma_{\beta}^{2}\right)}_{\text{slab}}+\left(1-\gamma_{k}\right)\underbrace{\delta_{0}}_{\text{spike}} && \text{ where }k\in\left\{ 1,2,\ldots,p\right\}, \nonumber \\
\gamma_{k} & \sim\text{Bernoulli}\left(\pi\right) && \text{ where }\pi\in\left[0,1\right]. \nonumber
\end{align}
$$
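As a quick illustration, prior draws of the coefficients can be generated as follows (a minimal sketch; the values of pi and sigma_beta are assumptions, not package defaults):

```python
# Illustrative draw from the spike-and-slab prior.
import numpy as np

rng = np.random.default_rng(0)
p = 50
pi = 0.1          # prior inclusion probability (assumed value)
sigma_beta = 1.0  # slab standard deviation (assumed value)

gamma = rng.binomial(1, pi, size=p)            # gamma_k ~ Bernoulli(pi)
beta = gamma * rng.normal(0.0, sigma_beta, p)  # slab if gamma_k = 1, else a point mass at 0
```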

Under the horseshoe setting, the hierarchical priors are specified as:

$$
\begin{align}
\left[\beta_{k}\mid\lambda_{k}\right] & \overset{\text{indep}}{\sim}\text{Normal}\left(0,\lambda_{k}^{2}\right),\nonumber \\
\left[\lambda_{k}\mid A\right] & \overset{\text{iid}}{\sim}C^{+}\left(0,A\right),\nonumber\\
A & \sim\text{Uniform}\left(0,10\right).\nonumber 
\end{align}
$$
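An analogous sketch for the horseshoe hierarchy, using the fact that a half-Cauchy C+(0, A) variate is the absolute value of a Cauchy(0, A) variate (again with illustrative values):

```python
# Illustrative draw from the horseshoe prior hierarchy.
import numpy as np

rng = np.random.default_rng(0)
p = 50

A = rng.uniform(0.0, 10.0)                # A ~ Uniform(0, 10)
lam = A * np.abs(rng.standard_cauchy(p))  # lambda_k ~ C+(0, A)
beta = rng.normal(0.0, lam)               # beta_k ~ Normal(0, lambda_k^2)
```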

The priors on the other parameters are given as

$$
\begin{align}
\beta_{0} & \sim \text{Normal}\left(0,\tau_{\beta_{0}}^{-1}\right) && \text{ where }\tau_{\beta_{0}}^{-1}=\sigma_{\beta_{0}}^{2}, \nonumber \\
r & \sim\text{Gamma}\left(a_{r},b_{r}\right), \nonumber\\
\sigma_{\beta}^{2}&\sim\text{Scaled-Inv-}\chi^{2}\left(\nu_{0},\sigma_{0}^{2}\right)\nonumber.
\end{align}
$$
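These can also be sampled directly; in particular, a Scaled-Inv-chi^2(nu_0, sigma_0^2) draw is nu_0 * sigma_0^2 divided by a chi^2 draw with nu_0 degrees of freedom. A hedged sketch with assumed hyperparameter values:

```python
# Illustrative draws for the remaining priors; hyperparameter values are assumptions.
import numpy as np

rng = np.random.default_rng(0)
a_r, b_r = 2.0, 1.0    # Gamma shape and rate (assumed values)
nu0, s0_sq = 4.0, 1.0  # Scaled-Inv-chi^2 hyperparameters (assumed values)
sigma_beta0 = 1.0      # prior standard deviation of the intercept (assumed value)

beta0 = rng.normal(0.0, sigma_beta0)              # beta_0 ~ Normal(0, sigma_beta0^2)
r = rng.gamma(shape=a_r, scale=1.0 / b_r)         # r ~ Gamma(a_r, b_r), b_r as a rate
sigma_beta_sq = nu0 * s0_sq / rng.chisquare(nu0)  # Scaled-Inv-chi^2(nu0, s0^2)
```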

The directed graphical model under the spike-and-slab setting is shown below.

[Figure: NegBinGraph, the directed graphical model of the hierarchical negative binomial model]

Installation

Prepare environment

Our code is written in C/C++ and uses pybind11 to expose the C++ code to Python. We use the Eigen and GSL libraries for fast linear algebra and random number generation. Therefore, to use our code, you need to install a few dependencies and set up the necessary environment. Below is the list of software to install before running our code.

Configuring these environments correctly can be a pain, but don't worry: I will walk you through the process in detail. The following tutorial uses Ubuntu as an example. For other Unix-like systems such as macOS or Red Hat, the process should be similar. Setting up the environment on Windows is more involved, and you might need to install Microsoft Visual Studio and its package manager vcpkg. However, I am currently working on an R package that will hopefully solve the compatibility issues with Windows. Please visit us again and check out our R package in the future.

Install Python

You can install either Python 2 or Python 3 using Anaconda or Miniconda.

Install the pybind11

pip install pybind11

Install gcc and g++

sudo apt-get install build-essential

Install git

sudo apt install git

Install cmake

Check the answer here for an alternative way.

sudo apt-get install software-properties-common
sudo add-apt-repository ppa:george-edison55/cmake-3.x
sudo apt-get update
sudo apt-get install cmake

Install GSL

sudo apt-get install libgsl-dev

Install Eigen

sudo apt-get install libeigen3-dev

Compile csnbvbs

cmake . -DCMAKE_BUILD_TYPE=Release
make

You will find a file named like csnbvbs.cpython-36m-x86_64-linux-gnu.so in the current directory (the exact name depends on your Python version and platform). Copy the csnbvbs file to the directory that you are working in, and you are ready to import it as a regular Python package. Congratulations!
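As a quick smoke test, you can check that the module loads and exposes the four samplers listed in the next section (assuming the compiled .so sits in your working directory):

```python
# Minimal smoke test for the compiled extension; run it next to the .so file.
import csnbvbs

# The module should expose NegBinHS, NegBinSSMCMC, NegBinSSVIEM, and parNegBinSSVIIS.
print(dir(csnbvbs))
```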

Example

There are four modules from this package:

  • NegBinHS (MCMC version with the horseshoe prior)
  • NegBinSSMCMC (MCMC version with the spike-and-slab prior)
  • NegBinSSVIEM (Variational Inference EM with the spike-and-slab prior)
  • parNegBinSSVIIS (Variational Inference EM and Importance Sampling with the spike-and-slab prior)

You can copy the csnbvbs file to the scripts folder and run the following benchmark tests among the four methods considered:

Simulate data with n = 200 and p = 50 at various model sparsity levels and feature correlation levels rho.

python py_simulation.py
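The exact data-generating code is in py_simulation.py; conceptually, a run of this kind might look like the following sketch, where the AR(1)-style correlation structure and the number of causal features are assumptions for illustration.

```python
# Conceptual sketch of the simulation setup (not py_simulation.py itself).
import numpy as np

rng = np.random.default_rng(1)
n, p, rho, n_causal = 200, 50, 0.6, 5

# AR(1)-style correlation among features: corr(x_j, x_k) = rho^|j - k| (assumed).
cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# Sparse coefficient vector: only n_causal features carry signal.
beta = np.zeros(p)
beta[rng.choice(p, n_causal, replace=False)] = rng.normal(0.0, 1.0, n_causal)

r, beta0 = 1.0, 0.0
psi = beta0 + X @ beta
y = rng.negative_binomial(r, 1.0 / (1.0 + np.exp(psi)))  # E[y_i] = r * exp(psi_i)
```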

After running py_simulation.py, you will find a simulation folder containing 100 examples per subfolder, with one subfolder for each value of rho. You can then perform the benchmark study by running the following Python scripts:

python py_ss_mcmc_benchmark.py # spike-and-slab MCMC sampling
python py_hs_benchmark.py      # horseshoe MCMC sampling
python py_ss_viem_benchmark.py # variational inference EM
python py_ss_viss_benchmark.py # variational inference EM with importance sampling