This repository contains Python code for Bayesian Nonparametric Learning with a Dirichlet process prior. More details can be found in the paper below:
Fong, E., Lyddon, S. and Holmes, C. Scalable Nonparametric Sampling from Multimodal Posteriors with the Posterior Bootstrap. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1902.03175
To install the `npl` package, clone the repository and run `python3 setup.py develop`.
Although the setup installs the packages automatically, you may need to install `pystan` separately using pip if setuptools isn't working correctly. Please make sure the version of `pystan` is newer than v2.19.0.0 or the evaluate scripts may not work properly. The code has been tested on Python 3.6.7.
- The current implementation will use all cores available on the local machine. If this is undesired, pass the number of cores as `n_cores` to the functions `bootstrap_gmm` or `bootstrap_logreg` in the run scripts.
- If running on a multi-core machine, make sure to restrict `numpy` to use 1 thread per process so that `joblib` can parallelize without CPU oversubscription, with the bash command `export OPENBLAS_NUM_THREADS=1`.
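Equivalently, the thread limit can be set from inside Python, as long as it happens before `numpy` is first imported (a sketch; which variable applies depends on the BLAS backend `numpy` links against):

```python
import os

# Must run before numpy is imported anywhere in the process:
# BLAS libraries read these environment variables once, at load time.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS backend
os.environ["MKL_NUM_THREADS"] = "1"       # assumption: relevant instead if numpy links MKL

import numpy as np  # each joblib worker process now uses a single BLAS thread
```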
A directory overview is given below:
- `npl` - Contains the main functions for the posterior bootstrap and for evaluating posterior samples on test data. `bootstrap_logreg.py` and `bootstrap_gmm.py` contain the main posterior bootstrap sampling functions for generating the randomized weights and parallelizing. `maximise_logreg.py` and `maximise_gmm.py` contain functions for sampling the prior pseudo-samples, initialising random restarts and maximising the weighted log likelihood; these functions can be edited to use NPL with different models and priors. `./evaluate` contains functions for calculating log posterior predictives of the different posteriors.
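As context for these functions, a single posterior-bootstrap draw can be sketched on a toy model. The sketch below uses a one-dimensional normal location model (not one of the repository's models), where the weighted maximum-likelihood step has a closed form; `bootstrap_gmm` and `bootstrap_logreg` instead maximise numerically and parallelise over draws:

```python
import numpy as np

def posterior_bootstrap_mean(y, n_draws=100, c=1.0, n_pseudo=10, seed=0):
    """Toy posterior bootstrap for a normal mean: each draw mixes the data with
    prior pseudo-samples, draws DP-randomized Dirichlet weights, and takes the
    weighted MLE (for the mean, simply the weighted average)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty(n_draws)
    for b in range(n_draws):
        pseudo = rng.normal(0.0, 1.0, n_pseudo)  # pseudo-samples from a N(0, 1) prior centering
        z = np.concatenate([y, pseudo])
        # Dirichlet(1, ..., 1, c/T, ..., c/T): each data point gets unit weight,
        # the prior concentration c is split across the T pseudo-samples
        alpha = np.concatenate([np.ones(n), np.full(n_pseudo, c / n_pseudo)])
        w = rng.dirichlet(alpha)
        draws[b] = np.sum(w * z)  # weighted MLE of the location parameter
    return draws
```

With a large dataset the Dirichlet weights concentrate on the observations, so the draws cluster around the sample mean with prior influence controlled by `c`.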
- `experiments` - Contains scripts for running the main experiments
- `supp_experiments` - Contains scripts for running the supplementary experiments
- Download MNIST files from http://yann.lecun.com/exdb/mnist/.
- Extract and place them in `./samples`, so the folder contains the files:
  - `t10k-images-idx3-ubyte`
  - `t10k-labels-idx1-ubyte`
  - `train-images-idx3-ubyte`
  - `train-labels-idx1-ubyte`
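If the extracted files need to be inspected, the idx-ubyte format is simple to parse; below is a minimal reader (a sketch, not the repository's own loader), assuming the files have been decompressed as listed above:

```python
import struct
import numpy as np

def read_idx(path):
    """Read an MNIST idx-ubyte file (images or labels) into a numpy array."""
    with open(path, "rb") as f:
        # Header: two zero bytes, a dtype code (0x08 = unsigned byte),
        # then the number of dimensions
        _zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        assert dtype_code == 0x08, "expected unsigned-byte data"
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))  # big-endian uint32 sizes
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(dims)
```

For example, `read_idx("./samples/train-images-idx3-ubyte")` would yield an array of shape (60000, 28, 28).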
- Download the Adult, Polish companies bankruptcy 3rd year, and Arcene datasets from the UCI Machine Learning Repository, links below:
  - Adult - https://archive.ics.uci.edu/ml/datasets/adult
  - Polish - https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data
  - Arcene - https://archive.ics.uci.edu/ml/datasets/Arcene
- Extract and place all data files in `./data`, so the folder contains the files:
  - `3year.arff`
  - `adult.data`
  - `adult.test`
  - `arcene_train.data`
  - `arcene_train.labels`
  - `arcene_valid.data`
  - `arcene_valid.labels`
- Run `generate_gmm.py` to generate toy data. The files in `./sim_data_plot` are the train/test data used for the plots in the paper, and the files in `./sim_data` are the datasets for the tabular results.
- Run `run_NPL_toygmm.py` for the NPL example and `run_stan_toygmm.py` for the NUTS and ADVI examples.
- Run `evaluate_posterior_toygmm.py` to evaluate posterior samples. The Jupyter notebook `Plot bivariate KDEs for GMM.ipynb` can be used to produce posterior plots.
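For reference, a toy bivariate GMM dataset of the kind `generate_gmm.py` produces can be simulated in a few lines (the component means, noise scale and mixture proportions below are illustrative placeholders, not the script's actual settings):

```python
import numpy as np

def simulate_gmm(n=500, seed=0):
    """Draw n points from a 3-component bivariate Gaussian mixture
    (placeholder parameters, equal mixture weights)."""
    rng = np.random.default_rng(seed)
    means = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
    z = rng.integers(0, len(means), size=n)            # component assignments
    x = means[z] + rng.normal(0.0, 0.5, size=(n, 2))   # isotropic noise around each mean
    return x, z
```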
- Run `run_NPL_MNIST.py` for the NPL example and `run_stan_MNIST.py` for the NUTS and ADVI examples.
- Run `evaluate_posterior_MNIST.py` to evaluate posterior samples. The Jupyter notebook `Plot MNIST KDE.ipynb` can be used to produce posterior plots.
- Run `load_data.py` to preprocess the data and generate different train-test splits.
- Run `run_NPL_logreg.py` for the NPL example and `run_stan_logreg.py` for the NUTS and ADVI examples.
- Run `evaluate_posterior_logreg.py` to evaluate posterior samples. The Jupyter notebook `Plot marginal KDE (for Adult).ipynb` can be used to produce posterior plots.
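The evaluation step amounts to computing a log posterior predictive on held-out data; for logistic regression it can be sketched as below (hypothetical array shapes; `evaluate_posterior_logreg.py` implements the repository's version):

```python
import numpy as np

def log_posterior_predictive(thetas, X, y):
    """Sum over test points of log( (1/B) * sum_b p(y_i | x_i, theta_b) ).

    thetas: (B, d) posterior samples; X: (n, d) test covariates; y: (n,) labels in {0, 1}.
    """
    logits = X @ thetas.T                                  # (n, B)
    # Pointwise Bernoulli log-likelihoods, computed stably via logaddexp
    ll = y[:, None] * logits - np.logaddexp(0.0, logits)   # (n, B)
    # Log-mean-exp over posterior samples for each test point, then sum over points
    return float((np.logaddexp.reduce(ll, axis=1) - np.log(thetas.shape[0])).sum())
```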
- Covariate data is not included for privacy reasons. Run `load_data.py` to generate simulated covariates from Normal(0,1) (uncorrelated, unlike the real data) and pseudo-phenotypes.
- Run `run_NPL_genetics.py` for the NPL example.
- The Jupyter notebook `Plotting Sparsity Plots.ipynb` can be used to produce sparsity plots.
- The Jupyter notebook `Normal location model.ipynb` contains all experiments and plots.
- Run `generate_gmm.py` to generate toy data. The files in `./sim_data_plot` are the train/test data used for the plots in the paper.
- Run `run_NPL_toygmm.py` for the NPL example (note that the MDP example will be run too) and `run_IS_toygmm.py` for the importance sampling example.
- Run `evaluate_posterior_toygmm.py` to evaluate posterior samples on test data.
- Run `generate_gmm.py` to generate toy data. The files in `./sim_data_plot` are the train/test data used for the plots in the paper, and the files in `./sim_data` are for the tabular results.
- First run `run_stan_toygmm.py` to generate the NUTS (required for MDP-NPL) and ADVI samples, then run `run_NPL_toygmm.py` for MDP-NPL and DP-NPL (note that the IS example will be run too).
- Run `evaluate_posterior_toygmm.py` to evaluate posterior samples on test data. The Jupyter notebook `Plot bivariate KDEs for GMM.ipynb` can be used to produce posterior plots.