bayesmix: A C++ repository from JoaoHenriqueOliveira

bayesmix is a C++ library for running MCMC simulation in Bayesian mixture models.

Current state of the software:

bayesmix performs inference for mixture models of the kind

$y_1, \ldots, y_n \sim \int k(\cdot \mid \theta) \Pi(d\theta)$

$\Pi \sim P$

Where P is either the Dirichlet process or the Pitman--Yor process.

We currently support univariate and multivariate location-scale mixture of Gaussian densities
Inference is carried out using either Algorithm 2 or Algorithm 8 in Neal (2000).
Serialization of the MCMC chains is possible using Google's protocol buffers

Installation:

We heavily depend on Google's Protocol Buffers, so make sure to install it beforehand!

On Linux machine the following will install the library

sudo apt-get install autoconf automake libtool curl make g++ unzip
wget https://github.com/protocolbuffers/protobuf/releases/download/v3.14.0/protobuf-python-3.14.0.zip
unizp protobuf-python-3.14.0.zip
cd protobuf-3.14.0/
./configure --prefix=/usr
make check
sudo make install
sudo ldconfig # refresh shared library cache.

On Mac and Windows machines, follow the official install guide (link)

Finally, to work with bayesmix just clone the repository with

git clone --recurse-submodule git@github.com:bayesmix-dev/bayesmix.git

To run the executable:

mkdir build
cd build
cmake ..
make run
cd ..
./build/run

To run unit tests:

cd build
cmake ..
make test_bayesmix
./test/test_bayesmix

For Developers

Please install the pre-commit hooks before commiting anything: it clears the output of jupyter notebooks. Just type

./bash/setup_pre_commit.sh

Future steps (contributors are welcome!)

A Python package is already under development

Extension to normalized random measures
Using HMC / MALA MCMC algorithm to sample from the cluster-specific full conditionals when it's not conjugate to the base measure
R package

Cluster estimate

This library provides a cluster estimates computation, given a mcmc chains. It is based on expected posterior loss minimisation given a loss function and using a greedy algorithm. Sources files are in the folder src/clustering.

To run the code :

cd build
cmake ..
make run_pe
./run_pe filename_in filename_out loss Kup

where :

filename_in is the entry filename that contains mcmc chain (a file in which values are separated with spaces)
filename_out is the out filename in which cluster estimate will be writen
loss is the specification of the loss function : 0 for binder loss, 1 for variation of information, 2 for normalized variation of information
Kup is the max number of clusters (usually Kup=N is a good entry if dataset has a length of N)

Credible balls computation is also available. This aims to quantify the uncertainty of a cluster estimate. To run the credible balls code :

cd build
cmake ..
make run_cb
./run_cb filename_mcmc filename_pe filename_out loss rate

where :

filename_mcmc is the filename in which there is the mcmc chain.
filename_pe is the filename in which there is the cluster estimate.
filename_out is the filename in which result will be writen
loss is the specification of the loss function : 0 for binder loss, 1 for variation of information, 2 for normalized variation of information
rate : has to be > 0. The smaller it is, the longer will run the program.

The directory src/clustering/R scripts contains some scripts to generate mcmc chains for univariate and multivariate datasets.

JoaoHenriqueOliveira/bayesmix

Installation:

For Developers

Future steps (contributors are welcome!)

Cluster estimate