bayesmix

Primary language: C++. License: BSD 3-Clause "New" or "Revised" (BSD-3-Clause).

bayesmix is a C++ library for running MCMC simulation in Bayesian mixture models.

Current state of the software:

  • bayesmix performs inference for mixture models in which the mixing measure P is either the Dirichlet process or the Pitman--Yor process.

  • We currently support univariate and multivariate location-scale mixtures of Gaussian densities.

  • Inference is carried out using either Algorithm 2 or Algorithm 8 in Neal (2000).

  • MCMC chains can be serialized using Google's Protocol Buffers.
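
For reference, the mixture model mentioned in the first item above is commonly written as below. The kernel k, the component parameter θ, and the sample size n are notation assumed here for illustration; they are not taken from the text above.

```latex
y_1, \dots, y_n \mid P \overset{\text{iid}}{\sim} f(\cdot)
  = \int k(\cdot \mid \theta)\, P(\mathrm{d}\theta)
```

For a location-scale mixture of Gaussian densities, k would be the normal density and θ its mean and variance (or covariance matrix in the multivariate case).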

Installation:

We heavily depend on Google's Protocol Buffers, so make sure to install it beforehand!

On a Linux machine, the following commands install the library:

sudo apt-get install autoconf automake libtool curl make g++ unzip
wget https://github.com/protocolbuffers/protobuf/releases/download/v3.14.0/protobuf-python-3.14.0.zip
unzip protobuf-python-3.14.0.zip
cd protobuf-3.14.0/
./configure --prefix=/usr
make check
sudo make install
sudo ldconfig # refresh shared library cache.

On Mac and Windows machines, follow the official Protocol Buffers installation guide.

Finally, to work with bayesmix just clone the repository with

git clone --recurse-submodules git@github.com:bayesmix-dev/bayesmix.git

To run the executable:

mkdir build
cd build
cmake ..
make run
cd ..
./build/run

To run unit tests:

cd build
cmake ..
make test_bayesmix
./test/test_bayesmix

For Developers

Please install the pre-commit hooks before committing anything; they clear the output of Jupyter notebooks. Run:

./bash/setup_pre_commit.sh

Future steps (contributors are welcome!)

A Python package is already under development

  • Extension to normalized random measures
  • Using HMC / MALA MCMC algorithms to sample from the cluster-specific full conditionals when the likelihood is not conjugate to the base measure
  • R package

Cluster estimate

This library can also compute a cluster estimate from an MCMC chain. The estimate minimises the posterior expected loss for a chosen loss function, using a greedy algorithm. Source files are in the folder src/clustering.
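
As an illustration of posterior expected loss minimisation, here is a minimal sketch using the Binder loss with unit costs. It is not the code in src/clustering: the names `Partition`, `binder_loss`, `expected_loss` and `best_draw` are hypothetical, the one-label-per-observation layout is an assumption, and the search simply picks the best partition among those visited by the chain instead of running a greedy algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Assumed layout: one integer cluster label per observation.
using Partition = std::vector<int>;

// Binder loss with unit costs: the number of pairs (i, j), i < j, that are
// clustered together in one partition but apart in the other.
double binder_loss(const Partition& a, const Partition& b) {
  assert(a.size() == b.size());
  double loss = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    for (std::size_t j = i + 1; j < a.size(); ++j) {
      const bool together_a = (a[i] == a[j]);
      const bool together_b = (b[i] == b[j]);
      if (together_a != together_b) loss += 1.0;
    }
  }
  return loss;
}

// Monte Carlo estimate of the posterior expected loss of `candidate`:
// the average loss against every partition visited by the chain.
double expected_loss(const Partition& candidate,
                     const std::vector<Partition>& chain) {
  assert(!chain.empty());
  double total = 0.0;
  for (const Partition& draw : chain) total += binder_loss(candidate, draw);
  return total / static_cast<double>(chain.size());
}

// Simplified search (not a greedy algorithm): index of the visited
// partition with the smallest estimated expected loss.
std::size_t best_draw(const std::vector<Partition>& chain) {
  assert(!chain.empty());
  std::size_t best = 0;
  double best_loss = std::numeric_limits<double>::infinity();
  for (std::size_t k = 0; k < chain.size(); ++k) {
    const double loss = expected_loss(chain[k], chain);
    if (loss < best_loss) {
      best_loss = loss;
      best = k;
    }
  }
  return best;
}
```

Because the loss only compares which pairs are clustered together, relabelling the clusters of a partition does not change its loss.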

To run the code:

cd build
cmake ..
make run_pe
./run_pe filename_in filename_out loss Kup

where:

  • filename_in is the input file containing the MCMC chain (a file in which values are separated by spaces)
  • filename_out is the output file to which the cluster estimate will be written
  • loss specifies the loss function: 0 for Binder loss, 1 for variation of information, 2 for normalized variation of information
  • Kup is the maximum number of clusters (Kup=N is usually a good choice for a dataset of N observations)
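
To make loss options 1 and 2 above concrete, the following sketch computes the variation of information between two partitions from their label counts. The function names are hypothetical and this is not the library's implementation. In particular, the normalisation by log N is only one common choice and is an assumption here; the library may normalise differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Entropy (natural logarithm) of a table of counts that sum to n.
template <typename Key>
double entropy(const std::map<Key, std::size_t>& counts, double n) {
  double h = 0.0;
  for (const auto& kv : counts) {
    const double p = static_cast<double>(kv.second) / n;
    h -= p * std::log(p);
  }
  return h;
}

// Variation of information: VI(a, b) = 2 H(a, b) - H(a) - H(b),
// where a and b hold one cluster label per observation.
double variation_of_information(const std::vector<int>& a,
                                const std::vector<int>& b) {
  assert(a.size() == b.size() && !a.empty());
  std::map<int, std::size_t> count_a, count_b;
  std::map<std::pair<int, int>, std::size_t> count_ab;
  for (std::size_t i = 0; i < a.size(); ++i) {
    ++count_a[a[i]];
    ++count_b[b[i]];
    ++count_ab[std::make_pair(a[i], b[i])];
  }
  const double n = static_cast<double>(a.size());
  return 2.0 * entropy(count_ab, n) - entropy(count_a, n) -
         entropy(count_b, n);
}

// Assumed normalisation: divide by log N, the largest value VI can take
// for N observations. Returns 0 when there are fewer than two observations.
double normalized_vi(const std::vector<int>& a, const std::vector<int>& b) {
  if (a.size() < 2) return 0.0;
  return variation_of_information(a, b) /
         std::log(static_cast<double>(a.size()));
}
```

Identical partitions, including relabelled copies of the same partition, have a variation of information of zero.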

Credible balls can also be computed to quantify the uncertainty of a cluster estimate. To run the credible-ball code:

cd build
cmake ..
make run_cb
./run_cb filename_mcmc filename_pe filename_out loss rate

where:

  • filename_mcmc is the file containing the MCMC chain.
  • filename_pe is the file containing the cluster estimate.
  • filename_out is the file to which the result will be written.
  • loss specifies the loss function: 0 for Binder loss, 1 for variation of information, 2 for normalized variation of information.
  • rate must be > 0. The smaller it is, the longer the program runs.
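
One way to picture a credible ball is as the smallest radius around the cluster estimate that captures a given share of the sampled partitions, with distance measured by the chosen loss. The sketch below assumes the distances from the estimate to each draw are already computed. The function name and the `level` parameter are hypothetical, and how `rate` enters the actual run_cb computation is not specified above, so it is not modelled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Smallest radius such that at least a fraction `level` of the draws lies
// within that distance of the cluster estimate.
// `distances[m]` is the distance between the estimate and the m-th draw.
double credible_ball_radius(std::vector<double> distances, double level) {
  assert(!distances.empty());
  assert(level > 0.0 && level <= 1.0);
  std::sort(distances.begin(), distances.end());
  const double m = static_cast<double>(distances.size());
  // Number of draws needed for their share of the chain to reach `level`.
  const std::size_t needed = static_cast<std::size_t>(std::ceil(level * m));
  return distances[std::max<std::size_t>(needed, 1) - 1];
}
```

A larger radius indicates that the posterior mass is spread over partitions far from the estimate, that is, more uncertainty about the clustering.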

The directory src/clustering/R scripts contains scripts to generate MCMC chains for univariate and multivariate datasets.