rcgpar - Fit mixture models in HPC environments

rcgpar provides MPI and OpenMP implementations of a variational inference algorithm for estimating mixture model components from a likelihood matrix in parallel.

Installation

Compiling from source

Clone the repository to a suitable folder, enter the directory and run

mkdir build
cd build

... and follow the instructions below.

OpenMP

In the build/ directory, run

cmake ..
make

This creates the librcgomp library in build/lib/.

MPI

You will need to use the appropriate platform-specific commands to set up your MPI environment. For example, to set up rcgpar using Open MPI, enter the build/ directory and run

module load mpi/openmpi
cmake -DCMAKE_ENABLE_MPI_SUPPORT=1 -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx ..
make

This creates the librcgmpi library in build/lib/. If OpenMP is also supported, the librcgomp library will also be created.

librcgmpi is compiled by default to support up to 1024 processes. If you need more, recompile the project with -DCMAKE_MPI_MAX_PROCESSES=<big number> added to the cmake command.

Hybrid OpenMP + MPI

librcgmpi automatically provides hybrid OpenMP + MPI parallelization when the library is compiled on a system that supports both protocols.

Compiling and running tests

rcgpar uses the googletest framework to test the libraries. To build the tests, compile the project in debug mode by appending the -DCMAKE_BUILD_TESTS=1 flag to the cmake call. The tests are created in build/bin/, and all tests except the MPI test can be run from the runUnitTests executable.

Note: you will need to use mpirun (or some other appropriate call) to run the MPI test from the executable runMpiTest.

Usage

Simply include the rcgpar.hpp header in your project. This header provides two functions: 'rcgpar::rcg_optl_omp' for OpenMP parallelization and 'rcgpar::rcg_optl_mpi' for MPI (+OpenMP, if enabled) parallelization.

rcg_optl_omp and rcg_optl_mpi

These two functions perform the actual model fitting. Both 'rcg_optl_omp' and 'rcg_optl_mpi' have to be called with the following arguments:

const rcgpar::Matrix<double> &logl:
    KxN row-major order matrix containing the log-likelihoods for the
    observations, where K is the number of components and N is the number
    of observations.
const std::vector<double> &log_times_observed:
    N-dimensional vector whose n:th entry is the natural logarithm of the
    number of times that the n:th observation (column) in `logl` should be
    counted. Useful if many observations in the log-likelihood matrix are
    identical: they can be compressed into one and counted several times
    via this argument.
const std::vector<double> &alpha0:
    K-dimensional vector containing the prior parameters of the Dirichlet
    distribution that is used as a conjugate prior in the model. A good
    default choice is to set all entries to 1.
const double &tol:
    The estimation process will terminate once the evidence lower bound
    (ELBO) changes by less than this value from one iteration to the next.
    Values between 1e-8 and 1e-6 are good choices; adjust according to
    your needs.
const uint16_t maxiters:
    Maximum number of iterations to run the optimizer for if the tolerance
    criterion is not fulfilled.
std::ostream &log:
    Print status messages here. Silence the messages by supplying a
    std::ofstream that has not been assigned to any file.
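
The compression that log_times_observed enables can be sketched as follows. This is a minimal illustration and not part of rcgpar: the helper compress_observations and the struct CompressedObservations are hypothetical names, and the sketch assumes each observation is available as a vector of its K log-likelihoods and that duplicates are exact (bitwise equal) copies.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical container (not an rcgpar type): the unique observations and
// the natural log of how many times each one occurred in the input.
struct CompressedObservations {
    std::vector<std::vector<double>> unique_obs;  // one entry per unique observation
    std::vector<double> log_times_observed;       // log(count), same order as unique_obs
};

// Hypothetical helper (not an rcgpar function): collapses exact duplicates.
// Each element of `observations` holds the K log-likelihoods of one observation.
CompressedObservations compress_observations(
        const std::vector<std::vector<double>> &observations) {
    // Count exact duplicates; std::map keeps the unique observations sorted.
    std::map<std::vector<double>, std::size_t> counts;
    for (const std::vector<double> &obs : observations) {
        ++counts[obs];
    }

    CompressedObservations out;
    for (const auto &kv : counts) {
        out.unique_obs.push_back(kv.first);
        out.log_times_observed.push_back(std::log(static_cast<double>(kv.second)));
    }
    return out;
}
```

The unique observations would still have to be arranged into the KxN layout described above before being passed on, with log_times_observed supplied in the same observation order.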

The optimizers return a KxN rcgpar::Matrix<double> type row-major order matrix, where each column is a probability vector assigning the corresponding observation to the K mixture components.

Note: rcg_optl_mpi assumes that the root process holds the full 'logl' and 'log_times_observed' values, which are then distributed from the root process to the other processes. In contrast, 'alpha0', 'tol', and 'maxiters' are assumed to be present on all processes when calling rcg_optl_mpi.
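
To give a feel for what distributing the observations from the root process can involve, the sketch below computes a near-even split of N observations over P processes, in the form of the per-process counts and displacements that a scatter operation typically needs. This is only an assumption about one common partitioning scheme; rcgpar's actual partitioning may differ, and the names Partition and even_partition are hypothetical.

```cpp
#include <vector>

// Hypothetical type (not from rcgpar): how many items each process receives
// and where each process's chunk starts in the root's full array.
struct Partition {
    std::vector<int> counts;
    std::vector<int> displs;
};

// Hypothetical helper: splits n_items as evenly as possible over n_procs.
// The first (n_items % n_procs) processes receive one extra item.
Partition even_partition(int n_items, int n_procs) {
    Partition p;
    p.counts.resize(n_procs);
    p.displs.resize(n_procs);
    const int base = n_items / n_procs;
    const int remainder = n_items % n_procs;
    int offset = 0;
    for (int rank = 0; rank < n_procs; ++rank) {
        p.counts[rank] = base + (rank < remainder ? 1 : 0);
        p.displs[rank] = offset;
        offset += p.counts[rank];
    }
    return p;
}
```

With 10 observations and 4 processes, this yields chunks of 3, 3, 2 and 2 observations starting at offsets 0, 3, 6 and 8.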

mixture_components

Use 'rcgpar::mixture_components' to transform the matrix from rcg_optl_omp/mpi into a probability vector containing the relative contributions of each mixture component. 'mixture_components' takes the following input arguments:

const rcgpar::Matrix<double> &probs:
    The matrix returned from either rcg_optl_omp or rcg_optl_mpi.
const std::vector<double> &log_times_observed:
    The N-dimensional vector of log times observed that was used
	as input to the call to rcg_optl_omp or rcg_optl_mpi.

'mixture_components' will return a K-dimensional probability vector containing the mixture component proportions.
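
The idea behind this transformation can be illustrated with a stand-alone weighted average: each observation's assignment probabilities are weighted by the number of times the observation was counted, and the result is normalized by the total count. The sketch below is not rcgpar's implementation. The function weighted_proportions is a hypothetical name, a plain vector of vectors stands in for rcgpar::Matrix, and the probabilities are assumed to be on the linear (not logarithmic) scale.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only, not rcgpar's implementation.
// probs[k][n] is assumed to be the probability (linear scale) that
// observation n belongs to component k, stored as K rows of length N.
std::vector<double> weighted_proportions(
        const std::vector<std::vector<double>> &probs,
        const std::vector<double> &log_times_observed) {
    // Total number of observations after undoing the compression.
    double total = 0.0;
    for (double log_count : log_times_observed) {
        total += std::exp(log_count);
    }

    // One proportion per mixture component.
    std::vector<double> theta(probs.size(), 0.0);
    for (std::size_t k = 0; k < probs.size(); ++k) {
        for (std::size_t n = 0; n < log_times_observed.size(); ++n) {
            theta[k] += probs[k][n] * std::exp(log_times_observed[n]);
        }
        theta[k] /= total;
    }
    return theta;
}
```

If every observation's probabilities sum to 1 over the components, the returned proportions also sum to 1.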

Creating the input matrix

rcgpar requires the input log-likelihood matrix to be stored in the internal rcgpar::Matrix class. If your input log-likelihoods are stored in a flattened vector, you can construct the input object for rcg_optl_omp/mpi with the constructor:

Matrix<double>(std::vector<double> &flattened_logl,
               uint16_t n_mixture_components, uint32_t n_observations)

If your data is stored in a 2D vector, use the following constructor:

Matrix<double>(std::vector<std::vector<double>> &logl_2D)

Note that both constructors assume the data is stored in row-major order.
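
For reference, row-major order means that the elements of each row are stored contiguously, one row after another. The sketch below shows the layout with two hypothetical helpers (flatten_row_major and row_major_index are illustrative names, not part of rcgpar), assuming all rows of the 2D vector have the same length.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: flattens a 2D vector (rows of equal length) into
// row-major order by appending the rows one after another.
std::vector<double> flatten_row_major(const std::vector<std::vector<double>> &mat) {
    std::vector<double> flat;
    if (mat.empty()) {
        return flat;
    }
    const std::size_t n_cols = mat[0].size();
    flat.reserve(mat.size() * n_cols);
    for (const std::vector<double> &row : mat) {
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}

// Position of element (row, col) in a row-major flattened matrix
// that has n_cols columns.
std::size_t row_major_index(std::size_t row, std::size_t col, std::size_t n_cols) {
    return row * n_cols + col;
}
```

For a log-likelihood matrix with one row per mixture component, n_cols is the number of observations, so all values belonging to the first component come first in the flattened vector.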

License

The source code from this project is subject to the terms of the LGPL-2.1 license. A copy of the LGPL-2.1 license is supplied with the project, or can be obtained at https://opensource.org/licenses/LGPL-2.1.
