MVNclust clusters numerical data in higher dimensions using multivariate Gaussian mixtures. The mixture components have equal volume but may differ in shape and orientation. This is achieved through an eigenvalue decomposition of the covariance matrix, following Celeux and Govaert, Pattern Recognition, 1995.
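The equal-volume constraint can be written down explicitly. As a sketch in the notation commonly used for the Celeux and Govaert parametrization (the symbols are not taken from the MVNclust source), the covariance matrix of component k in d dimensions is decomposed as:

```latex
\Sigma_k = \lambda \, D_k A_k D_k^{\top},
\qquad \lambda = |\Sigma_k|^{1/d},
\qquad |A_k| = 1
```

Here λ is a scalar shared by all components that fixes the volume, D_k is the orthogonal matrix of eigenvectors that sets the orientation, and A_k is a diagonal matrix with unit determinant, proportional to the eigenvalues, that sets the shape.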
MVNclust uses an Expectation-Maximization (EM) algorithm to maximize the likelihood of a multivariate Gaussian mixture model with a predefined number of mixture components. It returns the parameters of the mixture components, the maximized log-likelihood, and the Bayesian Information Criterion (BIC), which allows comparison between runs that differ in the number of mixture components.
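To make the model comparison concrete, the sketch below computes a BIC from a maximized log-likelihood. Both the sign convention (lower is better here) and the parameter count for the equal-volume model are assumptions made for illustration; MVNclust may report the BIC under a different convention, and the function names are hypothetical.

```python
import math


def n_free_params(k, d):
    """Free parameters of a k-component, d-dimensional Gaussian mixture whose
    covariances share one volume but keep their own shape and orientation.
    This count is an assumption, not taken from the MVNclust source."""
    weights = k - 1                 # mixing proportions sum to one
    means = k * d                   # one d-dimensional mean per component
    # k full covariances, minus the k - 1 volumes that are tied together
    covariances = k * d * (d + 1) // 2 - (k - 1)
    return weights + means + covariances


def bic(log_lik, n_params, n_samples):
    """BIC in the 'lower is better' convention: -2 ln L + p ln n."""
    return -2.0 * log_lik + n_params * math.log(n_samples)
```

With this convention, one would fit the data for several values of k, evaluate `bic(log_lik, n_free_params(k, d), n_samples)` for each run, and prefer the k with the smallest value.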
MVNclust heavily uses the GNU Scientific Library (GSL), primarily for linear algebra tasks.
It also includes a simulator of data from multivariate Gaussian mixtures, primarily for testing purposes.
Please note that this is primarily an experimental repository, not intended for production use.
There are two ways to compile MVNclust, depending on whether the GSL is available as a system library at version ≥ 2.3.
On an Ubuntu-style system, you can check this by running:
apt search libgsl-dev
If libgsl is available at version ≥ 2.3 and you have root permissions on the system, you can run:
sudo apt install libgsl-dev
to install the library. If this is successful, you can clone this repository and make the mvnclust binary using the shared library:
git clone https://github.com/clwgg/MVNclust
cd MVNclust
make shared
Alternatively, you can use the version of GSL included as a submodule. This is the option to choose if the required version of the GSL is not available, if you do not have root permissions, or if you would like to compile an mvnclust binary that includes the GSL code statically, for example to share it with a system where the system library is not installed.
For that, clone the repository recursively:
git clone --recursive https://github.com/clwgg/MVNclust
This will clone both the MVNclust code and the GSL. Please note that you will need libtool installed to compile the library, along with the regular GNU toolchain for compilation.
After cloning, first compile the submodule, and then the MVNclust code:
cd MVNclust
make submodules
make static
This will create the static mvnclust binary, which you can copy or move anywhere for subsequent use.
When updating to the current version, please make sure to also update the submodules:
git pull origin master
git submodule update
make submodules
make
Usage: ./mvnclust [options] file.tsv
Options:
-k Number of clusters (default: k = 2)
-a File name for cluster assignment results (optional)
-s Simulate -s samples from a -d dimensional mixture of -k clusters (triggers simulation over EM)
-d Number of dimensions for simulation (only useful with -s)
-v Set verbosity - {0, 1, 2} (default 0)
The input file should be tab-separated, with one sample per row and one dimension per column.
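As an illustration of this layout, the following sketch writes a small input file with Python's standard library. It assumes the file carries no header row and no row labels, which is not stated explicitly here; the function name is hypothetical.

```python
import csv


def write_tsv(rows, path):
    """Write one sample per row and one dimension per column, tab-separated.
    Assumes no header row (an assumption about the expected input)."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all samples must have the same number of dimensions")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
```

For example, `write_tsv([[0.1, 1.2, 3.4], [0.3, 0.9, 2.8]], "infile.tsv")` would produce a file with two samples in three dimensions.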
The -k flag controls the number of mixture components (clusters) which will be used.
The -a flag writes a file with the cluster assignment (first column) and an uncertainty estimate (second column) for each data point (row) in the input.
The -s and -d flags are used for simulation and control the number of data points and dimensions, respectively. In the case of simulation, -k controls the number of mixture components to simulate, and file.tsv is the output file for simulated data.
The -v flag controls the verbosity of the output.
Cluster input into three components:
./mvnclust -k 3 infile.tsv
Cluster input into three components, with higher verbosity and an assignment output file:
./mvnclust -k 3 -v 1 -a assignments.tsv infile.tsv
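The assignments file written by -a can then be inspected with a few lines of code. The sketch below assumes the two columns are tab-separated, with an integer cluster label followed by a numeric uncertainty and no header row. These format details, the threshold, and the function names are assumptions for illustration, not documented behavior.

```python
def parse_assignments(lines):
    """Parse (cluster label, uncertainty) pairs from tab-separated lines.
    Assumes integer labels and no header row."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines, e.g. a trailing newline
        label, uncertainty = line.split("\t")[:2]
        records.append((int(label), float(uncertainty)))
    return records


def uncertain_records(records, threshold=0.1):
    """Indices of records whose uncertainty exceeds the (arbitrary) threshold."""
    return [i for i, (_, u) in enumerate(records) if u > threshold]
```

A typical use would be `records = parse_assignments(open("assignments.tsv"))`, followed by `uncertain_records(records)` to list the data points that are not clearly assigned to one component.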
Simulate 1000 data points from an 8-dimensional mixture with 4 components:
./mvnclust -k 4 -s 1000 -d 8 outfile.tsv
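For readers who want to see what drawing from a multivariate Gaussian mixture involves, here is a minimal standard-library sketch: pick a component according to the mixing weights, then transform standard normal draws with the Cholesky factor of that component's covariance. It is not necessarily how the MVNclust simulator works, and all names and parameters are hypothetical.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular L with L L^T == cov (cov symmetric positive definite)."""
    d = len(cov)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def sample_mixture(n, weights, means, covs, seed=None):
    """Draw n points from a Gaussian mixture; returns (samples, labels)."""
    rng = random.Random(seed)
    factors = [cholesky(c) for c in covs]
    d = len(means[0])
    samples, labels = [], []
    for _ in range(n):
        # choose the component, then map standard normals through L
        k = rng.choices(range(len(weights)), weights=weights)[0]
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x = [means[k][i] + sum(factors[k][i][j] * z[j] for j in range(i + 1))
             for i in range(d)]
        samples.append(x)
        labels.append(k)
    return samples, labels
```

Each returned sample is a list of d coordinates, so the result can be written out one sample per row and one dimension per column, matching the input layout that mvnclust expects.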