A one-stop Python library for fitting a wide range of mixture models, including mixtures of Gaussians, Student's t-distributions, factor analyzers, parsimonious Gaussians, and the MCLUST family.
- Why this library
- Installation and Quick Start
- Supported Models and Optimization Routines
- Contributing
- Citing this library
## Why this library

While there are several packages in R and Python that support various kinds of mixture models, each has its own API and syntax. Moreover, in almost all of these libraries, inference proceeds via Expectation-Maximization (a quasi-first-order method), which makes them ill-suited to high-dimensional data.
This library provides a seamless, unified interface for fitting a wide range of mixture models. Unlike many existing packages that rely on Expectation-Maximization for inference, our approach leverages Automatic Differentiation tools and gradient-based optimization, which makes it well equipped to handle high-dimensional data and second-order optimization routines.
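As a minimal illustration of this idea (a standalone sketch using the autograd package, not this library's internal code): once a log-likelihood is written in plain NumPy syntax, automatic differentiation yields its exact gradient, which any gradient-based optimizer can consume.

```python
### Standalone sketch of AD-driven likelihood optimization using autograd;
### illustrative only -- not this library's internal implementation.
import autograd.numpy as np
from autograd import grad

def neg_log_likelihood(mu, data):
    ### Negative log-likelihood of a unit-variance spherical Gaussian (up to a constant)
    return 0.5 * np.sum((data - mu) ** 2)

data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mu = np.zeros(2)
nll_grad = grad(neg_log_likelihood)  # exact gradient w.r.t. mu via autograd

for _ in range(200):  # plain gradient descent
    mu = mu - 0.01 * nll_grad(mu, data)
print(mu)  # approaches the sample mean [3., 4.]
```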
## Installation and Quick Start

Installation is straightforward:

```
pip install Mixture-Models
```

The package can then be imported as:

```python
from Mixture_Models import *
```
The estimation procedure consists of 3 simple steps:
```python
### Simulate some dummy data using the built-in function make_pinwheel
### (npr and plt are exposed by the star import above)
data = make_pinwheel(radial_std=0.3, tangential_std=0.05, num_classes=3,
                     num_per_class=100, rate=0.4, rs=npr.RandomState(0))

### Plot the three clusters
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(data[:, 0], data[:, 1], 'k.')
plt.show()
```
```python
### STEP 1 - Choose a mixture model to fit on your data
my_model = GMM(data)

### STEP 2 - Initialize your model with some parameters
init_params = my_model.init_params(num_components=3, scale=0.5)

### STEP 3 - Learn the parameters using some optimization routine
params_store = my_model.fit(init_params, "Newton-CG")
```
Once the model is trained on the data (a `numpy` matrix of shape `(num_datapoints, num_dim)`), post-hoc analysis can be performed:
```python
for params in params_store:
    print("likelihood", my_model.likelihood(params))
    print("aic, bic", my_model.aic(params), my_model.bic(params))

np.array(my_model.labels(data, params_store[-1]))  ## final predicted labels
```
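Because `aic` and `bic` are exposed directly, model selection can be scripted in a few lines. Below is a minimal sketch; the selection loop is our own illustration, while `GMM`, `init_params`, `fit`, and `bic` are the library calls shown above.

```python
### Sketch: choosing the number of components by BIC
### (the loop is illustrative; GMM, init_params, fit and bic are shown above)
best_k, best_bic = None, float("inf")
for k in range(1, 6):
    model = GMM(data)
    init = model.init_params(num_components=k, scale=0.5)
    store = model.fit(init, "Newton-CG")
    bic = model.bic(store[-1])
    if bic < best_bic:
        best_k, best_bic = k, bic
print("selected number of components:", best_k)
```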
Example notebooks are available on the project GitHub repo.
## Supported Models and Optimization Routines

The library currently supports more than 30 different mixture models, spread across five model families. Here is a brief overview of the supported model families; a short usage sketch follows the list:
GMM
: Standard Gaussian mixture model

GMM_Constrainted
: GMM with common covariance across components

Mclust
: MCLUST family of constrained GMMs

MFA
: Mixture of factor analyzers

PGMM
: Parsimonious GMM extension with constraints

TMM
: Mixture of t-distributions
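All of these classes are intended to share the three-step interface from the quick start, so switching model families is, in the simplest case, a one-line change. A sketch under that assumption (whether a given model's `init_params` takes extra model-specific arguments should be checked against the example notebooks):

```python
### Sketch: the same three-step workflow with a different model family
### (assumes TMM accepts the same init_params arguments as GMM;
###  model-specific options should be checked in the example notebooks)
t_model = TMM(data)  # mixture of t-distributions
t_init = t_model.init_params(num_components=3, scale=0.5)
t_store = t_model.fit(t_init, "adam")
print(t_model.likelihood(t_store[-1]))
```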
The 'Examples' folder in the project repo includes more detailed illustrations for all these models, as well as a `README.md` for advanced users who want to fit custom mixture models or tinker with the settings of the above procedure.
Currently, four main gradient-based optimizers are available:

"grad_descent"
: Stochastic Gradient Descent (SGD) with momentum

"rms_prop"
: Root-mean-squared propagation (RMSProp)

"adam"
: Adaptive moments (Adam)

"Newton-CG"
: Newton-Conjugate Gradient (Newton-CG)
The details about each optimizer and its optional input parameters are given in the PDF in the 'Examples' folder.
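Since the optimizer is selected by a string argument to `fit`, comparing routines from the same initialization takes only a loop. A minimal sketch using the calls shown above (optimizer-specific options are omitted; see the PDF for those):

```python
### Sketch: compare the four optimizers from a common initialization
### (uses only the fit/likelihood calls shown above; optional optimizer
###  arguments are documented in the 'Examples' PDF)
for opt in ["grad_descent", "rms_prop", "adam", "Newton-CG"]:
    store = my_model.fit(init_params, opt)
    print(opt, "final likelihood:", my_model.likelihood(store[-1]))
```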
The output of the `fit` method is the sequence of all points in the parameter space that the optimizer traversed during optimization, i.e. a list of parameters whose final entry is the fitted solution.
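This trace makes convergence diagnostics straightforward: replaying it through `likelihood` shows how the objective evolved. A sketch (the matplotlib plotting is our own addition; `likelihood` and the list-valued output of `fit` are described above):

```python
### Sketch: replay the optimization trace to inspect convergence
### (likelihood and the list-valued output of fit are described above;
###  the plotting itself is our own addition)
trace = [my_model.likelihood(params) for params in params_store]
plt.plot(trace)
plt.xlabel("optimizer step")
plt.ylabel("likelihood")
plt.show()

final_params = params_store[-1]  # the fitted solution
```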
A detailed notebook, `Optimizers_illustration.ipynb`, is available in the 'Examples' folder on GitHub.
## Contributing

We welcome contributions to our library. Our code base is highly modularized, making it easy for new contributors to extend its capabilities and add support for additional models. If you are interested in contributing, check out the contribution guide.
If you're unsure where to start, check out our open issues for inspiration on the kinds of problems you could work on. Alternatively, you can open a new issue so we can discuss the best strategy for integrating your work.
## Citing this library

If you use this package, please consider citing our research:

```
@article{kasa2024mixture,
  title={Mixture-Models: a one-stop Python Library for Model-based Clustering using various Mixture Models},
  author={Kasa, Siva Rajesh and Yijie, Hu and Kasa, Santhosh Kumar and Rajan, Vaibhav},
  journal={arXiv preprint arXiv:2402.10229},
  year={2024}
}

@article{kasa2020model,
  title={Model-based Clustering using Automatic Differentiation: Confronting Misspecification and High-Dimensional Data},
  author={Kasa, Siva Rajesh and Rajan, Vaibhav},
  journal={arXiv preprint arXiv:2007.12786},
  year={2020}
}
```