An R package that simultaneously performs factor analysis and cluster analysis of count data via parsimonious finite mixtures of multivariate Poisson-log normal factor analyzers. The model permits parsimonious covariance structures and dimension reduction, thus reducing the number of free parameters to be estimated.

mixMPLNFA

Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data

Description

mixMPLNFA is an R package for clustering count data using the parsimonious mixtures of multivariate Poisson-log normal factor analyzers (MPLNFA) family, with parameters estimated via variational Gaussian approximations. It was developed for count data, with clustering of RNA sequencing data as the motivation; however, the method may be applied to other types of count data. The model imposes a factor analyzer structure on the covariance matrices, which reduces the number of free covariance parameters to be estimated. With this structure, the number of covariance parameters is linear in the data dimensionality, making the family well suited to the analysis of high-dimensional discrete data. The package provides functions for data simulation and for clustering, with parameter estimation via a variational Gaussian approximation combined with an expectation-maximization (EM) algorithm. Information criteria (AIC, BIC, AIC3 and ICL) are offered for model selection.

Installation

To install the latest version of the package:

require("devtools")
devtools::install_github("anjalisilva/mixMPLNFA", build_vignettes = TRUE)
library("mixMPLNFA")

To run the Shiny app (under construction):

mixMPLNFA::runMixMPLNFA()

Overview

To list all functions available in the package:

ls("package:mixMPLNFA")

mixMPLNFA contains 4 functions:

  1. mplnFADataGenerator generates simulated data with a known number of latent factors, a known covariance structure model and a known number of clusters/components via mixtures of multivariate Poisson-log normal factor analyzers.
  2. MPLNFAClust carries out clustering of count data using parsimonious mixtures of multivariate Poisson-log normal factor analyzers. The input can be a user-provided count dataset or a dataset generated via the mplnFADataGenerator() function.
  3. mplnFAVisLine visualizes clustering results as line plots.
  4. runMixMPLNFA is the Shiny implementation of MPLNFAClust (under construction).
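
To make the first item concrete, the following base R sketch simulates counts from a two-component mixture of multivariate Poisson-log normal factor analyzers. It does not call mplnFADataGenerator(), whose arguments are not documented in this README. The function name simulateMPLNFA, its arguments and all parameter values are illustrative assumptions, and normalization factors for sequencing depth are left out.

```r
# Illustrative sketch only: simulate counts from a G-component mixture of
# multivariate Poisson-log normal factor analyzers using base R.
# This is NOT the package's mplnFADataGenerator(); names and values are assumed.
simulateMPLNFA <- function(n, mixingProportions, mu, Lambda, psi) {
  G <- length(mixingProportions)
  d <- nrow(Lambda[[1]])
  # Draw a component label for each observation
  labels <- sample(seq_len(G), size = n, replace = TRUE,
                   prob = mixingProportions)
  counts <- matrix(0L, nrow = n, ncol = d)
  for (i in seq_len(n)) {
    g <- labels[i]
    q <- ncol(Lambda[[g]])
    u <- rnorm(q)                                      # latent factors, N(0, I_q)
    eps <- rnorm(d, mean = 0, sd = sqrt(psi[[g]]))     # noise with diagonal Psi_g
    x <- mu[[g]] + as.vector(Lambda[[g]] %*% u) + eps  # latent log-scale mean
    counts[i, ] <- rpois(d, lambda = exp(x))           # observed counts
  }
  list(dataset = counts, trueMembership = labels)
}

set.seed(1234)
d <- 6
q <- 2
sim <- simulateMPLNFA(
  n = 200,
  mixingProportions = c(0.6, 0.4),
  mu = list(rep(2, d), rep(5, d)),
  Lambda = list(matrix(rnorm(d * q, sd = 0.3), d, q),
                matrix(rnorm(d * q, sd = 0.3), d, q)),
  psi = list(rep(0.1, d), rep(0.2, d))
)
dim(sim$dataset)            # observations by dimensions
table(sim$trueMembership)   # component sizes
```

Each latent vector has covariance Lambda %*% t(Lambda) + diag(psi) within its component, which is the factor analyzer structure described in the Details section.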

For more information, see the Details section below. An overview of the package is illustrated below:

Details

Mixture model-based clustering methods can be over-parameterized in high-dimensional spaces, especially as the number of clusters increases. Subspace clustering makes it possible to cluster data in low-dimensional subspaces while keeping all the dimensions, by introducing restrictions on the mixture parameters (Bouveyron and Brunet, 2014). These restrictions aim to yield parsimonious models that remain sufficiently flexible for clustering purposes. Since the covariance matrices contribute the largest share of free parameters, they are a natural focus for the introduction of parsimony.
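
A short calculation shows how quickly the covariance matrices dominate the parameter count. The sketch below compares unrestricted covariance matrices with a factor analyzer decomposition (loadings times their transpose plus a diagonal matrix) using q latent factors. The function names and the values of G, d and q are illustrative assumptions, not package code.

```r
# Free covariance parameters in a G-component mixture with unrestricted
# d x d covariance matrices: each symmetric matrix has d(d + 1)/2 entries.
fullCovParams <- function(G, d) G * d * (d + 1) / 2

# Factor analyzer structure with q factors, per component:
# d*q - q(q - 1)/2 free loadings (after removing rotational freedom)
# plus d diagonal error variances.
factorCovParams <- function(G, d, q) G * (d * q - q * (q - 1) / 2 + d)

dims <- c(10, 100, 1000)
data.frame(d = dims,
           full = fullCovParams(G = 3, d = dims),
           factor = factorCovParams(G = 3, d = dims, q = 2))
```

With three components and two factors, the unrestricted count grows quadratically in d (165, 15150 and 1501500), while the factor analyzer count grows linearly (87, 897 and 8997).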

The factor analysis model was introduced by Spearman (1904) and is useful for modeling the covariance structure of high-dimensional data using a small number of latent variables. The mixture of factor analyzers model was later introduced by Ghahramani et al. (1996); it concurrently performs clustering and, within each cluster, local dimensionality reduction. In 2008, a family of eight parsimonious Gaussian mixture models (PGMMs; McNicholas and Murphy, 2008) with parsimonious covariance structures was introduced. In 2019, a model-based clustering methodology using mixtures of multivariate Poisson-log normal distributions (MPLN; Aitchison and Ho, 1989) was developed by Silva et al. (2019) to analyze multivariate count measurements. In the current work, a family of mixtures of MPLN factor analyzers analogous to the PGMM family is developed by considering the general mixture of factor analyzers model ($\mathbf{\Sigma}_g = \mathbf{\Lambda}_g \mathbf{\Lambda}_g^{\prime} + \mathbf{\Psi}_g$) and by allowing the constraints $\mathbf{\Lambda}_g = \mathbf{\Lambda}$, $\mathbf{\Psi}_g = \mathbf{\Psi}$, and the isotropic constraint $\mathbf{\Psi}_g = \psi_g \mathbf{I}_d$. This new family is referred to as the parsimonious mixtures of MPLN factor analyzers family (MPLNFA). The proposed model simultaneously performs factor analysis and cluster analysis by assuming that the discrete observed data have been generated by a factor analyzer model with continuous latent variables. See the vignette for more details.
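
Switching each of the three constraints on or off gives 2 x 2 x 2 = 8 covariance structures. The sketch below enumerates them and counts the free covariance parameters, using the counts given for the PGMM family by McNicholas and Murphy (2008). The three-letter labels (C = constrained, U = unconstrained, in the order loadings, error variance, isotropy) follow the PGMM convention; that the package uses the same labels is an assumption, and covParamCount is a hypothetical helper.

```r
# Enumerate the eight constraint combinations and count free covariance
# parameters for G components, d dimensions and q latent factors.
# Illustrative helper; not part of the package.
covParamCount <- function(G, d, q) {
  models <- expand.grid(isotropic = c("C", "U"),
                        psi = c("C", "U"),
                        lambda = c("C", "U"),
                        stringsAsFactors = FALSE)
  loadings <- d * q - q * (q - 1) / 2   # free entries in one loading matrix
  # One shared loading matrix if constrained, otherwise one per component
  nLambda <- ifelse(models$lambda == "C", 1, G) * loadings
  # Psi: shared or per component, times 1 value (isotropic) or d values (diagonal)
  nPsi <- ifelse(models$psi == "C", 1, G) *
    ifelse(models$isotropic == "C", 1, d)
  data.frame(model = paste0(models$lambda, models$psi, models$isotropic),
             covParams = nLambda + nPsi)
}

covParamCount(G = 3, d = 100, q = 2)
```

For G = 3, d = 100 and q = 2, the most constrained structure (CCC) has 200 free covariance parameters and the least constrained (UUU) has 897.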

Variational-EM Framework for Parameter Estimation

Subedi and Browne (2020) proposed a framework for parameter estimation that uses a variational Gaussian approximation (VGA) for MPLN-based mixture models. Markov chain Monte Carlo expectation-maximization (MCMC-EM) has also been used for parameter estimation of MPLN-based mixture models, but VGA was shown to be computationally efficient (Silva et al., 2023) and alleviates challenges of the MCMC-EM algorithm. Here, the posterior distribution is approximated by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities. A variational-EM framework is used for parameter estimation.
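
The link between the KL divergence and the quantity that is optimized can be written out explicitly. The notation here is assumed for illustration and is not taken from the package: $\mathbf{y}$ is an observed count vector, $\boldsymbol{\theta}$ is the latent log-scale variable, and $q$ is the approximating Gaussian density.

$$
\log p(\mathbf{y}) = F(q, \mathbf{y}) + D_{KL}\big(q \,\|\, p(\boldsymbol{\theta} \mid \mathbf{y})\big),
\qquad
F(q, \mathbf{y}) = \int q(\boldsymbol{\theta}) \log \frac{p(\mathbf{y}, \boldsymbol{\theta})}{q(\boldsymbol{\theta})} \, d\boldsymbol{\theta}
$$

Because $\log p(\mathbf{y})$ does not depend on $q$ and the KL term is non-negative, minimizing the KL divergence over $q$ is equivalent to maximizing the lower bound $F(q, \mathbf{y})$. A variational-EM scheme alternates between updating the variational parameters of $q$ and the model parameters, each step increasing this bound.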

Model Selection and Other Details

Four model selection criteria are offered, which include the Akaike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), a variation of the AIC used by Bozdogan (1994) called AIC3, and the integrated completed likelihood (ICL; Biernacki et al., 2000).
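
As a rough guide to how these criteria relate, the sketch below computes all four from a maximized log-likelihood. It assumes a "smaller is better" convention and a hypothetical helper infoCriteria; the package's own sign conventions and return values may differ. Here K is the number of free parameters, n the number of observations, and z an n x G matrix of posterior membership probabilities.

```r
# Sketch of the four criteria in a "smaller is better" convention.
# Illustrative helper; not part of the package.
infoCriteria <- function(logLik, K, n, z) {
  aic  <- -2 * logLik + 2 * K
  aic3 <- -2 * logLik + 3 * K
  bic  <- -2 * logLik + K * log(n)
  # ICL adds an entropy-type penalty based on the MAP classification:
  # -2 * sum_i log(max_g z_ig), which is zero when assignments are certain.
  icl  <- bic - 2 * sum(log(apply(z, 1, max)))
  c(AIC = aic, AIC3 = aic3, BIC = bic, ICL = icl)
}

zHard <- diag(2)[rep(1:2, each = 25), ]   # certain assignments
zSoft <- matrix(c(0.7, 0.3), nrow = 50, ncol = 2, byrow = TRUE)  # uncertain
infoCriteria(logLik = -100, K = 5, n = 50, z = zHard)
infoCriteria(logLik = -100, K = 5, n = 50, z = zSoft)
```

With certain assignments, ICL equals BIC; with uncertain assignments, ICL is larger, so it penalizes solutions with poorly separated clusters.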

Starting values play an important role in the successful operation of this algorithm. There may be issues with singularity, in which case altering the initialization method, or the initialization values by setting a different seed, may help. See the function examples or vignette for details.
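
The effect of the seed on starting values can be explored outside the package. The following is a minimal sketch, assuming a k-means start on log-transformed counts; it is not the package's initialization routine, and initWithSeeds and the toy data are hypothetical.

```r
# Sketch: try several seeds for a k-means initialization on log-transformed
# counts and keep the start with the smallest total within-cluster sum of squares.
initWithSeeds <- function(counts, G, seeds) {
  logCounts <- log(counts + 1)   # offset avoids log(0)
  fits <- lapply(seeds, function(s) {
    set.seed(s)
    kmeans(logCounts, centers = G, nstart = 1)
  })
  best <- which.min(vapply(fits, function(f) f$tot.withinss, numeric(1)))
  list(seed = seeds[best], labels = fits[[best]]$cluster)
}

set.seed(1)
toyCounts <- rbind(matrix(rpois(150, lambda = 5), ncol = 3),
                   matrix(rpois(150, lambda = 50), ncol = 3))
init <- initWithSeeds(toyCounts, G = 2, seeds = c(10, 20, 30))
table(init$labels)
```

If a run fails or returns a degenerate solution, repeating it from a different seed in this way is a cheap first check before changing the initialization method.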

Shiny App

The Shiny app, which employs MPLNFAClust, can be run to perform clustering and visualize the results:

mixMPLNFA::runMixMPLNFA()

Tutorials

For tutorials and plot interpretation, refer to the vignette (under construction):

browseVignettes("mixMPLNFA")

Citation for Package

citation("mixMPLNFA")

Payne, A., A. Silva, S. J. Rothstein, P. D. McNicholas, and S. Subedi (2023) Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data. Unpublished.

A BibTeX entry for LaTeX users is

  @unpublished{,
  title        = "Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data",
  author       = "A. Payne and A. Silva and S. J. Rothstein and P. D. McNicholas and S. Subedi",
  note         = "Unpublished",
  year         = "2023",
  }

References

Authors

Maintainer

Contributions

The mixMPLNFA repository welcomes issues, enhancement requests, and other contributions. To submit an issue, use GitHub issues.

Acknowledgments

  • Dr. Marcelo Ponce, SciNet HPC Consortium, University of Toronto, ON, Canada for all the computational support.
  • Early work was funded by the Natural Sciences and Engineering Research Council of Canada (Subedi) and a Queen Elizabeth II Graduate Scholarship (Silva).
  • Later work was supported by the Postdoctoral Fellowship award from the Canadian Institutes of Health Research (Silva) and the Canada Natural Sciences and Engineering Research Council grant 400920-2013 (Subedi).