GPyM_TM

GPyM_TM is a Python package for topic modelling with either the Dirichlet multinomial mixture model fitted by Gibbs sampling (GSDMM) [1] or the Gamma Poisson mixture model (GPM) [2]. Each model is provided in its own class, GSDMM and GPM respectively. The package is also available on PyPI.

Preamble

The aim of topic modelling is to extract latent topics from large corpora. GSDMM [1] and GPM [2] assume that each document belongs to a single topic, which is a suitable assumption for some short texts. Given an initial number of topics, K, these algorithms cluster the documents and extract the topical structures present within the corpus. If K is set to a high value, the models also learn the number of clusters automatically, since topics that attract no documents are left empty.

[1] Yin, J. and Wang, J., 2014, August. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 233-242)

[2] Mazarura, J., de Waal, A. and de Villiers, P., 2020. A Gamma-Poisson Mixture Topic Model for Short Text. Mathematical Problems in Engineering, 2020

Further details about the GPM can be found in my thesis here.
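
As a rough illustration of the single-topic assumption, and of how a generous K ends up only partly used, the following is a toy collapsed Gibbs sampler for the Dirichlet multinomial mixture, written from the sampling rule described in [1]. It is a minimal sketch and not the code used inside GPyM_TM: the function name gsdmm_sketch, its arguments and the token-list input format are assumptions made for this example.

```python
import math
import random
from collections import Counter


def gsdmm_sketch(docs, K, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Toy collapsed Gibbs sampler for the Dirichlet multinomial mixture.

    docs is a list of token lists (an assumed format for this sketch).
    Returns one cluster label per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    m = [0] * K                          # number of documents in each cluster
    n = [0] * K                          # number of words in each cluster
    nw = [Counter() for _ in range(K)]   # per-cluster word counts
    counts = [Counter(doc) for doc in docs]

    def move(d, z, sign):
        """Add (sign=+1) or remove (sign=-1) document d from cluster z."""
        m[z] += sign
        n[z] += sign * len(docs[d])
        for w, c in counts[d].items():
            nw[z][w] += sign * c

    # Random initial assignment: every document gets exactly one cluster.
    labels = [rng.randrange(K) for _ in docs]
    for d, z in enumerate(labels):
        move(d, z, +1)

    for _ in range(iters):
        for d in range(len(docs)):
            move(d, labels[d], -1)
            log_p = []
            for z in range(K):
                # Cluster popularity term (its denominator is constant in z).
                lp = math.log(m[z] + alpha)
                # Word similarity term, accumulated in log space.
                for w, c in counts[d].items():
                    for j in range(c):
                        lp += math.log(nw[z][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(n[z] + vocab_size * beta + i)
                log_p.append(lp)
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            labels[d] = rng.choices(range(K), weights=weights)[0]
            move(d, labels[d], +1)
    return labels


docs = [["cat", "dog"], ["dog", "cat", "pet"],
        ["stock", "market"], ["market", "trade", "stock"]]
labels = gsdmm_sketch(docs, K=8)
print(len(set(labels)), "of 8 clusters are in use")
```

Clusters that hold no documents after sampling are simply unused, which is the sense in which a large K lets the number of clusters be learned.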

Getting Started:

The package is available online for use within Python 3 environments.

It can be installed with a standard 'pip' install command, as provided below:

pip install GPyM-TM

Prerequisites:

The package has several dependencies (random, math and re are part of the Python standard library and need no separate installation), namely:

  • numpy
  • random
  • math
  • pandas
  • re
  • nltk
  • gensim
  • scipy

GSDMM

Function and class description:

The class is named GSDMM and its function is named DMM.

The function takes up to 6 arguments: 2 are required and the remaining 4 are optional.

The required arguments are:

  • corpus - a text file that has been cleaned and loaded into Python. The text should be all lowercase, with all punctuation and numbers removed.
  • nTopics - the number of topics.
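
The cleaning described for corpus can be sketched with the re module. This is only an illustration: the helper name clean_document is hypothetical, and whether the package expects whole strings or token lists is not stated here, so check that against the tutorial before use.

```python
import re


def clean_document(text):
    """Lowercase the text and keep only the letters a-z and single spaces."""
    text = text.lower()
    # Replace punctuation, digits and any other non a-z character with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(text.split())


raw_docs = ["Topic models, in 2020!", "Short TEXT: 140 characters."]
cleaned = [clean_document(doc) for doc in raw_docs]
print(cleaned)  # ['topic models in', 'short text characters']
```

Note that this pattern also drops accented and other non-ASCII letters, which may or may not be wanted for a given corpus.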

The optional arguments are:

  • alpha, beta - the distribution-specific parameters. (The default for both is 0.1.)
  • nTopWords - the number of top words per topic. (The default is 10.)
  • iters - the number of Gibbs sampler iterations. (The default is 15.)

Output:

The function provides several components of output, namely:

  • psi - topic x word matrix.
  • theta - document x topic matrix.
  • topics - the top words per topic.
  • assignments - the final topic assignment of each document, together with the topic numbers of the selected topics only.
  • Final k - the final number of selected topics.
  • coherence - the coherence score, which is a performance measure.
  • selected_theta
  • selected_psi

GPM

Function and class description:

The class is named GPM and its function is also named GPM.

The function takes up to 8 arguments: 2 are required and the remaining 6 are optional.

The required arguments are:

  • corpus - a text file that has been cleaned and loaded into Python. The text should be all lowercase, with all punctuation and numbers removed.
  • nTopics - the number of topics.

The optional arguments are:

  • alpha, beta and gam - the distribution-specific parameters. (The defaults are alpha = 0.001, beta = 0.001 and gam = 0.1.)
  • nTopWords - the number of top words per topic. (The default is 10.)
  • iters - the number of Gibbs sampler iterations. (The default is 15.)
  • N - a parameter used to normalize the document lengths, which is required for the Poisson model.
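
How N is applied is not spelled out above. One plausible reading, given here purely as an assumption and not taken from the package source, is that each document's word counts are rescaled so that they sum to N, which puts long and short documents on the same scale before Poisson rates are estimated. The helper name normalise_counts is hypothetical.

```python
from collections import Counter


def normalise_counts(tokens, N):
    """Rescale a document's word counts so that they sum to N.

    This is an assumed interpretation of the N argument, for illustration only.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty document has nothing to rescale
    return {word: count * N / total for word, count in counts.items()}


scaled = normalise_counts(["rain", "rain", "sun", "wind"], N=8)
print(scaled)  # {'rain': 4.0, 'sun': 2.0, 'wind': 2.0}
```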

Output:

The function provides several components of output, namely:

  • psi - topic x word matrix.
  • theta - document x topic matrix.
  • topics - the top words per topic.
  • assignments - the final topic assignment of each document, together with the topic numbers of the selected topics only.
  • Final k - the final number of selected topics.
  • coherence - the coherence score, which is a performance measure.
  • selected_theta
  • selected_psi

Example Usage:

A more comprehensive tutorial is also available.

Installation:

Run the following command in a terminal or command window:

pip install GPyM-TM

Implementation:

Import the package into the relevant Python script with the following:

from GPyM_TM import GSDMM
from GPyM_TM import GPM

Call the class:

Possible examples of calling the GSDMM function are as follows:

data_DMM = GSDMM.DMM(corpus, nTopics)

data_DMM = GSDMM.DMM(corpus, nTopics, alpha = 0.25, beta = 0.15, nTopWords = 12, iters = 5)

Possible examples of calling the GPM function are as follows:

data_GPM = GPM.GPM(corpus, nTopics)

data_GPM = GPM.GPM(corpus, nTopics, alpha = 0.002, beta = 0.03, gam = 0.06, nTopWords = 12, iters = 7, N = 8)

Results:

The output obtained for the Dirichlet multinomial mixture model appears as follows:

[Image: GSDMM output]

The output obtained for the Gamma Poisson mixture model appears as follows:

[Image: GPM output]

Built With:

Google Colab - Hosted notebook environment

Python - Programming language of choice

PyPI - Distribution

Authors:

Jocelyn Mazarura

Co-Authors:

I would like to extend a special thank you to my colleagues Alta de Waal and Ricardo Marques. None of this would have been possible without either of you.

Thank you!

License:

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments:

University of Pretoria