Projects-Python

A practical demonstration of the scikit-learn library for building various classification and regression models.

Description

The goal of topic modeling is to find the topics present in a corpus. Each document in the corpus is made up of at least one topic, and often several. This notebook covers the steps for running Latent Dirichlet Allocation (LDA), one of many topic modeling techniques, which was designed specifically for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up. Once the technique has been applied, your job as a human is to interpret the results and check whether the mix of words in each topic makes sense. If it does not, you can try changing the number of topics, the terms in the document-term matrix, or the model parameters, or you can try a different model altogether.
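The two inputs named above can be illustrated without any library. The following is a minimal sketch, assuming a toy three-document corpus and naive whitespace tokenization (both are illustrative and not taken from the notebook); it only shows what a document-term matrix looks like, not how the notebook builds one.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the rocket reached orbit",
    "the team won the game",
    "orbit insertion burn for the rocket",
]

# Naive tokenization: lowercase and split on whitespace.
# Real text needs proper cleaning (punctuation, stop words, lemmatization).
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: one column per distinct term, in sorted order.
vocab = sorted({term for tokens in tokenized for term in tokens})


def count_row(tokens):
    """Return the term counts of one document, aligned with vocab."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


# Input (1): the document-term matrix, one row per document.
dtm = [count_row(tokens) for tokens in tokenized]

# Input (2): the number of topics, chosen by you rather than learned.
num_topics = 2

print(vocab)
for row in dtm:
    print(row)
```

Changing "the terms in the document-term matrix" then amounts to changing which columns survive in `vocab`, for example by dropping very common words such as "the".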

The data set is the 20 Newsgroups collection, and LDA is used to extract the topics that are naturally discussed in it.

The notebook uses Latent Dirichlet Allocation (LDA) from the Gensim package, along with Mallet's implementation (accessed via Gensim). Mallet has an efficient implementation of LDA that is known to run faster and give better topic segregation.

Data set

The data can be obtained from: https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json
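The sketch below shows one way to turn such a file into a list of records using only the standard library. It assumes the JSON is column-oriented, with `content`, `target`, and `target_names` columns keyed by row index; that layout is an assumption about the file and should be checked against the actual download. The helper name `records_from_columns` and the inline sample are hypothetical.

```python
import json


def records_from_columns(raw_json):
    """Convert column-oriented JSON ({column: {row_id: value}}) to row dicts."""
    columns = json.loads(raw_json)
    # Row ids are assumed to be integer-like strings such as "0", "1", "2".
    row_ids = sorted(columns["content"], key=int)
    return [
        {name: values[row_id] for name, values in columns.items()}
        for row_id in row_ids
    ]


# Small inline sample in the assumed layout (not real data from the file).
sample = json.dumps({
    "content": {"0": "Text of a post about rockets.", "1": "Text of a post about hockey."},
    "target": {"0": 14, "1": 10},
    "target_names": {"0": "sci.space", "1": "rec.sport.hockey"},
})

records = records_from_columns(sample)
for record in records:
    print(record["target_names"], "->", record["content"])
```

To use the real file, the text fetched from the URL above (for example with `urllib.request.urlopen(url).read().decode("utf-8")`) can be passed to `records_from_columns` in place of `sample`.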