little-mallet-wrapper

This is a little Python wrapper around the topic modeling functions of MALLET.

Currently under construction; please send feedback/requests to Maria Antoniak.

Updates

v0.0.12: Import and training functions now display MALLET output and error messages.

Installation

pip install little_mallet_wrapper==0.0.12

Requirements

Python 3.7
MALLET
pandas
numpy
seaborn (for plotting functions)

Usage

See demo.ipynb for a demonstration of how to use the functions in little-mallet-wrapper.

Documentation

`print_dataset_stats(training_data)`

Displays basic statistics about the training dataset.

Name	Type	Description
`training_data`	list of strings	Documents that will be used to train the topic model.
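
As a rough illustration of the kind of statistics such a summary can contain, here is a minimal sketch. The helper name `dataset_stats`, the whitespace tokenization, and the choice of statistics are assumptions for this example, not the library's implementation.

```python
def dataset_stats(training_data):
    """Return simple corpus statistics for a list of document strings."""
    # Whitespace tokenization is an assumption made for this sketch.
    tokenized = [doc.split() for doc in training_data]
    num_documents = len(tokenized)
    total_words = sum(len(tokens) for tokens in tokenized)
    vocabulary = {word for tokens in tokenized for word in tokens}
    return {
        "num_documents": num_documents,
        "mean_words_per_document": total_words / num_documents if num_documents else 0.0,
        "vocabulary_size": len(vocabulary),
    }


stats = dataset_stats(["the cat sat", "the dog sat down"])
print(stats)
```

Checking these numbers before training is a quick way to catch empty documents or an unexpectedly small vocabulary.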

`process_string(text, lowercase=True, remove_short_words=True, remove_stop_words=True, remove_punctuation=True, numbers='replace', stop_words=STOPS)`

A simple string processor that prepares raw text for topic modeling.

Name	Type	Description
`text`	string	Individual document to process.
`lowercase`	boolean	Whether or not to lowercase the text.
`remove_short_words`	boolean	Whether or not to remove words with fewer than 2 characters.
`remove_stop_words`	boolean	Whether or not to remove stopwords.
`remove_punctuation`	boolean	Whether or not to remove punctuation (not A-Za-z0-9)
`numbers`	string	'replace' replaces all numbers with the normalized token NUM; 'remove' removes all numbers.
`stop_words`	list of strings	Custom list of words to remove.
RETURNS	string	Processed version of the input text.
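
To make the options above concrete, the following sketch re-creates the described steps in plain Python. It is not the library's code: the function name `simple_process_string`, the tiny `DEMO_STOPS` list, the order of the steps, and the treatment of a number as a token made up only of digits are all assumptions.

```python
import re

# Hypothetical stop word list for illustration; the library ships its own STOPS.
DEMO_STOPS = ["the", "a", "an", "and", "of", "in", "to", "is"]


def simple_process_string(text, lowercase=True, remove_short_words=True,
                          remove_stop_words=True, remove_punctuation=True,
                          numbers="replace", stop_words=DEMO_STOPS):
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        # Replace every character that is not A-Za-z0-9 or whitespace with a space.
        text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    tokens = text.split()
    if numbers == "replace":
        # Normalize all-digit tokens to the single token NUM.
        tokens = ["NUM" if token.isdigit() else token for token in tokens]
    elif numbers == "remove":
        tokens = [token for token in tokens if not token.isdigit()]
    if remove_stop_words:
        stops = set(stop_words)
        tokens = [token for token in tokens if token not in stops]
    if remove_short_words:
        # Drop words with fewer than 2 characters.
        tokens = [token for token in tokens if len(token) >= 2]
    return " ".join(tokens)


print(simple_process_string("The 3 cats sat in a box!"))
```

With the library installed, the equivalent call would use `process_string` with the parameters documented above.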

`quick_train_topic_model(path_to_mallet, output_directory_path, num_topics, training_data)`

Imports training data, trains an LDA topic model using MALLET, and returns the topic keys and document distributions.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`output_directory_path`	string	Path to where the output files should be stored.
`num_topics`	integer	The number of topics to use for training.
`training_data`	list of strings	Processed documents for training the topic model.
RETURNS	list of lists of strings	The 20 most probable words for each topic.
RETURNS	list of lists of floats	Topic distribution (list of probabilities) for each document.
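
A possible way to call this function is sketched below. The wrapper `quick_train`, the helper `summarize_topics`, and the optional `lmw` argument (used here so the package is only imported when the function actually runs) are hypothetical. The sketch also assumes the two return values arrive as a pair in the order listed above.

```python
def quick_train(path_to_mallet, output_directory_path, training_data,
                num_topics=20, lmw=None):
    """Train a model in one call and return (topic_keys, topic_distributions)."""
    if lmw is None:
        # Requires the little_mallet_wrapper package and a local MALLET install.
        import little_mallet_wrapper as lmw
    topic_keys, topic_distributions = lmw.quick_train_topic_model(
        path_to_mallet, output_directory_path, num_topics, training_data)
    return topic_keys, topic_distributions


def summarize_topics(topic_keys, num_words=5):
    """Format the first few words of each topic as one readable line."""
    return [f"Topic {i}: {' '.join(words[:num_words])}"
            for i, words in enumerate(topic_keys)]


# Example with made-up topic keys (no MALLET run needed for this part).
for line in summarize_topics([["cat", "dog", "bird"], ["sun", "moon"]], num_words=2):
    print(line)
```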

`import_data(path_to_mallet, path_to_training_data, path_to_formatted_training_data, training_data, use_pipe_from=None)`

Imports the training data into MALLET formatted data that can be used for training.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`path_to_training_data`	string	Path to where the training data should be stored.
`path_to_formatted_training_data`	string	Path to where the MALLET formatted training data should be stored.
`training_data`	list of strings	Processed documents for training the topic model.
`use_pipe_from`	string	If you want to import the documents using the same model as a previous set of documents, include the path to the previous MALLET formatted training data.

`train_topic_model(path_to_mallet, path_to_formatted_training_data, path_to_model, path_to_topic_key, path_to_topic_distributions, num_topics)`

Trains an LDA topic model using MALLET.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`path_to_formatted_training_data`	string	Path to where the MALLET formatted training data is stored.
`path_to_model`	string	Path to where the model should be stored.
`path_to_topic_key`	string	Path to where the topic keys should be stored.
`path_to_topic_distributions`	string	Path to where the topic distributions should be stored.
`num_topics`	integer	The number of topics to use for training.
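
When more control over file locations is needed, `import_data` and `train_topic_model` can be chained by hand. The sketch below shows one way to do that. The helper names and every file name in `build_paths` are hypothetical, and the `lmw` argument exists only so the package is imported lazily.

```python
import os


def build_paths(output_directory_path, num_topics):
    """Collect the file paths used by the pipeline (file names are hypothetical)."""
    return {
        "training_data": os.path.join(output_directory_path, "training.txt"),
        "formatted_training_data": os.path.join(output_directory_path, "mallet.training"),
        "model": os.path.join(output_directory_path, f"mallet.model.{num_topics}"),
        "topic_keys": os.path.join(output_directory_path, f"mallet.topic_keys.{num_topics}"),
        "topic_distributions": os.path.join(
            output_directory_path, f"mallet.topic_distributions.{num_topics}"),
    }


def train_step_by_step(path_to_mallet, output_directory_path, training_data,
                       num_topics, lmw=None):
    if lmw is None:
        # Requires the little_mallet_wrapper package and a local MALLET install.
        import little_mallet_wrapper as lmw
    paths = build_paths(output_directory_path, num_topics)
    # Step 1: convert the documents into MALLET's formatted training data.
    lmw.import_data(path_to_mallet, paths["training_data"],
                    paths["formatted_training_data"], training_data)
    # Step 2: train the model and write topic keys and distributions to disk.
    lmw.train_topic_model(path_to_mallet, paths["formatted_training_data"],
                          paths["model"], paths["topic_keys"],
                          paths["topic_distributions"], num_topics)
    # Step 3: load the results back into Python.
    topic_keys = lmw.load_topic_keys(paths["topic_keys"])
    topic_distributions = lmw.load_topic_distributions(paths["topic_distributions"])
    return topic_keys, topic_distributions
```

Keeping the formatted training data and the model on disk makes it possible to reuse them later, for example with `use_pipe_from` and `infer_topics`.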

`load_topic_keys(topic_keys_path)`

Loads the set of most probable words for each topic after training a topic model.

Name	Type	Description
`topic_keys_path`	string	Path to where the topic keys are stored.
RETURNS	list of lists of strings	The 20 most probable words for each topic.

`load_topic_distributions(topic_distributions_path)`

Loads the topic distribution for each document after training a topic model.

Name	Type	Description
`topic_distributions_path`	string	Path to where the topic distributions are stored.
RETURNS	list of lists of floats	Topic distribution (list of probabilities) for each document.
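
For readers curious about what the two loaders return, here is a sketch that parses in-memory lines. The file layouts are assumptions: tab-separated topic key lines of the form topic id, weight, words, and tab-separated distribution lines of the form document index, document name, one probability per topic. Output from your MALLET version may differ, and the function names are hypothetical.

```python
def parse_topic_keys(lines):
    """Return one list of words per topic from assumed topic key lines."""
    topic_keys = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        topic_keys.append(fields[2].split())
    return topic_keys


def parse_topic_distributions(lines):
    """Return one list of float probabilities per document."""
    distributions = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip a header line, if present
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        distributions.append([float(p) for p in fields[2:] if p.strip()])
    return distributions


key_lines = ["0\t0.5\tcat dog bird\n", "1\t0.5\tsun moon star\n"]
distribution_lines = ["0\tdoc0\t0.9\t0.1\n", "1\tdoc1\t0.2\t0.8\n"]
print(parse_topic_keys(key_lines))
print(parse_topic_distributions(distribution_lines))
```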

`get_top_docs(training_data, topic_distributions, topic_index, n=5)`

Gets the documents with the highest probability for the target topic.

Name	Type	Description
`training_data`	list of strings	Processed documents that were used to train the topic model.
`topic_distributions`	list of lists of floats	Topic distribution (list of probabilities) for each document.
`topic_index`	integer	The index of the target topic.
`n`	integer	The number of documents to return.
RETURNS	list of tuples (float, string)	The topic probability and document text for the n documents with the highest probability for the target topic.
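
The ranking described above can be sketched in a few lines of plain Python. The function name `top_docs` is hypothetical, and this is an illustration of the idea rather than the library's code.

```python
def top_docs(training_data, topic_distributions, topic_index, n=5):
    """Return the n (probability, document) pairs with the highest topic probability."""
    scored = [(distribution[topic_index], document)
              for document, distribution in zip(training_data, topic_distributions)]
    # Sort by probability only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


documents = ["a", "b", "c"]
distributions = [[0.1, 0.9], [0.7, 0.3], [0.4, 0.6]]
print(top_docs(documents, distributions, topic_index=0, n=2))
```

Reading the top documents for a topic is often the quickest way to judge whether the topic's key words reflect a coherent theme.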

`plot_categories_by_topics_heatmap(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

If the dataset includes some type of categorical labels, creates a heatmap of the labels x topics.

Name	Type	Description
`labels`	list of strings	Document labels (e.g., authors of the documents, genres of the documents).
`topic_distributions`	list of lists of floats	Topic distribution (list of probabilities) for each document.
`topic_keys`	list of lists of strings	The 20 most probable words for each topic.
`output_path`	string	Path to where the resulting figure should be saved.
`target_labels`	list of strings	A subset of `labels` to use for plotting.
`dim`	tuple of integers	(x, y) dimensions for the resulting figure.
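
A labels x topics heatmap needs one value per label and topic. The sketch below computes the mean topic probability for each label, which is one plausible choice for those values. Whether the library applies further normalization is not specified here, and the function name is hypothetical.

```python
from collections import defaultdict


def mean_topic_probability_by_label(labels, topic_distributions, target_labels=None):
    """Return {label: [mean probability of topic 0, topic 1, ...]}."""
    sums = {}
    counts = defaultdict(int)
    for label, distribution in zip(labels, topic_distributions):
        if target_labels is not None and label not in target_labels:
            continue  # keep only the requested subset of labels
        if label not in sums:
            sums[label] = [0.0] * len(distribution)
        for topic_index, probability in enumerate(distribution):
            sums[label][topic_index] += probability
        counts[label] += 1
    return {label: [total / counts[label] for total in totals]
            for label, totals in sums.items()}


labels = ["poe", "poe", "austen"]
distributions = [[0.75, 0.25], [0.5, 0.5], [0.25, 0.75]]
print(mean_topic_probability_by_label(labels, distributions))
```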

`plot_categories_by_topic_boxplots(labels, topic_distributions, topic_keys, output_path=None, target_labels=None, dim=None)`

If the dataset includes some type of categorical labels, creates a set of boxplots, one plot for each topic.

Name	Type	Description
`labels`	list of strings	Document labels (e.g., authors of the documents, genres of the documents).
`topic_distributions`	list of lists of floats	Topic distribution (list of probabilities) for each document.
`topic_keys`	list of lists of strings	The 20 most probable words for each topic.
`output_path`	string	Path to where the resulting figure should be saved.
`target_labels`	list of strings	A subset of `labels` to use for plotting.
`dim`	tuple of integers	(x, y) dimensions for the resulting figure.

`divide_training_data(documents, num_chunks=10)`

Given a dataset, divides each document into a set of equally sized chunks.

Name	Type	Description
`documents`	list of strings	Documents to split.
`num_chunks`	integer	The number of chunks to split each document into.
RETURNS	tuple (list of strings, list of integers, list of floats)	The divided documents, the indices of the input documents, and the positions within the documents (0-1.0).
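
The following sketch shows one way such a split could work. It assumes chunks are cut on whitespace-separated words and that a chunk's position is its index divided by `num_chunks`. The library's exact boundaries and position scale may differ, and the function name is hypothetical.

```python
def divide_documents(documents, num_chunks=10):
    """Split each document into num_chunks word chunks of near-equal size."""
    divided_documents, document_ids, times = [], [], []
    for document_id, document in enumerate(documents):
        tokens = document.split()
        for chunk_index in range(num_chunks):
            # Integer arithmetic spreads any remainder across the chunks.
            start = len(tokens) * chunk_index // num_chunks
            end = len(tokens) * (chunk_index + 1) // num_chunks
            divided_documents.append(" ".join(tokens[start:end]))
            document_ids.append(document_id)
            times.append(chunk_index / num_chunks)
    return divided_documents, document_ids, times


chunks, ids, positions = divide_documents(["a b c d", "e f"], num_chunks=2)
print(chunks, ids, positions)
```

The returned positions can be passed as `times` to `plot_topics_over_time` to see how a topic rises and falls within documents.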

`infer_topics(path_to_mallet, path_to_original_model, path_to_new_formatted_training_data, path_to_new_topic_distributions)`

Get topic distributions for a set of new documents using a model that has been trained on another set of documents.

Name	Type	Description
`path_to_mallet`	string	Path to your local MALLET installation: .../mallet-2.0.8/bin/mallet
`path_to_original_model`	string	Path to where the topic model was stored.
`path_to_new_formatted_training_data`	string	Path to where the MALLET formatted data for the new documents is stored.
`path_to_new_topic_distributions`	string	Path to where the topic distributions should be stored.
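
Inference on new documents combines three documented calls: `import_data` with `use_pipe_from` pointing at the original formatted training data, then `infer_topics`, then `load_topic_distributions`. The sketch below wires them together. The wrapper name, its parameter names, and the lazily imported `lmw` argument are hypothetical.

```python
def infer_for_new_documents(path_to_mallet, path_to_original_model,
                            path_to_original_formatted_training_data,
                            new_documents, path_to_new_training_data,
                            path_to_new_formatted_training_data,
                            path_to_new_topic_distributions, lmw=None):
    if lmw is None:
        # Requires the little_mallet_wrapper package and a local MALLET install.
        import little_mallet_wrapper as lmw
    # Import the new documents with the same pipe as the original data.
    lmw.import_data(path_to_mallet, path_to_new_training_data,
                    path_to_new_formatted_training_data, new_documents,
                    use_pipe_from=path_to_original_formatted_training_data)
    # Apply the previously trained model to the new documents.
    lmw.infer_topics(path_to_mallet, path_to_original_model,
                     path_to_new_formatted_training_data,
                     path_to_new_topic_distributions)
    return lmw.load_topic_distributions(path_to_new_topic_distributions)
```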

`plot_topics_over_time(topic_distributions, topic_keys, times, topic_index, output_path=None)`

Creates a line plot showing the mean probability of the target topic over document segments.

Name	Type	Description
`topic_distributions`	list of lists of floats	Topic distribution (list of probabilities) for each document.
`topic_keys`	list of lists of strings	The 20 most probable words for each topic.
`times`	list of floats	The division indices within the document.
`topic_index`	integer	The index of the target topic.
`output_path`	string	Path to where the resulting figure should be saved.
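
The quantity such a plot displays, the mean probability of one topic at each position, can be computed as in the sketch below. The function name is hypothetical, and the actual plotting is left to the library.

```python
from collections import defaultdict


def mean_topic_probability_over_time(topic_distributions, times, topic_index):
    """Return (time, mean probability) pairs for the target topic, sorted by time."""
    grouped = defaultdict(list)
    for distribution, time in zip(topic_distributions, times):
        grouped[time].append(distribution[topic_index])
    return [(time, sum(values) / len(values))
            for time, values in sorted(grouped.items())]


distributions = [[0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1.0, 0.0]]
times = [0.0, 0.5, 0.0, 0.5]
print(mean_topic_probability_over_time(distributions, times, topic_index=0))
```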
