
Latent Dirichlet Allocation



Metric   Precision  Recall   F-Measure
ROUGE-2  0.48919    0.11052  0.18030
ROUGE-L  0.93560    0.21138  0.34485
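
The F-Measure column is consistent with the harmonic mean of precision and recall (the balanced F1 form). The snippet below is a minimal check under that assumption; `f_measure` is a name introduced here for illustration and is not taken from the repository.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F1, assumed here)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from the table above.
rouge_2 = f_measure(0.48919, 0.11052)
rouge_l = f_measure(0.93560, 0.21138)
print(f"ROUGE-2 F = {rouge_2:.4f}")
print(f"ROUGE-L F = {rouge_l:.4f}")
```

Both results agree with the reported F-Measure values up to rounding.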


Sample data can be found in ./example_data


The code was tested with Python 3.8. It is recommended to create a virtual environment and install the dependencies in it:

python3.8 -m venv venv
source venv/bin/activate
pip install -r requirements.txt


The main script is example.py. You can run it simply by typing:

python example.py

It should print the following:

Loading regulations and comments..
Building TopicModel

number of low frequency tokens pruned = 11,980
min_word_count = 20, top_most_common_words = 10
number of high frequency tokens pruned = 10
tokens = 3,400 rows
text pre-processing is complete

computing LDA...
computing dominant topics...
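
The pruning lines in this log suggest that rare tokens and the most frequent tokens are removed before LDA is fitted. The sketch below shows one way such a step could work. It is an assumption, not the repository's code: `prune_vocabulary` is a hypothetical helper, its parameter names simply mirror the log, and the exact threshold rule (strictly fewer than `min_word_count`) is a guess.

```python
from collections import Counter


def prune_vocabulary(docs, min_word_count=20, top_most_common_words=10):
    """Drop rare tokens and the most frequent tokens from tokenized docs.

    Hypothetical helper: the parameter names mirror the log output, but the
    repository's real pruning logic may differ.
    """
    counts = Counter(token for doc in docs for token in doc)
    # Tokens seen fewer than min_word_count times are treated as low frequency.
    low = {tok for tok, n in counts.items() if n < min_word_count}
    # The N most common tokens overall are treated as high frequency.
    high = {tok for tok, _ in counts.most_common(top_most_common_words)}
    drop = low | high
    pruned = [[tok for tok in doc if tok not in drop] for doc in docs]
    return pruned, len(low), len(high)


# Tiny made-up corpus, with thresholds scaled down to match its size.
docs = [
    ["the", "dolphin", "show"],
    ["the", "dolphin", "trainer"],
    ["the", "trainer", "rare"],
]
pruned, n_low, n_high = prune_vocabulary(
    docs, min_word_count=2, top_most_common_words=1
)
print(pruned)
print(f"number of low frequency tokens pruned = {n_low:,}")
print(f"number of high frequency tokens pruned = {n_high:,}")
```

On this toy corpus, "show" and "rare" fall below the count threshold and "the" is the single most common token, so only "dolphin" and "trainer" survive.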

After this, the trained model is used to summarize documents. Here is one example:

I love Sea World, my wife a 3 kids love seeing sea animals up close. I would love to continue going to sea world and watch them change the sea. We should continue to support them and their efforts to preserve and protect marine mammals. Please have the regulations based on science so future generations can love and learn about our world as we did.

Topic 0
The top 10 terms and corresponding weights are:
 * dolphin (0.0269)
 * trainer (0.0159)
 * time (0.0095)
 * facility (0.0087)
 * wa (0.0083)
 * one (0.0078)
 * many (0.0075)
 * environment (0.0072)
 * life (0.0066)
 * experience (0.0065)
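
The README does not say how the topic shown for a document is chosen. A common approach, assumed here, is to take the topic with the largest weight in the document's topic distribution and list that topic's highest-weighted terms. In the sketch below, `dominant_topic` and `top_terms` are hypothetical helpers, and all numbers except the three topic 0 weights are made up for illustration.

```python
def dominant_topic(doc_topic_weights):
    """Return the index of the topic with the largest weight for one document."""
    return max(range(len(doc_topic_weights)), key=lambda k: doc_topic_weights[k])


def top_terms(topic_term_weights, n=10):
    """Return the n highest-weighted (term, weight) pairs for one topic."""
    ranked = sorted(topic_term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Made-up document-topic distribution for illustration only.
doc_topics = [0.70, 0.20, 0.10]
# Topic 0 weights are copied from the listing above; topics 1 and 2 are invented.
topics = [
    {"time": 0.0095, "dolphin": 0.0269, "trainer": 0.0159},
    {"regulation": 0.0200, "science": 0.0100},
    {"animal": 0.0300, "park": 0.0100},
]

k = dominant_topic(doc_topics)
print(f"Topic {k}")
for term, weight in top_terms(topics[k], n=3):
    print(f" * {term} ({weight:.4f})")
```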

The default number of topics is 15. You can change this in example.py by modifying the following variable:

num_topics = 15