TextSummarization
Latent Dirichlet Allocation
Scores:
ROUGE | Precision | Recall | F-Measure |
---|---|---|---|
ROUGE-2 | 0.48919 | 0.11052 | 0.18030 |
ROUGE-L | 0.93560 | 0.21138 | 0.34485 |
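The F-Measure column is consistent with the balanced F1 score, i.e. the harmonic mean of precision and recall. The check below is a minimal sketch and not part of this repository: the function name `f_measure` is made up for illustration, and it assumes precision and recall are weighted equally (beta = 1).

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumes beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the Scores table.
scores = {"ROUGE-2": (0.48919, 0.11052), "ROUGE-L": (0.93560, 0.21138)}
for name, (p, r) in scores.items():
    print(f"{name}: {f_measure(p, r):.5f}")
# ROUGE-2: 0.18030
# ROUGE-L: 0.34485
```

Because the harmonic mean is dominated by the smaller of its two inputs, the low recall keeps the F-measure well below the precision in both rows.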
Data
Sample data can be found in ./example_data
Installation
Tested with Python 3.8. It is recommended to create a virtual environment and install the dependencies in it:
python3.8 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Running
The main script is example.py. You can run it simply by typing:
python example.py
It should print the following:
TopicModel::Init
Done
Loading regulations and comments..
Done
Building TopicModel
number of low frequency tokens pruned = 11,980
min_word_count = 20, top_most_common_words = 10
number of high frequency tokens pruned = 10
tokens = 3,400 rows
text pre-processing is complete
computing LDA...
computing dominant topics...
Done
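The pruning counts in this log suggest a vocabulary filter driven by `min_word_count` and `top_most_common_words`. The repository's actual implementation is not shown here, so the following is only a sketch of one plausible reading: the function name `prune_vocabulary` is hypothetical, tokens seen fewer than `min_word_count` times are assumed to be dropped first, and the `top_most_common_words` most frequent remaining tokens are dropped afterwards.

```python
from collections import Counter


def prune_vocabulary(docs, min_word_count=20, top_most_common_words=10):
    """Drop rare and very common tokens from tokenized documents.

    docs is a list of token lists. Returns (pruned_docs, low_freq, high_freq).
    The thresholds mirror the names printed in the log; the exact rules
    (strict '<' comparison, order of the two steps) are assumptions.
    """
    counts = Counter(token for doc in docs for token in doc)

    # Step 1: tokens that occur fewer than min_word_count times.
    low_freq = {tok for tok, c in counts.items() if c < min_word_count}

    # Step 2: the most common tokens among those that survived step 1.
    remaining = Counter({t: c for t, c in counts.items() if t not in low_freq})
    high_freq = {tok for tok, _ in remaining.most_common(top_most_common_words)}

    drop = low_freq | high_freq
    pruned_docs = [[t for t in doc if t not in drop] for doc in docs]
    return pruned_docs, low_freq, high_freq


# Tiny illustrative corpus (made up, not from ./example_data).
docs = [["dolphin", "trainer", "the"], ["the", "dolphin", "the"], ["the", "rare"]]
pruned, low, high = prune_vocabulary(docs, min_word_count=2, top_most_common_words=1)
print(pruned)        # [['dolphin'], ['dolphin'], []]
print(sorted(low))   # ['rare', 'trainer']
print(sorted(high))  # ['the']
```

Removing the most frequent tokens acts as a corpus-specific stop-word list, while the minimum count keeps one-off tokens from inflating the vocabulary that LDA has to model.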
After training, the model is used to summarize documents. Here is one example:
I love Sea World, my wife a 3 kids love seeing sea animals up close. I would love to continue going to sea world and watch them change the sea. We should continue to support them and their efforts to preserve and protect marine mammals. Please have the regulations based on science so future generations can love and learn about our world as we did.
Topic 0
The top 10 terms and corresponding weights are:
* dolphin (0.0269)
* trainer (0.0159)
* time (0.0095)
* facility (0.0087)
* wa (0.0083)
* one (0.0078)
* many (0.0075)
* environment (0.0072)
* life (0.0066)
* experience (0.0065)
The default number of topics is 15. To change it, edit the following variable in example.py:
num_topics = 15