NLP-Extractive-NEWS-summarization-using-MMR

A simple Python implementation of the Maximal Marginal Relevance (MMR) baseline system for text summarization.


Comparison of MMR and LexRank Automatic Text Summarization approaches

Automatic summarization techniques condense one or more documents into a shorter summary that conveys the most important information from the original text. Multi-document summarization in particular extracts a summary from multiple documents written about the same topic. Here we implement and compare two commonly used techniques for multi-document automatic text summarization: Maximal Marginal Relevance (MMR) and LexRank.
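To make the first technique concrete: MMR builds a summary greedily, at each step picking the sentence that is most relevant to a query while being least similar to the sentences already chosen. The block below is a minimal sketch under stated assumptions, not the code in mmr_summarizer.py. The names tokenize, cosine and mmr_select, the plain term-frequency cosine similarity, the query string and the default lam=0.5 are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a sentence and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(sentences, query, k=2, lam=0.5):
    """Greedily pick k sentences, trading relevance against redundancy.

    Assumption: `query` is any text describing the topic, and `lam`
    weights relevance (lam) against redundancy (1 - lam).
    """
    vectors = [Counter(tokenize(s)) for s in sentences]
    query_vec = Counter(tokenize(query))
    selected = []
    candidates = list(range(len(sentences)))

    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vectors[i], query_vec)
            # Similarity to the closest sentence already in the summary.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return [sentences[i] for i in selected]
```

With lam=1.0 the redundancy term vanishes and the selection reduces to ranking sentences by relevance alone; lower values push the summary toward more diverse sentences.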

Implementation details

We have implemented both the MMR and LexRank algorithms in Python. For evaluation we used the DUC2004 data corpus, which contains two sets of documents.

  • Documents/clusters: contains 50 different topics, each containing 10 news articles on average.

  • Manual summaries: manually created summary for each of the 50 topics.
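Because each topic is a cluster of related articles, LexRank can rank the sentences pooled from a cluster by how central they are in a sentence-similarity graph, using a PageRank-style power iteration. The following is a rough sketch of that idea, not the code in LexRank.py. It makes a simplifying assumption: it uses plain term-frequency cosine similarity rather than the idf-modified cosine of the original LexRank formulation. The function names and the default threshold, damping and iteration count are hypothetical.

```python
import math
import re
from collections import Counter


def tf_vector(sentence):
    """Term-frequency vector of a sentence as a Counter."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Return one centrality score per sentence; the scores sum to 1."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = [tf_vector(s) for s in sentences]

    # Unweighted graph: connect two distinct sentences when their
    # similarity exceeds the threshold.
    adjacency = [
        [1.0 if i != j and cosine(vectors[i], vectors[j]) > threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degrees = [sum(row) for row in adjacency]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if degrees[j] > 0:
                    incoming += adjacency[j][i] / degrees[j] * scores[j]
                else:
                    # A sentence with no neighbours spreads its score evenly.
                    incoming += scores[j] / n
            updated.append((1.0 - damping) / n + damping * incoming)
        scores = updated
    return scores
```

A summary would then be assembled from the highest-scoring sentences, up to the desired length.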

The generated summaries are evaluated against the human summaries using the ROUGE toolkit. The ROUGE scores make it possible to compare the quality of the individual summarization systems. In addition, we analyzed how similar the summaries generated by the two systems are to each other by calculating the Jaccard coefficient for sentence-level and word-level overlap.
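The Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. As a minimal sketch of the overlap analysis, and not the code in jaccardScore.py, the block below computes it over the word sets and the sentence sets of two summaries. The function names, the regular-expression tokenizer and the punctuation-based sentence splitter are assumptions made for illustration.

```python
import re


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets have zero overlap.
        return 0.0
    return len(a & b) / len(union)


def words(text):
    """Lowercased alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence split on '.', '!' and '?', normalizing case and spaces."""
    parts = re.split(r"[.!?]+", text)
    return [" ".join(p.lower().split()) for p in parts if p.strip()]


def word_overlap(summary_a, summary_b):
    """Word-level Jaccard coefficient of two summaries."""
    return jaccard(words(summary_a), words(summary_b))


def sentence_overlap(summary_a, summary_b):
    """Sentence-level Jaccard coefficient of two summaries."""
    return jaccard(split_sentences(summary_a), split_sentences(summary_b))
```

For two extractive systems drawing sentences from the same source documents, the sentence-level score counts exact sentence matches, while the word-level score also credits partially overlapping content.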

System/software requirements

We implemented both MMR and LexRank and ran the evaluations (both ROUGE and Jaccard) on Ubuntu 14.04. The following packages were installed on Ubuntu as part of this setup:

Files and Folders

| File/Folder | Description |
| --- | --- |
| root | root folder of the project, containing all required files and folders |
| Documents | news articles relating to the 50 topics (each topic containing 10 articles) |
| Humman_Summaries | human summaries used to evaluate the quality of the system-generated summaries |
| Lexrank_results | folder that holds the system-generated summaries of LexRank |
| MMR_results | folder that holds the system-generated summaries of MMR |
| LexRank.py | LexRank summarizer implementation |
| mmr_summarizer.py | MMR summarizer implementation |
| sentence.py | sentence class for modelling sentences in the document cluster |
| jaccardScore.py | generates the Jaccard coefficient at word and sentence level |
| test_pyrouge.py | generates the ROUGE scores for the system summaries |

How to run:

  • To generate the MMR system summaries, run mmr_summarizer.py. The results are written to the MMR_results folder.

  • To generate the LexRank system summaries, run LexRank.py. The results are written to the Lexrank_results folder.

  • To generate the ROUGE scores, run test_pyrouge.py. The results are displayed in the terminal.

  • To generate the Jaccard coefficient scores, run jaccardScore.py. Both word-level and sentence-level scores are displayed on the screen.
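The steps above could be chained with a small driver script. This is only a sketch under assumptions the README does not confirm: that each script is run from the project root without command-line arguments, that the order shown is the intended one, and that the interpreter passed as python is one the scripts are compatible with. PIPELINE, build_commands and run_pipeline are hypothetical names.

```python
import subprocess
import sys

# Script names come from the file list; the order is an assumption.
PIPELINE = [
    "mmr_summarizer.py",  # writes summaries to MMR_results
    "LexRank.py",         # writes summaries to Lexrank_results
    "test_pyrouge.py",    # prints the ROUGE scores
    "jaccardScore.py",    # prints the Jaccard scores
]


def build_commands(python=sys.executable, scripts=PIPELINE):
    """Return one argument list per script, e.g. ['python', 'LexRank.py']."""
    return [[python, script] for script in scripts]


def run_pipeline(python=sys.executable):
    """Run every script in order, stopping at the first failure."""
    for command in build_commands(python):
        subprocess.run(command, check=True)
```

Calling run_pipeline() from the project root would execute the four scripts in sequence; passing a different interpreter path as python is one way to handle scripts that target another Python version.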

NOTE:

The DUC2004 documents are not included in this repository. These documents can be obtained from here.

This work was done as part of the CAP6640: Natural Language Processing course at UCF in Spring 2016 along with Amar Nair and Syed Ahmed.