This project aims to automatically generate headlines for news articles in the DUC 2004 corpus. Its purpose was to provide hands-on experience in the Natural Language Processing course at ETH Zurich in spring 2013.
ROUGE depends on the XML::DOM Perl module. On Ubuntu, install it using
sudo apt-get install libxml-dom-perl
This project uses the Maven build system. On Ubuntu, install it using
sudo apt-get install maven
Run the eval script to generate the headlines for all 500 documents in the DUC 2004 dataset. It computes and outputs the ROUGE score at the end.
./eval
Note that this distribution may not come with the cached annotations, so the initial run may take a long time (about an hour). Subsequent runs take only 15 to 30 seconds.
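The difference between the first and later runs comes from caching expensive annotations on disk. The following is a minimal sketch of that load-or-compute pattern, not the project's actual code: the class name `AnnotationCacheSketch` and the method `loadOrCompute` are hypothetical, and Java's built-in serialization stands in for kryo so that the example has no external dependencies.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

/** Illustrative load-or-compute cache; all names here are hypothetical. */
final class AnnotationCacheSketch {

    /**
     * Returns the value stored in cacheFile if the file exists; otherwise
     * computes the value, writes it to cacheFile, and returns it.
     * Assumption: plain Java serialization is used instead of kryo.
     */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T loadOrCompute(Path cacheFile, Supplier<T> compute) {
        try {
            if (Files.exists(cacheFile)) {
                // Cache hit: deserialize the previously stored value.
                try (ObjectInputStream in =
                         new ObjectInputStream(Files.newInputStream(cacheFile))) {
                    return (T) in.readObject();
                }
            }
            // Cache miss: run the expensive computation once and store the result.
            T value = compute.get();
            try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(cacheFile))) {
                out.writeObject(value);
            }
            return value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Cached class is not on the classpath", e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path cacheFile = Files.createTempFile("annotations", ".bin");
        Files.delete(cacheFile); // start without a cache, like a fresh checkout

        // The string is a placeholder for an expensive annotation such as a parse tree.
        String first = loadOrCompute(cacheFile, () -> "annotated article");
        String second = loadOrCompute(cacheFile, () -> {
            throw new IllegalStateException("not reached: value comes from the cache");
        });
        System.out.println(first.equals(second));
        Files.delete(cacheFile);
    }
}
```

In this sketch, deleting the cache file forces the slow path again on the next run.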
The file 'rouge.model' in the root directory contains the model used for sentence compression. It has been trained on the DUC 2003 dataset using SVM regression. To generate it on your own, run the RougeScoreRegression class in the 'learning' package.
The code builds upon a range of excellent open source projects:
- Stanford CoreNLP serves as the underlying framework and takes care of tokenization, sentence splitting, part-of-speech tagging, named entity recognition and parsing
- java-xmlbuilder for generating ROUGE configuration files
- Google's Guava provides excellent additions to Java's core libraries and encourages writing robust code
- Logback is a generic logging framework and the successor to the popular log4j project
- SLF4J serves as a simple facade for various logging frameworks like Logback
- Apache Commons
- kryo is a fast and efficient Java serialization library that provides an enormous performance boost by caching the parse trees of news articles
- libsvm is a library for Support Vector Machines
- ROUGE is a software package for automated evaluation of summaries
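
To give an intuition for what the ROUGE score measures, the following is a simplified sketch of ROUGE-1 recall: the fraction of reference unigrams that also occur in the generated headline, with counts clipped. It is an illustration only, not the ROUGE package used by the eval script. The class `Rouge1Sketch`, its methods, and the example sentences are hypothetical, and tokenization is reduced to lowercasing and whitespace splitting. The real package offers further options, such as stemming and multiple reference summaries.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Simplified ROUGE-1 recall; an illustrative sketch, not the ROUGE package. */
final class Rouge1Sketch {

    /** Counts lowercased, whitespace-separated tokens (a deliberate simplification). */
    static Map<String, Integer> unigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Clipped unigram overlap divided by the number of unigrams in the reference. */
    static double recall(String candidate, String reference) {
        Map<String, Integer> candidateCounts = unigramCounts(candidate);
        Map<String, Integer> referenceCounts = unigramCounts(reference);
        int overlap = 0;
        int total = 0;
        for (Map.Entry<String, Integer> entry : referenceCounts.entrySet()) {
            total += entry.getValue();
            // A reference word is matched at most as often as it appears in the candidate.
            overlap += Math.min(entry.getValue(),
                                candidateCounts.getOrDefault(entry.getKey(), 0));
        }
        return total == 0 ? 0.0 : (double) overlap / total;
    }

    public static void main(String[] args) {
        // Made-up example sentences, not taken from the DUC data.
        String reference = "police arrest suspect in bank robbery";
        String headline = "police arrest robbery suspect";
        System.out.println(recall(headline, reference));
    }
}
```

Here four of the six reference words appear in the headline, so the recall is 4/6.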