Author: Tatjana Chernenko, 2018
This project focuses on enhancing the LexRank-based text summarization system by incorporating semantic similarity measures from the ECNU system.
Inspiration:
- G. Erkan, D. R. Radev. (2004). LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
- Junfeng Tian et al. (2017). ECNU at SemEval-2017 Task 1: Leverage Kernel-based Traditional NLP features and Neural Networks to Build a Universal Model for Multilingual and Cross-lingual Semantic Textual Similarity
Results: paper
The LexRank-based text summarization system employs a stochastic graph-based method to compute the relative importance of textual units for extractive multi-document text summarization. The baseline implementation uses cosine similarity between sentences as its similarity measure: a connectivity matrix based on the cosine similarity of each sentence pair serves as the adjacency matrix of the graph representation of sentences.
The objective is to explore the impact of replacing cosine similarity with a combination of features from the ECNU system, known for its semantic similarity measure. This modification aims to improve the summarization effectiveness of the LexRank approach.
LexRank Implementation:
- Sentence representation using TF·IDF metrics
- Cosine similarity matrix generation
- Sentence centrality computation
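The three steps above can be sketched in plain Python. This is a minimal illustration rather than the code in this repository: the function names, the smoothed IDF formula, the threshold of 0.1, and the damping factor of 0.85 are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(sentences):
    """Return one sparse TF-IDF vector (a dict) per tokenized sentence."""
    n = len(sentences)
    doc_freq = Counter(word for sent in sentences for word in set(sent))
    # Assumption: a smoothed IDF so that no term gets a zero weight.
    idf = {word: math.log(n / df) + 1.0 for word, df in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in Counter(sent).items()} for sent in sentences]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lexrank(sentences, threshold=0.1, damping=0.85, iterations=100):
    """Score sentences by power iteration over a thresholded similarity graph."""
    vectors = tf_idf_vectors(sentences)
    n = len(vectors)
    # Adjacency matrix: an edge wherever the cosine similarity exceeds the threshold.
    adj = [[1.0 if cosine(vectors[i], vectors[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    # Row-normalize so that each row is a probability distribution.
    for row in adj:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1.0 - damping) / n
                  + damping * sum(adj[j][i] * scores[j] for j in range(n))
                  for i in range(n)]
    return scores
```

A summary is then formed by taking the highest-scoring sentences, for example the top three for the short summaries evaluated below.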
Implementation of ECNU Features and Learning Algorithms:
- Traditional NLP Module extraction
- Deep Learning Module training
- Ensemble Module for final score aggregation
Modified LexRank (LexRank with ECNU):
- Replacing cosine similarity with ECNU features and learning algorithm
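One way to picture this replacement is a similarity matrix built from any pairwise scorer instead of cosine similarity. The sketch below is an assumption about the general shape of the idea, not the repository's code: `similarity_matrix`, `toy_scorer`, and the rescaling of scores from the 0-5 STS range to [0, 1] are hypothetical names and choices made for illustration.

```python
def similarity_matrix(sentences, score_pair, max_score=5.0):
    """Build a symmetric similarity matrix from an arbitrary pairwise scorer.

    score_pair(a, b) is assumed to return an STS-style score in [0, max_score].
    """
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0
        for j in range(i + 1, n):
            # Rescale to [0, 1] and clip, since a regressor may overshoot the range.
            score = score_pair(sentences[i], sentences[j]) / max_score
            score = min(max(score, 0.0), 1.0)
            matrix[i][j] = matrix[j][i] = score
    return matrix


def toy_scorer(a, b):
    """Hypothetical stand-in for a trained model: Jaccard word overlap scaled to 0-5."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return 5.0 * len(set_a & set_b) / len(union) if union else 0.0
```

Scoring each unordered pair only once halves the number of model calls, which matters when the scorer is expensive.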
The evaluation is conducted using the DUC2003 summarization evaluations dataset. Performance is measured using ROUGE (Recall-Oriented Understudy for Gisting Evaluation), providing insights into the summarization quality.
Metric |
---|
ROUGE-1 Recall |
ROUGE-1 Precision |
ROUGE-1 F-score |
ROUGE-su4 Recall |
ROUGE-su4 Precision |
ROUGE-su4 F-score |
Both LexRank and Modified LexRank achieve competitive ROUGE scores (61.7% ROUGE-su4 on three multi-document clusters with short summaries of three sentences), indicating good summarization capability. The modified version, which uses ECNU features, reaches an average ROUGE-su4 of 62.87%. The modified system shows potential for further improvement, in particular through additional experiments, a higher sentence limit, and the inclusion of a reranker module.
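For readers unfamiliar with the metrics, ROUGE-1 compares the unigrams of a system summary with those of a reference summary. The following is a simplified sketch of that computation under the assumption of a single reference and whitespace tokenization; it is not the evaluation script used by this project, and `rouge_1` is a hypothetical helper name. ROUGE-su4 follows the same recall/precision/F-score pattern but counts unigrams together with skip-bigrams that have a gap of at most four words.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (recall, precision, f_score) for unigram overlap.

    Both arguments are whitespace-tokenized strings; matches are clipped,
    so a word counts at most as often as it occurs in the reference.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    if recall + precision == 0.0:
        return 0.0, 0.0, 0.0
    f_score = 2.0 * recall * precision / (recall + precision)
    return recall, precision, f_score
```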
Requirements: Python 3
Get a copy of the project on your local machine for development and testing purposes:
git clone https://gitlab.cl.uni-heidelberg.de/chernenko/automatic_textsummarization_ws_17_18.git
cd automatic_text_summarization
Download Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP/ and put the unzipped folder into the data folder.
sh download.sh
pip install -U scikit-learn
pip install pyprind
pip install python-jsonrpc
pip install gensim
pip install selenium
pip install nltk
pip install lxml
pip install breadability
pip install beautifulsoup4
- Launch the Stanford CoreNLP server:
cd ./data/stanford-corenlp-full-2015-12-09/stanford-corenlp-full-2015-12-09
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
After this, the Stanford CoreNLP server is available at http://localhost:9000/
Run all subsequent commands in a new terminal window.
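To check from Python that the server responds, a request along the following lines can be used. This is a hedged sketch based on the CoreNLP server's HTTP interface (text in the POST body, settings in a `properties` query parameter); the helper names `corenlp_url` and `annotate` and the chosen annotators are assumptions for illustration, and the project's own scripts may talk to the server differently.

```python
import json
import urllib.parse
import urllib.request


def corenlp_url(base="http://localhost:9000", annotators="tokenize,ssplit,pos,lemma"):
    """Build the request URL with the annotator settings as a JSON query parameter."""
    props = {"annotators": annotators, "outputFormat": "json"}
    return base + "/?" + urllib.parse.urlencode({"properties": json.dumps(props)})


def annotate(text, base="http://localhost:9000"):
    """Send text to a running CoreNLP server and return the parsed JSON reply."""
    request = urllib.request.Request(corenlp_url(base), data=text.encode("utf-8"))
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `annotate("LexRank ranks sentences.")` only works while the server started above is running.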
- A. You can try a standard implementation of LexRank on demo data (one DUC2003 Task 2 multi-document cluster, compressed into one file).
After installing the program (Section 1.1 and 1.2 (automatic) or 1.3 (manual)), run the following commands in your terminal (we assume that you are in the automatic_text_summarization folder):
cd demo
sh demo.sh
You will find the output summary in the ./demo folder.
- B. You can also try a standard version of LexRank on web pages:
cd demo
python lex_rank_demo.py my-lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization
- C. If you want to test the modified version of LexRank, please note that it takes 6-7 hours for just one of the three document clusters. Please follow the instructions in Section 3, in the part on launching the improved version of multi-document LexRank.
- Get and preprocess the test data from the Task 2 document set of DUC2003:
cd ../ # home directory of the project
cd lib
mkdir test_data
cd ../bin
python get_test_data.py
Open the Stanford CoreNLP server in your browser. Before doing so, please check that you have run sh download.sh (Section 1.2, automatic installation)
or have launched Stanford CoreNLP manually (Section 1.3, manual installation):
http://localhost:9000/
- Launch the standard version of multi-document LexRank without re-ranker (length of output files = 3 sentences).
(Find the output summaries in ./output/summaries_standard)
In a new terminal window run:
cd ..
mkdir output
cd output
mkdir summaries_standard
cd ../
sh shell_standard.sh
- Launch the improved version of multi-document LexRank without re-ranker (length of output files = 3 sentences):
(Find the output summaries in ./output/summaries_improved)
cd output
mkdir summaries_improved
cd ../
sh shell_en.sh
- Evaluate the output:
cd ./bin
python pyrouge.py
- NOTE: The standard version of LexRank takes a few minutes to summarize all the test data (texts from Task 2 of DUC2003).
The improved version of LexRank takes at least 6 hours (depending on your system) to summarize just one of these texts.
The project provides all the summaries of the standard LexRank for the test data and summaries of three texts (30020, 30024 and 31010) of the improved LexRank, which are used for the evaluation.
The project uses the Gradient Boosting Regression learning algorithm and two features of the ECNU system: WeightednGramOverlapFeature(type='lemma') and BOWFeature(stopwords=False).
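To give an intuition for what an n-gram overlap feature measures, the sketch below computes a Dice-style overlap between the n-gram sets of two token lists, with an optional weight per n-gram. It is an illustrative assumption and not the WeightednGramOverlapFeature implementation from this project; the function names and the default uniform weighting are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def weighted_ngram_overlap(tokens_a, tokens_b, n=1, weight=None):
    """Dice-style overlap of two n-gram sets, optionally weighted per n-gram.

    weight is a callable mapping an n-gram to a non-negative number, for
    example an IDF-like score; by default every n-gram counts as 1.0.
    """
    if weight is None:
        weight = lambda gram: 1.0
    grams_a, grams_b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    shared = sum(weight(g) for g in grams_a & grams_b)
    total = sum(weight(g) for g in grams_a) + sum(weight(g) for g in grams_b)
    return 2.0 * shared / total if total else 0.0
```

Feature values of this kind, computed on lemmas for each sentence pair, would form the input vector of the regression model that predicts the similarity score.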
If you want to add other features: this project provides a basic architecture for adding other modules, features and learning algorithms of the ECNU system in the future; these are currently under construction.
For future:
Open the ./sts_model.py file, choose some features or feature combinations, and add them to the class LexRankSummarizer(AbstractSummarizer) in ./bin/lex_rank_en.py (line 176 in lex_rank_en.py).
Note: This could increase the running time of the program.
If you have any issues with opening the files, check the format of the path strings (it depends on your system) in the *.py files in the bin folder.
If you have any issues with the datasets in the data directory, you can download them yourself:
- download the STSBenchmark Dataset:
mkdir data
cd data
Download the data from
http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz
Unpack the file Stsbenchmark.tar.gz
Remove Stsbenchmark.tar.gz
- download the stanford CoreNLP 3.6.0
http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip
Then unzip stanford-corenlp-full-2015-12-09.zip
- Tatjana Chernenko
This project is licensed under the GNU General Public License v3.0 - see the LICENSE.md file for details
Inspiration:
Semantic Textual Similarity (STS)
For further details, including implementation specifics, evaluation results, and analysis, refer to the project documentation and associated research papers.