In this demo paper, we present Sherlock, a new system for interactive text summarization. Automatically producing textual summaries is an important step toward understanding a collection of topic-related documents, with many real-world applications in journalism, medicine, and other domains. However, none of the existing summarization systems let users provide feedback at interactive speed. We therefore integrate a new approximate summarization model into Sherlock that guarantees interactive speeds even for large text collections and thus keeps the user engaged in the process.
- Online demo: http://sherlock.ukp.informatik.tu-darmstadt.de/
- Video: https://vimeo.com/257601765
If you reuse this software, please use the following citation:
@INPROCEEDINGS{PVS:2018a,
author = {P.V.S., Avinesh and Hättasch, Benjamin and Özyurt, Orkan and Binnig, Carsten and Meyer, Christian M.},
title = {{Sherlock: A System for Interactive Summarization of Large Text Collections}},
booktitle = {Proceedings of the VLDB Endowment},
pages = {1902--1905},
volume = {11},
number = {12},
month = {August},
year = {2018},
location = {Rio de Janeiro, Brazil},
language = {English},
doi = {10.14778/3229863.3236220},
pdf = {http://www.vldb.org/pvldb/vol11/p1902-p.v.s..pdf},
url = {https://github.com/AIPHES/vldb2018-sherlock/}
}
Contact persons:
- Avinesh P.V.S., first_name AT aiphes.tu-darmstadt.de
- Benjamin Haettasch, last_name AT aiphes.tu-darmstadt.de
- https://www.aiphes.tu-darmstadt.de/
- https://www.tu-darmstadt.de/
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
- python >= 2.7 (tested with 2.7.6)
- jdk 8
- The JAVA_HOME environment variable has to be set
- Install Anaconda
- Install the requirements
  pip install -r requirements.txt
- Install GLPK and CPLEX for PuLP (Python Integer Linear Programming package)
  - Install GLPK
    sudo apt-get install libglpk-dev
  - Install CPLEX for PuLP
    cd ukpsummarizer-be/cplex/cplex/python/2.7/x86-64_linux/
    python setup.py install
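To check that PuLP can actually reach a solver after this step, a small test like the following can help. It is a minimal sketch and not part of the Sherlock code base; it only uses PuLP's standard GLPK_CMD and CPLEX_PY solver interfaces.

    # Minimal sanity check that PuLP can use GLPK and/or CPLEX (sketch, not part of Sherlock).
    import pulp

    # A tiny test ILP: maximize x + y subject to x + 2y <= 4.
    prob = pulp.LpProblem("solver_check", pulp.LpMaximize)
    x = pulp.LpVariable("x", lowBound=0, cat="Integer")
    y = pulp.LpVariable("y", lowBound=0, cat="Integer")
    prob += x + y
    prob += x + 2 * y <= 4

    for solver in (pulp.GLPK_CMD(msg=0), pulp.CPLEX_PY(msg=0)):
        name = solver.__class__.__name__
        if solver.available():
            prob.solve(solver)
            print("%s: %s" % (name, pulp.LpStatus[prob.status]))
        else:
            print("%s is not available" % name)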
- Perl dependencies for ROUGE:
  - LOCAL::LIB
    sudo apt-get install liblocal-lib-perl
  - XML::DOM
    sudo apt-get install libxml-dom-perl
  - libexpat
    sudo apt-get install libexpat-dev
  - XML::Parser
    sudo apt-get install libxml-parser-perl
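A quick way to verify that the ROUGE-related Perl modules load correctly is to try importing them from the command line; the short sketch below does this via Python's subprocess module (the module names are the ones listed above).

    # Check that the Perl modules needed by ROUGE can be loaded (sketch).
    import subprocess

    for module in ("XML::DOM", "XML::Parser"):
        ret = subprocess.call(["perl", "-M" + module, "-e", "1"])
        print("%s: %s" % (module, "ok" if ret == 0 else "MISSING"))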
- Set Bash as the default shell (required for source to work):
  sudo dpkg-reconfigure dash
Set up the sample data on your system:
cp -r data ~/.ukpsummarizer/datasets
The build produces a dist/ukpsummarizer-dist-bin.tar file, which is a standalone bundle.
./mvnw clean install
./mvnw -pl ukpsummarizer-server spring-boot:run
Alternatively:
tar -xvf dist/ukpsummarizer-dist-bin.tar
java -jar ukpsummarizer-server.jar
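Once the server is running, a quick smoke test like the one below can confirm that it is reachable. This is only a sketch: it assumes the Spring Boot default port 8080, so adjust the URL if your configuration uses a different port.

    # Smoke test for the running ukpsummarizer-server (sketch).
    # Assumes the Spring Boot default port 8080; adjust if your setup differs.
    import urllib2  # on Python 3, use urllib.request instead

    try:
        resp = urllib2.urlopen("http://localhost:8080/", timeout=5)
        print("server responded with HTTP %d" % resp.getcode())
    except Exception as exc:
        print("server not reachable: %s" % exc)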
- Create an io directory, preferably in ~/.ukpsummarizer, which has the following layout:
  +--+cache/
  +--+datasets/
  | +--+raw/
  | +--+processed/
  | +--+DUC2006/
  | | +--+D0601A/
  | | | +--+docs/
  | | | +--+docs.parsed/
  | | | +--+summaries/
  | | | +--+summaries.parsed/
  | | | +--+summaries.upperbound/
  | | +--+task.json
  | | +--+...
  | +--+ ...
  +--+embeddings/
  | +--+english/
  | | +--+GoogleNews-vectors-negative300.bin
  | | +--+data/
  | +--+german/
  |   +--+2014_tudarmstadt_german_50mincount.vec
  ...
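The skeleton can also be created programmatically. The following is a small sketch that only creates the empty top-level directories; the dataset and embedding contents are added in the later steps.

    # Create the empty io directory skeleton (sketch; contents are added later).
    import os

    io_base = os.path.expanduser("~/.ukpsummarizer")
    for sub in ("cache", "datasets/raw", "datasets/processed",
                "embeddings/english", "embeddings/german"):
        path = os.path.join(io_base, sub)
        if not os.path.isdir(path):
            os.makedirs(path)
    print("io directory skeleton created under %s" % io_base)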
- Download and add the word embeddings into the ~/.ukpsummarizer/embeddings directory.
  Download the Google embeddings (English) from here:
  >> mkdir -p ~/.ukpsummarizer/embeddings/english
  >> mv GoogleNews-vectors-negative300.bin.gz ~/.ukpsummarizer/embeddings/english
  Download the News and Wikipedia embeddings (German) from here:
  >> mkdir -p ~/.ukpsummarizer/embeddings/german
  >> mv 2014_tudarmstadt_german_50mincount.vec ~/.ukpsummarizer/embeddings/german
  Download and install the GloVe embeddings from here:
  >> mkdir -p ~/.ukpsummarizer/embeddings/english/glove
  >> mv *.txt.w2v ~/.ukpsummarizer/embeddings/english/glove
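To verify that the embeddings were placed correctly, they can be loaded with gensim (install it with pip install gensim if it is not already pulled in by requirements.txt). This is only a sketch written against the gensim 3.x word2vec API; adjust the filename if you unpacked the .gz archive.

    # Load the English Google News vectors to check the download (sketch, gensim 3.x API).
    import os
    from gensim.models import KeyedVectors

    emb_path = os.path.expanduser(
        "~/.ukpsummarizer/embeddings/english/GoogleNews-vectors-negative300.bin.gz")
    vectors = KeyedVectors.load_word2vec_format(emb_path, binary=True)  # reads .bin or .bin.gz
    print("vocabulary size: %d" % len(vectors.vocab))
    print(vectors.most_similar("summary", topn=3))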
- Make sure that you have the raw datasets available. Each raw dataset needs to be extracted and has to follow this directory structure:
  +--+DUC2006
     +--+docs
     |  +-+D0601A
     |  +-+ many files
     |  +-+D0650E
     +--+models
     |  +-+ many files
     +--+topics.xml
- Before running the pipeline, you have to preprocess the raw datasets using the make_data.py script:
  python ukpsummarizer-be/data_processer/make_data.py -d DUC2006 -p ~/.ukpsummarizer/datasets/raw -a parse -l english
  python ukpsummarizer-be/data_processer/make_data.py -d DUC2004 -p ~/.ukpsummarizer/datasets/raw -a parse -l english
  python ukpsummarizer-be/data_processer/make_data.py -d TEST -p ~/.ukpsummarizer/datasets/raw -a parse -l english
  python ukpsummarizer-be/data_processer/make_data.py -d DBS -p ~/.ukpsummarizer/datasets/raw -a parse -l german
  The results should then be copied into a directory. We recommend using the --iobasedir argument to set this directory:
  +--+cache/
  +--+datasets/
  | +--+raw/
  | +--+processed/
  | +--+DUC2006/
  | | +--+D0601A/
  | | | +--+docs/
  | | | +--+docs.parsed/
  | | | +--+summaries/
  | | | +--+summaries.parsed/
  | | | +--+summaries.upperbound/
  | | | +--+task.json
  | | +--+...
  | +--+ ...
  +--+embeddings/
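If you preprocess several datasets regularly, the four commands above can be wrapped in a small driver script. The sketch below simply shells out to make_data.py with the same arguments.

    # Run make_data.py for all datasets listed above (sketch).
    import os
    import subprocess

    raw_dir = os.path.expanduser("~/.ukpsummarizer/datasets/raw")
    datasets = [("DUC2006", "english"), ("DUC2004", "english"),
                ("TEST", "english"), ("DBS", "german")]

    for name, language in datasets:
        cmd = ["python", "ukpsummarizer-be/data_processer/make_data.py",
               "-d", name, "-p", raw_dir, "-a", "parse", "-l", language]
        print("running: %s" % " ".join(cmd))
        subprocess.check_call(cmd)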
Setup on Windows (verified by one user):
- Download and install Anaconda2 (Python 2.7.12, 64-bit) from https://www.continuum.io/downloads#windows, e.g. https://repo.continuum.io/archive/Anaconda2-4.2.0-Windows-x86_64.exe
  - Take care that it is NOT Python 2.7.13, as that version contains a regression bug which breaks PuLP:
    TypeError: LoadLibrary() argument 1 must be string, not unicode
- Download and install Strawberry Perl 64-bit. In my case, Strawberry Perl (5.24.0.1-64bit).
- Download and install Eclipse Neon.2
- Download and install Eclipse PyDev
- Install the Perl module XML::DOM
- Install the Python modules:
  pip install -r requirements.txt