
Experiments in Unsupervised summarization

This is our PyTorch implementation of the summarization methods described in Unsupervised Sentence Compression using Denoising Auto-Encoders (CoNLL 2018). It features denoising auto-encoders with additive noising and optional NLI hidden-state initialization (based on InferSent).
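
For intuition, here is a minimal, hypothetical sketch of additive noising on a single sentence: extra tokens (e.g. words sampled from other sentences in the batch) are inserted at random positions, and the auto-encoder is trained to map the noised input back to the original sentence. The function and parameter names are illustrative only; the actual noising procedure (including shuffling and its hyperparameters) is defined in the paper and in src/.

import random

def add_noise(tokens, extra_tokens, num_extra=3):
    # Insert `num_extra` tokens drawn from `extra_tokens` at random positions.
    # The denoising auto-encoder learns to reconstruct the original `tokens`
    # from this longer, noised sequence.
    noised = list(tokens)
    for _ in range(num_extra):
        noised.insert(random.randrange(len(noised) + 1), random.choice(extra_tokens))
    return noised

# e.g. add_noise("the cat sat on the mat".split(), ["stocks", "fell", "sharply"])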

Model architecture

Table of Contents

  • Requirements
  • Quickstart
  • FAQ

Requirements

pip install -r requirements.txt

Quickstart

Step 1: Get the data and create the vocabulary

Gigaword data can be downloaded from https://github.com/harvardnlp/sent-summary. Extract it with tar -xzf summary.tar.gz. The vocabulary can then be created by running python src/datasets/preprocess.py train_data_file output_voc_file.
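
For reference, the vocabulary step boils down to counting tokens in the training file. The sketch below is illustrative only; the exact file format and special tokens are determined by src/datasets/preprocess.py:

import sys
from collections import Counter

def build_vocab(train_path, vocab_path, max_size=50000):
    # Count whitespace-separated tokens in the training file (one sentence
    # per line) and keep the most frequent ones as the vocabulary.
    counts = Counter()
    with open(train_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    with open(vocab_path, "w", encoding="utf-8") as out:
        for token, count in counts.most_common(max_size):
            out.write(f"{token}\t{count}\n")

if __name__ == "__main__":
    build_vocab(sys.argv[1], sys.argv[2])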

Step 2: Create an environment configuration file

This file tells the code where to find the datasets and embeddings on your machine, whether to use the GPU, and so on. You can see an example configuration at env_configs/env_config.json. You only need to set the nli variables if you use InferSent embeddings.

Then set the environment variable NLU_ENV_CONFIG_PATH to point to that file (e.g. export NLU_ENV_CONFIG_PATH="env_configs/env_config.json").

Step 3: Train the model

Simply run:

python sample_scripts/dae_json.py runs/default/default.json

Step 4: Run inference

python sample_scripts/simple_inference.py model_path test_data_path [output_data_path]

Step 5: Evaluate ROUGE scores

To evaluate ROUGE scores, we use files2rouge, which itself relies on pythonrouge.

Installation instructions:

pip install git+https://github.com/tagucci/pythonrouge.git
git clone https://github.com/pltrdy/files2rouge.git
cd files2rouge
python setup_rouge.py
python setup.py install

To run evaluation, simply run:

files2rouge summaries.txt references.txt
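
If the files2rouge toolchain is inconvenient to set up, a rough sanity check can be done with the rouge_score Python package (pip install rouge-score). Note that this is a different ROUGE implementation from the one used for the paper's reported numbers, so scores may differ slightly:

from rouge_score import rouge_scorer

# Compare system summaries against references line by line and average ROUGE-1 F1.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
with open("summaries.txt") as hyps, open("references.txt") as refs:
    scores = [scorer.score(ref.strip(), hyp.strip()) for hyp, ref in zip(hyps, refs)]
print("Average ROUGE-1 F1:", sum(s["rouge1"].fmeasure for s in scores) / len(scores))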

FAQ

  • Random seed: We did not use a random seed or random restarts for the results in the paper.
  • Teacher forcing: We used teacher forcing in all of our experiments.
  • Beam search: We decoded with greedy decoding only, never beam search (a generic sketch of greedy decoding is shown after this list).
  • Added noise: Noise is added on a per-sentence basis, not based on the maximum length in the batch; this is critical for performance.
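
As a point of reference, here is a generic sketch of greedy decoding for a seq2seq decoder. The decoder signature and variable names are hypothetical and do not correspond to this repository's API; at each step the highest-probability token is taken and fed back in, with no beam kept.

import torch

def greedy_decode(decoder, hidden, sos_id, eos_id, max_len=30):
    # `decoder` is assumed to map (previous token ids, hidden state)
    # to (logits over the vocabulary, new hidden state).
    inp = torch.tensor([[sos_id]])
    output_ids = []
    for _ in range(max_len):
        logits, hidden = decoder(inp, hidden)
        next_id = logits[:, -1, :].argmax(dim=-1)
        if next_id.item() == eos_id:
            break
        output_ids.append(next_id.item())
        inp = next_id.unsqueeze(0)
    return output_ids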