
EmailSum (ACL 2021)

This repository contains the data and code for the following paper:

EmailSum: Abstractive Email Thread Summarization

@inproceedings{zhang2021emailsum,
  title={EmailSum: Abstractive Email Thread Summarization},
  author={Zhang, Shiyue and Celikyilmaz, Asli and Gao, Jianfeng and Bansal, Mohit},
  booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
  year={2021}
}

Data

We release only the summaries we collected, plus scripts to extract the corresponding email threads from a flat email corpus (Avocado or W3C), because the Avocado corpus is copyright-protected by the Linguistic Data Consortium (LDC).

Requirements

  • Python 3
  • requirements.txt
  • Download the Avocado Research Email Collection from LDC

Avocado

We collected summaries for 2,549 Avocado email threads (see Avocado/summaries/EmailSum_data.json). After paper submission, we collected one more reference for each of the 500 email threads in the test set (see Avocado/summaries/one_more_reference.json).
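The summary files are plain JSON, so they are easy to inspect before running any of the scripts below. Here is a minimal sketch for loading them; the per-entry field names are not documented in this README, so treat the structure as an assumption and check the file itself for the real schema.

import json

# Load the collected summaries for a quick look.
# (The internal schema is an assumption -- inspect the file yourself.)
with open("Avocado/summaries/EmailSum_data.json") as f:
    data = json.load(f)
print("Annotated threads:", len(data))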

  • First, cd Avocado/

  • Download "emails.json" from here and put it under Avocado/

  • Extract threads, assuming $ROOT_DIR contains the LDC2015T03 release (i.e., $ROOT_DIR/LDC2015T03/Data/avocado-1.0.2)

python extract_threads.py --root_dir $ROOT_DIR

You will get "Avocado.json" which contains all extracted threads.

  • Anonymize & Filter

python anonymize.py

After this step, you can see the cleaned threads under "Avocado_threads/".

  • Prepare Train/Dev/Test files
python standardize.py

After this step, you can see the experimental files under "exp_data/". There are two sub-directories, "data_email_short" and "data_email_long", for short and long summaries, respectively. Each line of a *.source file is one email thread, in which the emails are separated by "|||".
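As a sanity check on the prepared files, the following sketch reads one thread from a *.source file and splits it into emails, based only on the format described above (one thread per line, emails separated by "|||"). The file name "train.source" is an assumption for illustration; paths are relative to "Avocado/".

# Read the first thread from a prepared *.source file and split it
# into individual emails using the "|||" separator described above.
with open("exp_data/data_email_short/train.source") as f:
    first_thread = f.readline()
emails = [e.strip() for e in first_thread.split("|||")]
print("Emails in first thread:", len(emails))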

W3C

We provide code for extracting threads from the W3C email corpus, which we use for semi-supervised learning.

  • First, cd "W3C/"

  • Download raw data files from here and put them under "W3C/raw_data/"

  • Extract threads

python extract_threads.py

You will get "W3C.json" which contains all extracted threads.

  • Anonymize & Filter

python anonymize.py

After this step, you can see all the cleaned threads under "W3C_threads/".

Model

Requirements

  • Python 3
  • PyTorch 1.7, transformers==2.11.0

Test pre-trained models

  • Download pre-trained models from here, decompress, and put them under "train/".

Note that we conduct model selection separately for each metric, so there are multiple best checkpoints; e.g., "checkpoint-rouge1" is the checkpoint that achieved the best ROUGE1 score on the development set. "best_ckpt.json" contains the best scores on the development set.
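To see which checkpoint is best for which metric, you can print the dev-set scores stored in "best_ckpt.json". This is a hedged sketch: the exact location under "train/" depends on how the downloaded archive decompresses, and the metric-name-to-score layout is an assumption, so inspect the file first.

import json

# Print the best development-set score per metric from best_ckpt.json.
# The path and the {metric: score} layout are assumptions; adjust them
# to match the decompressed archive.
with open("train/best_ckpt.json") as f:
    best = json.load(f)
for metric, score in best.items():
    print(metric, score)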

  • Prepare data

After you get "Avocado/exp_data/data_email_short" and "Avocado/exp_data/data_email_long", run

python3 data.py --data_dir Avocado/exp_data/data_email_long --cache_dir train/cache --max_output_length 128  
python3 data.py --data_dir Avocado/exp_data/data_email_short --cache_dir train/cache --max_output_length 56  

  • Test

T5 baselines

python3 run.py --task email_long --data_dir Avocado/exp_data/data_email_long/ --test_only --max_output_length 128
python3 run.py --task email_short --data_dir Avocado/exp_data/data_email_short/ --test_only --max_output_length 56

Hierarchical T5

python3 run.py --task email_long --memory_type ht5 --data_dir Avocado/exp_data/data_email_long/ --test_only --max_output_length 128
python3 run.py --task email_short --memory_type ht5 --data_dir Avocado/exp_data/data_email_short/ --test_only --max_output_length 56

Semi-supervised models

python3 run.py --task email_long_w3c --data_dir Avocado/exp_data/data_email_long/ --test_only --max_output_length 128
python3 run.py --task email_short_together --data_dir Avocado/exp_data/data_email_short/ --test_only --max_output_length 56

The testing scores will be saved in "best_ckpt_test.json". We provide "best_ckpt_test_verification.json" for verifying your results; you should obtain almost the same numbers.
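If you want to automate the comparison, a small sketch like the one below can diff your "best_ckpt_test.json" against the verification file. It assumes both files are flat JSON objects mapping metric names to numeric scores; any nested or non-numeric entries are skipped.

import json

# Compare freshly computed test scores with the provided verification
# file; both are assumed to be flat {metric: score} JSON objects.
with open("best_ckpt_test.json") as f:
    mine = json.load(f)
with open("best_ckpt_test_verification.json") as f:
    ref = json.load(f)
for metric in sorted(set(mine) & set(ref)):
    try:
        diff = abs(float(mine[metric]) - float(ref[metric]))
        print(metric, mine[metric], ref[metric], "diff=%.4f" % diff)
    except (TypeError, ValueError):
        continue  # skip nested / non-numeric entries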

We also provide "best_ckpt_test_old.json", which contains our previously tested scores (the ones reported in the paper). You are likely to get slightly different numbers from "best_ckpt_test_old.json" because we have since added a few more data cleaning and anonymization rules, so the pre-processed *.source files differ slightly from the ones we used before.

  • Test with two references

Just add "--two_ref", e.g.,

python3 run.py --task email_long --data_dir Avocado/exp_data/data_email_long/ --test_only --two_ref --max_output_length 128

The testing scores will be saved in "best_ckpt_test_2ref.json". We provide "best_ckpt_test_2ref_verification.json" for verifying your results; you should obtain almost the same numbers.

Benchmark Results

One-reference results:

EmailSum Short          rouge1  rouge2  rougeL  rougeLsum  BERTScore
T5 base                  36.61   10.58   28.29      32.77      33.92
HT5                      36.30   10.74   28.52      33.33      33.49
Semi-sup. (together)     36.99   11.22   28.71      33.70      33.91

EmailSum Long           rouge1  rouge2  rougeL  rougeLsum  BERTScore
T5 base                  43.87   14.10   30.50      39.91      32.07
HT5                      44.44   14.51   30.86      40.24      32.31
Semi-sup. (w3c)          44.58   14.64   31.40      40.73      32.80

Two-reference results (averaged over the two references):

EmailSum Short          rouge1  rouge2  rougeL  rougeLsum  BERTScore
T5 base                  35.22    9.60   27.08      31.22      32.45
HT5                      34.81    9.82   27.28      31.74      32.42
Semi-sup. (together)     35.52   10.35   27.29      33.11      32.24

EmailSum Long           rouge1  rouge2  rougeL  rougeLsum  BERTScore
T5 base                  43.41   13.81   29.97      39.32      31.58
HT5                      43.86   14.06   30.17      39.64      31.84
Semi-sup. (w3c)          43.99   14.18   30.56      40.12      32.04

Interestingly, we always get lower scores when comparing against the 2nd reference, which we collected after paper submission; this is why the two-reference results are consistently worse than the one-reference ones. A likely cause is that a different set of Turkers annotated the second reference, introducing a domain shift.

Train

Just drop "--test_only", e.g.,

python3 run.py --task email_long --data_dir Avocado/exp_data/data_email_long/ --max_output_length 128