This code is for paper: Topic-Focused Extractive Summarization
(https://apps.cs.utexas.edu/apps/sites/default/files/tech_reports/Akshay_Honors_Thesis.pdf)
Python version: This code is in Python3.7
Package Requirements: pytorch sklearn numpy rouge transformers allennlp_models nltk tqdm allennlp spacy
Download and unzip the stories
directories from here for both CNN and Daily Mail. Put all .story
files for your domain (you'll have to filter them yourself) in one directory (e.g. ../raw_stories
)
python preprocess.py -mode jsonify -raw_path RAW_PATH -dataset_name DATASET_NAME
RAW_PATH
is the directory containing story files (../raw_stories
)DATASET_NAME
is the name you're choosing for the dataset containing all your stories - this will be used for all actions concerning this data moving forward. It also determines the target directory to save the generated json files (../data/DATASET_NAME/raw/*.json
)
python preprocess.py -mode bert_tokens_and_linear_features -dataset_name DATASET_NAME [-overwrite]
- This extracts linear features and BERT tokens for each document in a dataset to
../data/DATASET_NAME/linear/
and../data/DATASET_NAME/bert/
respectively, withtrain/
,test/
andval/
subdirectories in each DATASET_NAME
is the name of the dataset for which to extract tokens/features, json files for documents are extracted based on this- If
-overwrite
is set and train/test/val files have already been written for this dataset, then they are overwritten with a new train/test/val split, otherwise the procedure exits
python preprocess.py -mode construct_oracles -dataset_name DATASET_NAME [-vanilla_oracles]
DATASET_NAME
is the name of the dataset for which to construct oracle extractive summaries, they are saved in../data/DATASET_NAME/raw/oracles.json
- If
-vanilla_oracles
is set, then the vanilla oracle construction algorithm is used (see code for details)
python preprocess.py -mode topic_clustering -dataset_name DATASET_NAME
DATASET_NAME
is the name of the dataset for which to generate topic representations for summaries, they are saved in../data/DATASET_NAME/raw/topics.json
python train.py -dataset_name DATASET_NAME -model_type MODEL_TYPE -topic TOPIC -batch_size BATCH -epochs EPOCHS [-mini]
DATASET_NAME
is the name of the dataset for which to train a model of topicTOPIC
andMODEL_TYPE
is the type of model to train (linear
orbert
), models are saved in../models/DATASET_NAME/MODEL_TYPE/
as eitherTOPIC/
for BERT models orTOPIC.th
for Linear models- Recommended
BATCH
for BERT model is1
, and for Linear model is16
- Recommended
EPOCHS
for BERT model is4
and for Linear model is100
- If BERT model keeps running into memory on your documents because they're too long, try
-mini
- this will shorten all documents tok=10
sentences, preserving the oracle, for training.
After models for all topics have been trained, run
python eval.py -dataset_name DATASET_NAME -model_type MODEL_TYPE -mode MODE [-topics TOPICS] [-write]
DATASET_NAME
is the name of the dataset for which to test the system using models of typeMODEL_TYPE
(linear
orbert
)MODE
can bevanilla
,reconstruct
orranking
, these are three different evaluation schemesTOPICS
is of the form1 2 3 4
-topics
is only used for whenMODE
isvanilla
, it is the list of topics for which to build a summary - for other modes all topics are used. If this is not specified forvanilla
all topics are assumed- If
-write
is set, the summaries produced during evaluation are written to../results/DATASET_NAME/MODEL_TYPE_MODE_[TOPICS].txt