
This repository contains the code for the Plot-guided-Coherence-Evaluation paper. To cite it, please use:

  title={Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation},
  author={Sarik Ghazarian and Zixi Liu and Akash S M and Ralph Weischedel and Aram Galstyan and Nanyun Peng},
  booktitle={The 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},

Install Requirements

Install the necessary packages listed in requirements.txt. Our manipulations use the COMET model to manipulate the logically ordered plots. Download the model and save it in a new directory called "pretrained_models".

Data Creation, Evaluator Training, and Testing Steps

  1. Our four proposed manipulation approaches (Non-logically Ordered Plots, Contradiction Insertion, Repetition Insertion, and Random Substitution) can be applied by running:

       python --data_dir Data/WP/WP_Eval --fname WP_train

       python --data_dir Data/ROC/ROC_Eval/ --fname Rocstories_train

Note: To run these two scripts, you need to first install and then put these scripts in the home directory and run them from there to get the manipulated plots.

  1. To generate implausible stories conditioned on the manipulated plots, we use the BART model as a conditional language model (LM). We finetuned BART on both the ROC_LM and WP_LM data for three epochs using Fairseq. You can download these models from ft_BART_ROC and ft_BART_WP. Place the BART model finetuned on the ROCstories dataset in Models/Ft_BART_Story_Generator/ROC/ and the one finetuned on the WP dataset in Models/Ft_BART_Story_Generator/WP/.

  2. We leverage the finetuned BART models to generate 6 different negative samples for each plausible story and then use the Adversarial Filtering (AF) technique proposed by Zellers et al. (2019) to select the three most challenging implausible ones for the evaluator. To generate the negative samples and prepare the data for the AF technique, run:

       python --num_negative_samples 6

       python --num_negative_samples 6

       You can set different generation parameters for generating various implausible stories.

  1. We follow the code for AF on the Data/WP/WP_Eval/WP_AF_input.json and Data/ROC/ROC_Eval/ROC_AF_input.json data to select the challenging implausible stories.

  2. The output of the AF technique is in JSON format. We convert it to TSV, which is a suitable input format for our evaluators. In this format, each plausible story has the label "1" and its three implausible stories have the label "0".
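The JSON-to-TSV conversion described above can be sketched roughly as follows. This is only an illustration, not the repository's script: the JSON keys `plausible` and `implausible`, the helper names, the column order (story, then label), and the example stories are all assumptions made up for this sketch, since the actual AF output schema is not described here.

```python
import io
import json


def af_records_to_rows(records):
    """Yield (story, label) pairs: the plausible story first, then its negatives.

    Assumes each record has the hypothetical keys "plausible" (a string)
    and "implausible" (a list of strings).
    """
    for record in records:
        yield record["plausible"], "1"
        for story in record["implausible"]:
            yield story, "0"


def write_tsv(records, handle):
    """Write one `story<TAB>label` line per story to an open text handle."""
    for story, label in af_records_to_rows(records):
        # Collapse tabs and newlines so each story stays on a single TSV line.
        handle.write(" ".join(story.split()) + "\t" + label + "\n")


# Made-up example record standing in for one entry of the AF output.
example_json = json.dumps([
    {
        "plausible": "Tom lost his keys. He found them in his coat.",
        "implausible": [
            "Tom lost his keys. He ate them.",
            "Tom lost his keys. Tom lost his keys.",
            "Tom found his keys.\tThen he lost his coat.",
        ],
    }
])

rows = list(af_records_to_rows(json.loads(example_json)))
buffer = io.StringIO()
write_tsv(json.loads(example_json), buffer)
tsv_text = buffer.getvalue()
print(tsv_text, end="")
```

To write a real file, pass a handle opened with `open(path, "w", encoding="utf-8")` instead of the in-memory buffer, and adjust the key names to match the actual AF output.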



  1. We use the code from Hugging Face to finetune the RoBERTa model for the ROCstories dataset and Longformer for the WP dataset. You can download the evaluators from ft_roberta and ft_longformer. Place these models in the Models/Ft_RoBERTa/ and Models/Ft_Longformer/ directories, respectively. We also use code to predict the scores for the test data.

  2. To examine the performance of our evaluators, we collected human judgments through Amazon Mechanical Turk (AMT). The Data/AMT/AMT_ROC.csv and Data/AMT/AMT_WP.csv files contain these human evaluations. To get the Spearman and Kendall correlations between the scores predicted by our evaluators and the human judgments, run:

       python --data ROC

       python --data WP
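For readers who want to see what these correlations measure, here is a minimal standard-library sketch of Spearman's rho and Kendall's tau-b. It is not the repository's script: the function names and the example scores are invented for illustration, and the column layout of the AMT CSV files is not assumed. In practice, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` compute the same statistics.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects the denominator for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # pairs tied in both variables do not contribute
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + only_x_tied) * (untied + only_y_tied))


# Made-up scores for five stories: evaluator predictions vs. human ratings.
predicted = [0.9, 0.2, 0.7, 0.4, 0.1]
human = [5, 1, 4, 4, 2]
rho = spearman(predicted, human)
tau = kendall_tau_b(predicted, human)
print(f"Spearman: {rho:.3f}  Kendall: {tau:.3f}")
```

Both statistics depend only on the ordering of the scores, so they suit comparing evaluator outputs against human ratings that are on a different scale.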

For any comments or issues, feel free to contact me.