This dataset is a new narrative understanding benchmark for predicting a character's personality from that character's narrative text in a script. We release the dataset and code for our work accepted at the NAACL Student Research Workshop 2022, "Machine Narrative Comprehension in Fictional Characters Personality Prediction Task", and at EMNLP 2022, "MBTI Personality Prediction for Fictional Characters Using Movie Scripts".
# `conda env create -f` takes no package specs; pins such as python=3.8 or pandas=1.5.2 belong in person_environment.yml
conda env create -f person_environment.yml
conda activate person
python -m spacy download en
Our data parser first reads the narrative books and movie scripts from HTML files, then extracts the utterances spoken by recognized characters. The whole process can take 3 to 5 hours. If you are only interested in the data, you can download it via this link and unzip it in the root folder.
# move the downloaded "dialog_scene_mention_dicts.zip" to the root folder
unzip dialog_scene_mention_dicts.zip
If you would like to see how the raw text data is processed, first download the HTML files from OneDrive. Their contents are the union of the NarrativeQA dataset and Movie-Script-Database. Please unzip the downloaded file in the repo root folder.
# move the downloaded "raw_texts.zip" to the root folder
unzip raw_texts.zip
We also share some other preprocessed files in the preprocessed/ folder, which our parser depends on. The following command generates dialog_dict.pickle, scene_dict.pickle, and mention_dict.pickle from scratch.
python parse.py
You will now have three .pickle files containing dictionaries of "what people say" and "who is mentioned" in a dialogue or a scene.
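To take a first look at these files, a minimal Python sketch along the following lines may help. It assumes the three pickles sit in the repo root, and it deliberately makes no assumption about the dictionaries' internal schema, which is not documented here. The helper names (`load_pickle`, `summarize`, `inspect_outputs`) are hypothetical and not part of this repo.

```python
import pickle
from pathlib import Path

# File names come from the parser description above; their location
# (the repo root) is an assumption.
PICKLE_NAMES = ["dialog_dict.pickle", "scene_dict.pickle", "mention_dict.pickle"]


def load_pickle(path):
    """Load one pickled object from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(obj, max_keys=3):
    """Describe a loaded object without assuming anything about its schema."""
    if isinstance(obj, dict):
        sample = list(obj)[:max_keys]
        return f"dict with {len(obj)} keys, e.g. {sample!r}"
    return type(obj).__name__


def inspect_outputs(root="."):
    """Summarize each expected pickle under `root`; report missing files."""
    report = {}
    for name in PICKLE_NAMES:
        path = Path(root) / name
        if path.exists():
            report[name] = summarize(load_pickle(path))
        else:
            report[name] = "missing"
    return report


if __name__ == "__main__":
    for name, summary in inspect_outputs().items():
        print(f"{name}: {summary}")
```

Because unpickling can execute arbitrary code, only load .pickle files that you generated yourself or obtained from a source you trust.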
To use the data for modeling, please go to dataset/ and download one of the tokenized datasets. Their format is more convenient for training and testing than the .pickle files. More details will be provided in the future.
If you find this repo useful, please consider citing our paper:
@article{sang2022mbti,
title={MBTI Personality Prediction for Fictional Characters Using Movie Scripts},
author={Sang, Yisi and Mou, Xiangyang and Yu, Mo and Wang, Dakuo and Li, Jing and Stanton, Jeffrey},
journal={arXiv preprint arXiv:2210.10994},
year={2022}
}