Affect-rich Dialogue Generation using OpenSubtitles 2018

Semester project in the human interaction group (https://hci-test.epfl.ch/), on the topic of dialogue generation.

Goal of this project

First, we extract multi-turn dialogues from OpenSubtitles 2018, segmenting the subtitle stream based on sentence similarity; then we improve a basic Seq2Seq model with the affect-rich approach (VAD embeddings) and the MMI objective function.
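
To make the segmentation idea concrete, here is a minimal sketch of similarity-based cutting, assuming precomputed sentence vectors; the actual features, similarity measure and threshold used in the project may differ (see the report).

```python
# A minimal sketch of similarity-based dialogue segmentation, assuming
# precomputed sentence vectors (hypothetical inputs; the project's actual
# features and threshold may differ).
import numpy as np

def segment_dialogues(lines, vectors, threshold=0.3):
    """Cut a stream of subtitle lines into dialogues whenever the cosine
    similarity between consecutive sentence vectors drops below threshold."""
    dialogues, current = [], [lines[0]]
    for prev, curr, line in zip(vectors, vectors[1:], lines[1:]):
        denom = np.linalg.norm(prev) * np.linalg.norm(curr)
        sim = float(prev @ curr) / denom if denom else 0.0
        if sim < threshold:  # low similarity -> treat as a topic shift
            dialogues.append(current)
            current = []
        current.append(line)
    dialogues.append(current)
    return dialogues
```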

Results

Parsing the OpenSubtitles:

  1. OpenSubtitles 2018: we clean and parse the original OpenSubtitles 2018 and save the lines with timestamps in .txt files (see the parsing sketch after this list). Only a few samples are uploaded to GitHub because the whole data set is huge; the full processed OpenSubtitles 2018 can be downloaded from https://drive.google.com/open?id=1ZUJ2J8ukuXhMKXd0SVG1pCZ5ZnJZ9Csc. The data is not labelled with characters, scenes or dialogue boundaries.
  2. Scripts data set: created from 985 scripts and saved in './dataset/scripts/script_data_set.csv'. Fully labelled with characters, dialogue boundaries and movie names.
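
As a rough illustration of the parsing step in item 1, here is a hedged sketch that turns one OPUS-style OpenSubtitles XML file into timestamped lines. It assumes the usual OPUS layout (gzipped XML with <s> elements holding <time> markers); element and attribute names can vary between releases, and the project's actual parser may do more cleaning.

```python
# A minimal sketch of extracting "timestamp<TAB>text" lines from one
# OpenSubtitles 2018 XML file, assuming the common OPUS layout.
import gzip
import xml.etree.ElementTree as ET

def parse_subtitle(path):
    with gzip.open(path, "rb") as f:
        root = ET.parse(f).getroot()
    for s in root.iter("s"):
        # Take the first <time> marker as the line's start timestamp.
        start = next((t.get("value") for t in s.iter("time")), "")
        tokens = " ".join(s.itertext()).split()  # normalize whitespace
        if tokens:
            yield start, " ".join(tokens)

# Usage (hypothetical path):
# for ts, line in parse_subtitle("path/to/subtitle.xml.gz"):
#     print(f"{ts}\t{line}")
```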

Dialogue segmentation

For the dialogue segmentation part, we validate our method on the Cornell Movie-Dialogs Corpus and test it on our own scripts data set, reaching a P_k of 0.262 and 0.295 respectively. The results are saved in Google Drive: https://drive.google.com/open?id=1niLIVTBJkhxRAUdLk67uthcDwxA2Pj0M

OpenSubtitles 2018 contains more than 350,000 files in total, which is hard to process at once, so we split them into seven parts of 50,000 files each; segmenting one part takes more than 7 hours. So far we have session_segmentation_0.txt (more than 4,000,000 extracted dialogues).
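
For reference, P_k (Beeferman et al.) estimates the probability that two sentences a fixed distance apart are wrongly judged to be in the same or different segments, so lower is better. A minimal sketch of computing it with NLTK's implementation, on illustrative boundary strings:

```python
# Compute P_k with NLTK's implementation; segmentations are encoded as
# strings where '1' marks a dialogue boundary after a sentence.
# The strings below are illustrative only, not project data.
from nltk.metrics.segmentation import pk

reference  = "0100010000"  # gold boundaries
hypothesis = "0100000100"  # predicted boundaries
print(pk(reference, hypothesis))  # lower is better
```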

Affect-rich dialogue generation

For affect-rich dialogue generation, the model was not trained thoroughly (small training set and few epochs), so the predictions are far from perfect, but the effect of MMI + VAD embeddings is still clear. The predictions are saved in 'results/predictions.csv'.
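
As a sketch of what the MMI part does: in the anti-LM form, candidate responses are reranked by log P(T|S) - lambda * log P(T), which penalizes generic replies that any source sentence would produce. The callables and lambda below are placeholders, and the project's exact formulation may differ (see the report).

```python
# A minimal sketch of MMI reranking (anti-LM form), assuming the Seq2Seq
# model and a language model each expose a log-probability function
# (hypothetical interfaces, not this repo's actual API).
def mmi_rerank(source, candidates, seq2seq_logprob, lm_logprob, lam=0.5):
    """Rank candidate responses by log P(T|S) - lambda * log P(T)."""
    scored = [(seq2seq_logprob(source, t) - lam * lm_logprob(t), t)
              for t in candidates]
    return [t for _, t in sorted(scored, reverse=True)]
```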

Structure of this repo

  1. code: all code is saved in this folder. code/jupyternb contains notebooks showing the whole process described in the report; code/py contains .py files for parsing and segmenting the OpenSubtitles data; and code/affect-rich contains the code for affect-rich dialogue generation.
  2. datasets: examples of each data set.
  3. papers: related papers.
  4. results: samples of the segmentation and dialogue-generation results.

For more details please check the report.