A collection of plain text dialogue datasets

The Unlicense

Dialogue Datasets

A collection of plain text datasets I have found. You can find a zip of all of them here. This contains:

BNC Corpus

BNCCorpus.txt is the subset of the British National Corpus that is transcribed unscripted spoken dialogue, in plain text.

BNCSplitWordsCorpus.txt is the same, except that I used this to split apart some of the words in the corpus, because the original text had a lot of wordsthatwerecombinedlikethis.

I don't claim any licensing or ownership of this. I made it myself by parsing the raw XML dump, so if I'm not allowed to distribute it, just let me know and I can take it down.

Twitter dialogue dataset

I made this one by parsing tweets and their replies. It contains conversations of 2 or more tweets. Each tweet is on a separate line, there are three empty lines between dialogues, and the dialogues are sorted by length. TwitterConvCorpus.txt contains emojis and such. TwitterLowerAsciiCorpus.txt contains only dialogues of length 4 or more, converted to lowercase (because most of the text is already lowercase, because Twitter) with all of the non-ASCII characters removed.
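If you want to load the Twitter files, the layout described above is simple enough to parse by hand. The following is only a rough sketch, not the script that produced the files: the function names are made up for illustration, and it assumes that no tweet contains a line break and that the separator lines are completely empty.

```python
import re


def parse_dialogues(text):
    """Split corpus text into dialogues, each a list of tweet lines.

    Assumption: three empty lines between dialogues means four
    consecutive newline characters in the raw text.
    """
    text = text.replace("\r\n", "\n")  # tolerate Windows line endings
    blocks = re.split(r"\n{4,}", text.strip("\n"))
    return [
        [line for line in block.split("\n") if line.strip()]
        for block in blocks
        if block.strip()
    ]


def to_lower_ascii(line):
    """Drop every non-ASCII character, then lowercase what is left."""
    return line.encode("ascii", errors="ignore").decode("ascii").lower()


def lower_ascii_corpus(dialogues, min_length=4):
    """Approximate the TwitterLowerAsciiCorpus.txt filtering (a guess at the steps)."""
    return [
        [to_lower_ascii(tweet) for tweet in dialogue]
        for dialogue in dialogues
        if len(dialogue) >= min_length
    ]


# Tiny made-up sample: one 2-tweet dialogue and one 4-tweet dialogue.
sample = "hi\nhello\n\n\n\nA\nB \u263a\nC\nD"
dialogues = parse_dialogues(sample)
filtered = lower_ascii_corpus(dialogues)
```

To use it on the real data, read TwitterConvCorpus.txt into a string (UTF-8 is probably the right encoding, but I have not checked) and pass that string to parse_dialogues.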

Movie Corpus

A collection of movie scripts. I got this from here.