
Find larger, processed dataset

Current dataset consists of 1093 scripts in Unicode text with inconsistent tabbing / spacing. Would be nice to find a cleaner, pre-processed dataset.
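If no cleaner dataset turns up, the tabbing / spacing could be normalized locally. Below is a minimal sketch of one way to do that, not the project's actual code. The function name `normalize_script` and the `tab_width` default are assumptions for illustration. It keeps leading indentation, since script layouts often use indentation to separate dialogue from scene description. Note that NFKC normalization also folds other compatibility characters (e.g. ligatures), which may or may not be wanted.

```python
import re
import unicodedata


def normalize_script(text: str, tab_width: int = 4) -> str:
    """Normalize whitespace in one raw script (illustrative sketch)."""
    # Fold compatibility characters such as non-breaking spaces to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Unify Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = []
    for line in text.split("\n"):
        # Expand tabs to spaces and drop trailing whitespace,
        # leaving leading indentation in place.
        lines.append(line.expandtabs(tab_width).rstrip())
    text = "\n".join(lines)
    # Collapse runs of two or more blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim blank lines at the start and end, and finish with one newline.
    return text.strip("\n") + "\n"
```

Usage would be something like `clean = normalize_script(raw_text)` applied to each of the 1093 files after decoding.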

Currently using the CMU Movie Summary Corpus. May attempt to process and tag Acerbi's 1093 film scripts. Will attempt to integrate the Cornell Movie-Dialogs Corpus.

Currently using the Cornell Movie-Dialogs Corpus for dialogue, pre-seeding each dialogue with the output of the previous dialogue. For meta-text (prose-like scene descriptions), I am aiming to build a prose corpus and to pre-seed meta-text with the preceding dialogue. For the prose corpus, I am looking for books / essays that are primarily narration (i.e. not dialogue; so not Mark Twain) and that have a simple, clear, modern syntactic structure (so, not Dickens or Austen or anybody prior to 1950, really). The works also need to be reasonably easy to find online as plain text, so the books would probably need to be over a decade old or in the public domain.
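The pre-seeding scheme for dialogue can be sketched roughly as follows. This is an illustration under assumptions, not the actual generation code: `generate` stands in for whatever trained model produces text from a seed, and `generate_dialogue_chain` and `echo_model` are hypothetical names.

```python
from typing import Callable, List


def generate_dialogue_chain(
    generate: Callable[[str], str],
    first_seed: str,
    num_turns: int,
) -> List[str]:
    """Produce num_turns dialogues, seeding each with the previous output."""
    outputs: List[str] = []
    seed = first_seed
    for _ in range(num_turns):
        line = generate(seed)
        outputs.append(line)
        # The next dialogue is pre-seeded with the one just generated.
        seed = line
    return outputs


# Stand-in generator for illustration only; a trained model would go here.
def echo_model(seed: str) -> str:
    return seed + "!"
```

With the stand-in generator, `generate_dialogue_chain(echo_model, "Hi", 3)` chains three outputs, each built from the one before. Meta-text could presumably follow the same pattern, with a second generator trained on the prose corpus and the preceding dialogue passed in as its seed.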