This README describes data in the CMU Movie Summary Corpus, a collection of 42,306 movie plot summaries and metadata at both the movie level (including box office revenues, genre and date of release) and character level (including gender and estimated age). This data supports work in the following paper:

David Bamman, Brendan O'Connor and Noah Smith, "Learning Latent Personas of Film Characters," in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013.

All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).

DATA

  1. plot_summaries.txt.gz [29 M] Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.
  2. corenlp_plot_summaries.tar.gz [628 M, separate download] The plot summaries from above, run through the Stanford CoreNLP pipeline (tagging, parsing, NER and coref). Each filename begins with the Wikipedia movie ID (which indexes into movie.metadata.tsv).
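
A minimal sketch of reading plot_summaries.txt.gz in Python, using only the standard library. It assumes the movie ID and the summary are separated by a single tab; the description above only says the ID is "followed by" the summary, so check the separator against the actual file. The function names are illustrative and are not part of the corpus.

```python
import gzip
from typing import Iterator, Tuple


def parse_summary_line(line: str) -> Tuple[int, str]:
    """Split one line into (Wikipedia movie ID, summary text).

    ASSUMPTION: the ID and the summary are separated by one tab.
    Splitting only on the first tab keeps any later tabs inside the summary.
    """
    movie_id, summary = line.rstrip("\r\n").split("\t", 1)
    return int(movie_id), summary


def read_plot_summaries(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (Wikipedia movie ID, summary) pairs from the gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_summary_line(line)
```

The integer IDs returned here can be matched against the first column of movie.metadata.tsv to attach metadata to each summary.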

METADATA

  1. movie.metadata.tsv.gz [3.4 M] Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:
       1. Wikipedia movie ID
       2. Freebase movie ID
       3. Movie name
       4. Movie release date
       5. Movie box office revenue
       6. Movie runtime
       7. Movie languages (Freebase ID:name tuples)
       8. Movie countries (Freebase ID:name tuples)
       9. Movie genres (Freebase ID:name tuples)

  2. character.metadata.tsv.gz [14 M] Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:
       1. Wikipedia movie ID
       2. Freebase movie ID
       3. Movie release date
       4. Character name
       5. Actor date of birth
       6. Actor gender
       7. Actor height (in meters)
       8. Actor ethnicity (Freebase ID)
       9. Actor name
       10. Actor age at movie release
       11. Freebase character/actor map ID
       12. Freebase character ID
       13. Freebase actor ID
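
The two column layouts above (nine movie-level columns and thirteen character-level columns) can be turned into named fields with a short Python sketch like the one below. The column keys are hypothetical labels chosen for this example, since the files themselves are described only as tab-separated. The sketch also assumes that the "Freebase ID:name tuples" fields are serialized as JSON objects mapping ID to name; if the actual encoding differs, replace parse_id_name_field.

```python
import gzip
import json
from typing import Dict, Iterator, List, Optional

# Hypothetical key names, in the column order documented in this README.
MOVIE_COLUMNS = [
    "wikipedia_movie_id", "freebase_movie_id", "name", "release_date",
    "box_office_revenue", "runtime", "languages", "countries", "genres",
]
CHARACTER_COLUMNS = [
    "wikipedia_movie_id", "freebase_movie_id", "release_date",
    "character_name", "actor_date_of_birth", "actor_gender",
    "actor_height_m", "actor_ethnicity", "actor_name",
    "actor_age_at_release", "map_id", "freebase_character_id",
    "freebase_actor_id",
]


def parse_row(line: str, columns: List[str]) -> Dict[str, Optional[str]]:
    """Map one tab-separated line onto the given column names.

    Empty fields become None so missing values are easy to detect.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return {name: (value or None) for name, value in zip(columns, fields)}


def parse_id_name_field(raw: Optional[str]) -> Dict[str, str]:
    """Decode a languages/countries/genres field.

    ASSUMPTION: the field is a JSON object of {Freebase ID: name}.
    """
    return json.loads(raw) if raw else {}


def read_rows(path: str, columns: List[str]) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per line of a gzipped metadata file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_row(line, columns)
```

For example, read_rows("movie.metadata.tsv.gz", MOVIE_COLUMNS) would iterate over the movie-level file and read_rows("character.metadata.tsv.gz", CHARACTER_COLUMNS) over the character-level file. All values are kept as strings; converting dates, revenues and ages is left to the caller.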

TEST DATA

  1. tvtropes.clusters.txt: 72 character types drawn from tvtropes.com, along with 501 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.
  2. name.clusters.txt: 970 unique character names used in at least two different movies, along with 2,666 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.
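
Because the ID field in both cluster files points at the Freebase character/actor map ID, cluster instances can be resolved to character records with a simple lookup. The Python sketch below shows one way to do that. It does not parse the cluster files, whose line format is not specified here; it takes (label, map ID) pairs that the caller has already extracted. The "map_id" key and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

CharacterRow = Dict[str, str]


def index_by_map_id(characters: Iterable[CharacterRow]) -> Dict[str, CharacterRow]:
    """Build a lookup table keyed by Freebase character/actor map ID.

    Rows without a map ID are skipped.
    """
    return {row["map_id"]: row for row in characters if row.get("map_id")}


def resolve_clusters(
    instances: Iterable[Tuple[str, str]],
    by_map_id: Dict[str, CharacterRow],
) -> Dict[str, List[CharacterRow]]:
    """Group character rows under their cluster label.

    `instances` holds (cluster label, map ID) pairs; pairs whose ID is
    not present in the character metadata are dropped.
    """
    clusters: Dict[str, List[CharacterRow]] = defaultdict(list)
    for label, map_id in instances:
        row = by_map_id.get(map_id)
        if row is not None:
            clusters[label].append(row)
    return dict(clusters)
```

Dropping unmatched IDs keeps the sketch short; a stricter version might collect them instead, so that gaps between the cluster files and character.metadata.tsv are visible.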