Code in support of this post: The Simpsons by the Data
It's a Rails app, but isn't intended to be run as a server. It processes data from Simpsons World, Wikipedia, and IMDb, and populates a PostgreSQL database called simpsons_development. The database contains 4 primary tables: episodes, script_lines, characters, and locations.
Assumes you have Ruby and PostgreSQL installed:
git clone git@github.com:toddwschneider/flim-springfield.git
cd flim-springfield/
createdb simpsons_development
bundle exec rake db:migrate
bundle exec rake import_data
bundle exec rake jobs:work
It takes about 45 minutes to process everything with one worker
R code to analyze the data lives in the analysis/ folder
- I deduped some character names that are printed in different ways, e.g. "TROY" is the same as "Troy McClure", but I certainly did not dedupe all 6,000+ characters that appear in the scripts
- Similarly, I manually assigned genders to the top 320 or so characters, who collectively account for 86% of the show's dialogue
- I did not dedupe any locations
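
The manual character deduping described above can be pictured as a lookup from variant spellings to one canonical name. The following is only a minimal sketch of that idea, not the code this repo actually uses: the CHARACTER_ALIASES table and both helper methods are hypothetical names, and only the "TROY" to "Troy McClure" mapping comes from the notes above.

```ruby
# Hypothetical alias table: normalized variant spelling => canonical name.
# Only the Troy McClure mapping is taken from the README; a real table
# would need one entry per variant you want to merge.
CHARACTER_ALIASES = {
  "troy"         => "Troy McClure",
  "troy mcclure" => "Troy McClure"
}.freeze

# Trim the name and collapse internal runs of whitespace.
def clean_name(raw)
  raw.to_s.strip.gsub(/\s+/, " ")
end

# Return the canonical name when an alias is known; otherwise return the
# cleaned-up original, which leaves unknown characters un-deduped.
def canonical_character_name(raw)
  cleaned = clean_name(raw)
  CHARACTER_ALIASES.fetch(cleaned.downcase, cleaned)
end

puts canonical_character_name("TROY")
puts canonical_character_name("Homer  Simpson")
```

Names missing from the table pass through unchanged apart from whitespace cleanup, which mirrors the caveat that most of the long tail of characters was never deduped.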