/movies

Using Letterboxd personal data, TMDB API, and data science techniques to analyze movie watching data

Primary language: Jupyter Notebook. License: MIT

As an avid movie watcher, the opportunity to combine my loves of data science and film was too good to pass up. Having been a Letterboxd user for many years, I used my personal watch history data to better understand my viewing patterns and learn more about myself in the process. For a more detailed report on this project and its methodology, please check out my LinkedIn article and its follow-up detailing the credits code implementation.

The first iteration of this project used Power BI for the end reporting, but in December 2023 I created a web UI version that uses MongoDB and the Streamlit library! The two Python scripts needed to add or update the data in a MongoDB collection are included as optional scripts in this project. You can find the code repo on my GitHub, and the site can be accessed at letterboxd.streamlit.app.

Who is this project for?

  • Cinephiles with a passion for coding
  • Developers interested in multiple areas of development (API calling, sentiment analysis, regression-based modeling, etc.)
  • Users of other logging platforms (e.g. Goodreads rather than Letterboxd, IMDb, or another movie logging site) who also want to derive personal analytics from their platform usage

Accessing The Data

  • Download personal movie data from Letterboxd: Settings -> Import & Export -> Export Your Data
  • You can also access the direct link to download your data here
  • Save the following files: watched.csv, ratings.csv, reviews.csv, and diary.csv
  • Request an API key from TMDB
  • Once you have the API key, use it with movies_api.py to retrieve additional movie data
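
As a rough illustration of that last step, the sketch below looks up a single film by name and year against TMDB's movie search endpoint. It is not the code in movies_api.py: the helper names (build_search_url, search_movie) and the TMDB_API_KEY environment variable are assumptions made for this example, and it uses only the Python standard library.

```python
import json
import os
import urllib.parse
import urllib.request

TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"


def build_search_url(api_key, name, year=None):
    """Build a TMDB movie-search URL for one Letterboxd row (hypothetical helper)."""
    params = {"api_key": api_key, "query": name}
    if year:
        params["year"] = str(year)
    return f"{TMDB_SEARCH_URL}?{urllib.parse.urlencode(params)}"


def search_movie(api_key, name, year=None):
    """Return the first TMDB search result as a dict, or None if nothing matched."""
    url = build_search_url(api_key, name, year)
    with urllib.request.urlopen(url, timeout=10) as resp:
        results = json.load(resp).get("results", [])
    return results[0] if results else None


if __name__ == "__main__":
    # Assumes the key was exported as TMDB_API_KEY (an assumption of this sketch).
    key = os.environ.get("TMDB_API_KEY")
    if key:
        movie = search_movie(key, "Lady Bird", 2017)
        print(movie["id"] if movie else "no match")
```

Taking the first search result is a simplification; films that share a title and year may need a stricter match.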

Usage Insights

  • TMDB allows 30-40 API requests every 10 seconds, so if you have thousands of movies logged, as I do, this could noticeably lengthen the run time of movies_api.py
  • If you've logged a limited series or prestige TV on the app (like the Emmy Award-winning limited series Big Little Lies), those won't have any TMDB API hits, since the script is pointed at the movie side of the database. I removed those records since they aren't within the scope of the project anyway.
  • Even though the dashboard was created in Power BI, I wrote the code in movies_eda.ipynb to re-create all the visualizations from the final dashboard. I included it as an ipynb rather than just a .py script so you can see the output of each code chunk, but a .py version would be suitable as well if you are using a different IDE.
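
One way to stay under a request limit like the one described in the first bullet is a small rolling-window throttle. This is a sketch only, not taken from movies_api.py; the Throttle class, its defaults (30 calls per 10 seconds), and the injectable clock and sleep parameters are assumptions for illustration.

```python
import time
from collections import deque


class Throttle:
    """Allow at most max_calls within any rolling window of `period` seconds."""

    def __init__(self, max_calls=30, period=10.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have already left the rolling window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._stamps[0]))
            now = self._clock()
            self._stamps.popleft()
        self._stamps.append(now)
```

Calling throttle.wait() immediately before each API request keeps a long loop over thousands of films within the limit, at the cost of pausing whenever the window is full.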

Script Execution Order

  1. movies_api.py
  2. (Optional) movies_hours.py
  3. movies_api_credits.py
  4. movies_api_credits_cleaning.py
  5. movies_cleaning.py
  6. movies_sentiment.py
  7. movies_modeling.py
  8. (Optional) movies_eda.py or movies_eda.ipynb
  9. (Optional) mongodb_create.py or mongodb_update.py
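
If you would rather not run each script by hand, the order above could be wrapped in a small driver. The sketch below is a hypothetical convenience, not a file in this repo; it assumes every script sits in the current working directory and, for the last step, picks mongodb_create.py (swap in mongodb_update.py when the collection already exists).

```python
import subprocess
import sys

# (script, is_optional) pairs in the execution order listed above.
PIPELINE = [
    ("movies_api.py", False),
    ("movies_hours.py", True),
    ("movies_api_credits.py", False),
    ("movies_api_credits_cleaning.py", False),
    ("movies_cleaning.py", False),
    ("movies_sentiment.py", False),
    ("movies_modeling.py", False),
    ("movies_eda.py", True),
    ("mongodb_create.py", True),
]


def scripts_to_run(include_optional=False):
    """Return script names in order, skipping optional ones unless requested."""
    return [name for name, optional in PIPELINE if include_optional or not optional]


def run_pipeline(include_optional=False):
    """Run each script with the current interpreter, stopping at the first failure."""
    for script in scripts_to_run(include_optional):
        print(f"Running {script}")
        subprocess.run([sys.executable, script], check=True)
```

Because check=True raises on a non-zero exit code, a failure in an early script stops the later ones from running on incomplete data.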

Data Dictionary

  • Logged_Date -- Date I logged the film on Letterboxd
  • Name -- Name of the film as it appears on Letterboxd's site
  • Year -- Generally, the year of the US release date. Can vary depending on whether it was released internationally or at film festivals first
  • Rating -- Star rating I gave the film, on a scale of 0 to 5 in increments of 0.5
  • Review -- Boolean value that preserves whether or not I wrote a review for the film on Letterboxd
  • id -- Unique identifying value in TMDB's database
  • english_language -- Boolean value that records whether or not the movie's original language is English. I considered breaking this value out further, but over 90% of the films are, surprisingly, listed as English-language in TMDB
  • overview -- Provides brief synopsis of the film
  • popularity -- Internally calculated score based on site interaction data. More information about this feature can be found here
  • vote_average -- Average user rating of the film on a scale of 0 to 10
  • vote_count -- Total number of users who rated the film
  • vote_revenue -- Total amount of money grossed at the domestic and international box office
  • runtime -- Total running length of the film excluding commercials, measured in minutes
  • tagline -- Marketing verbiage which provides a punchy incentive for potential viewers to choose to watch the film
  • watch_count -- Number of times I have seen the film, counted from diary entries
  • min_watched -- runtime * watch_count
  • Logged_DOW -- Extracts day of the week from the Logged_Date values, recorded in numeric form (0 - Monday, 1 - Tuesday, 2 - Wednesday, 3 - Thursday, 4 - Friday, 5 - Saturday, 6 - Sunday)
  • Logged_Month -- Extracts month value from the Logged_Date values
  • Logged_Year -- Extracts year value from the Logged_Date values
  • Logged_Week -- Week-of-year value, from 0 to 54, calculated from the Logged_Date values
  • Daily_Movie_Count -- Number of movies I watched on a given date, calculated from the Logged_Date values
  • Weekly_Movie_Count -- Number of movies I watched in a given week, calculated from the Logged_Week and Logged_Year values
  • genres -- Several boolean columns exist that indicate whether or not the movie was classified into the following genres: (Action, Crime, War, Drama, Thriller, Mystery, Comedy, Romance, Sci_Fi, Animation, Documentary, Adventure, Music, Horror, Fantasy, History, Western, Rom_Com)
  • female_roles -- Measures the number of female roles in the first 20 billed of a movie's acting credits
  • female_driven -- Boolean value that records whether 9 or more of those 20 roles are female, therefore classifying the film as "female-driven"
  • female_directed -- Boolean value that records whether or not the director of the film self-identifies as female
  • negativity_percentage -- Measures what percentage of the string input has a negative association
  • neutrality_percentage -- Measures what percentage of the string input has a neutral association
  • positivity_percentage -- Measures what percentage of the string input has a positive association
  • movie_sentiment -- Compound score that aggregates the positive, negative, and neutral percentages into a single value. The closer this value is to 1, the more positive the movie's overview is
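
Several of the fields above are derived from other columns. The sketch below shows one possible way to compute them, using plain Python dicts and the standard library to stay dependency-free. It is not the code from movies_cleaning.py; the function name add_derived_fields is hypothetical, and the week-numbering convention (strftime "%U", which yields 0 to 53) is an assumption that may differ from the project's own.

```python
from collections import Counter
from datetime import date


def add_derived_fields(rows):
    """Add date-based and watch-time fields to a list of row dicts (in place)."""
    daily = Counter(r["Logged_Date"] for r in rows)
    for r in rows:
        d = date.fromisoformat(r["Logged_Date"])
        r["Logged_DOW"] = d.weekday()  # 0 = Monday ... 6 = Sunday
        r["Logged_Month"] = d.month
        r["Logged_Year"] = d.year
        # Assumption: Sunday-based week number; the project may use another convention.
        r["Logged_Week"] = int(d.strftime("%U"))
        r["Daily_Movie_Count"] = daily[r["Logged_Date"]]
        r["min_watched"] = r["runtime"] * r["watch_count"]
        # "Female-driven" means 9 or more of the first 20 billed roles are female.
        r["female_driven"] = r["female_roles"] >= 9
    weekly = Counter((r["Logged_Year"], r["Logged_Week"]) for r in rows)
    for r in rows:
        r["Weekly_Movie_Count"] = weekly[(r["Logged_Year"], r["Logged_Week"])]
    return rows
```

Each input row is expected to carry Logged_Date as an ISO date string (YYYY-MM-DD) plus numeric runtime, watch_count, and female_roles values.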

Future Project Expansions

  • Integrate additional movie attributes such as the film's director, leading actors, and thematic content (completed Jan 2023 with the "credits" expansion)
  • Calculate total number of minutes and hours of movies watched using re-watch logs in the Diary dataset (completed Dec 2024 with the movies_hours code)
  • Rather than just the film's language, integrate country of origin to better understand domestic vs. international viewing
  • Left join on the Diary dataset rather than the Watched one to conduct time series analysis and predict what genre or type of movies I'll watch next (partially addressed in July 2024 with the expansion to calculate minutes watched per film)
