/television_show_recommender

I used transcripts of television episodes to look for patterns between shows with the Latent Dirichlet Allocation (LDA) model, which clusters shows that use similar words at similar frequencies.

Television Show Recommender System


Executive Summary


I analyzed the transcripts of 117,937 television episodes from 4,667 different television shows using Latent Dirichlet Allocation (LDA) to find clusters of common language between shows, and then used those similarities to build a content-based recommender for television shows.
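
To make the recommendation step concrete, here is a minimal sketch of how per-show topic mixtures could be turned into recommendations. It is an illustration only, not the code in this repository: the show names and topic weights are made up, the function names `cosine_similarity` and `recommend` are hypothetical, and cosine similarity is an assumed choice of metric (the project may compare shows differently).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def recommend(title, topic_vectors, top_n=2):
    """Return the top_n shows most similar to `title`, best match first."""
    target = topic_vectors[title]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in topic_vectors.items()
        if other != title  # never recommend the show itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical topic mixtures, standing in for what an LDA model
# might assign to each show (one weight per topic, summing to 1).
topic_vectors = {
    "Show A": [0.80, 0.15, 0.05],
    "Show B": [0.70, 0.20, 0.10],
    "Show C": [0.05, 0.15, 0.80],
    "Show D": [0.10, 0.80, 0.10],
}

for name, score in recommend("Show A", topic_vectors):
    print(f"{name}: {score:.3f}")
```

In this toy data, "Show B" shares most of its weight with "Show A" on the first topic, so it ranks first. In practice the vectors would come from the trained LDA model rather than being written by hand.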

System Requirements


  • Python==3.7.3
  • gensim==3.8.1
  • Flask==1.1.1
  • nltk==3.4.5
  • pandas==0.25.2
  • matplotlib==3.1.1
  • numpy==1.17.2
  • spacy==2.2.1
  • spacy-langdetect==0.1.2
  • beautifulsoup4==4.8.0

For Google Cloud Virtual Instance:

  • a virtual machine with at least 104 GB of RAM
  • google-api-core==1.14.3
  • google-auth==1.7.1
  • google-auth-oauthlib==0.4.1
  • google-cloud==0.34.0
  • google-cloud-core==1.0.3
  • google-cloud-storage==1.23.0
  • google-pasta==0.1.8
  • google-resumable-media==0.5.0

How to Use this Repository


All final production code is in the final_code folder. The development_code folder contains other code written during the project that was not used to create the final result. The notebooks and Python scripts are listed in chronological order. None of my final data is posted because of its size (2.6 GB), but please contact me if you would like a copy!