Reviewer

Tool for extracting and analyzing IMDB reviews

Primary language: Jupyter Notebook. License: MIT.

Code for scraping IMDB reviews and applying NER and sentiment analysis to extract character popularity.

Reviewer can be used to scrape user reviews from IMDB, generate word clouds based on a custom class-based TF-IDF, and extract popular characters/actors from reviews using a combination of Named Entity Recognition and Sentiment Analysis.

Table of Contents

  1. Instructions

    a. Scrape

    b. Word Cloud

    c. Character Analysis

  2. c-TF-IDF

  3. Disney

  4. Sources

1. Instructions

Back to ToC

I recommend starting with the notebooks/Overview.ipynb notebook for a good introduction before going to the command line. Scraping multiple movies is preferred, as it allows you to use the class-based TF-IDF.

Alternatively, you can download or fork this repo and start with the instructions below.

1.a Scrape

To scrape a single movie (e.g., Aladdin), simply run from the command line:

python scraper.py --prefix aladdin --url https://www.imdb.com/title/tt0103639/reviews?ref_=tt_ov_rt

Make sure to use the URL of the review page of the movie you want to scrape. The --prefix value is the name used for saving the resulting .json file.
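The review URL contains a ? character, which some shells try to interpret, so quoting the URL is the safer choice. Below is a minimal sketch of building the same command from Python. The helper name build_scrape_command is hypothetical, and only the --prefix and --url flags shown above are assumed to exist.

```python
import shlex


def build_scrape_command(prefix, url):
    """Return the argument list for one scraper.py run (hypothetical helper)."""
    return ["python", "scraper.py", "--prefix", prefix, "--url", url]


command = build_scrape_command(
    "aladdin",
    "https://www.imdb.com/title/tt0103639/reviews?ref_=tt_ov_rt",
)

# shlex.join quotes the URL so the line can be pasted into a POSIX shell.
print(shlex.join(command))

# To run it from the repository root (needs network access):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list to subprocess.run avoids shell interpretation entirely, so no quoting is needed in that case.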

When you scrape a single movie, count data is extracted in addition to the reviews. To apply the class-based TF-IDF, I suggest following the instructions in notebooks/Overview.ipynb.

1.b Word Cloud

First, save the image that you want to use as a mask. The background must be white and the file must be saved as a .jpg.

Then, after scraping the reviews, run the following from the command line:

python scrape.py --path "data/aladdin_count.json" --mask your_mask.jpg --pixels 1200

data/aladdin_count.json is the file saved by the scraper, so its name depends on the prefix chosen at the scrape stage.
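To illustrate what this step involves, here is a minimal sketch using the third-party wordcloud, numpy, and Pillow packages. It is not the repository's own script: the function names are hypothetical, and the layout of the count file (a flat JSON object mapping each word to its count) is an assumption.

```python
import json


def load_counts(path):
    """Load a word -> count mapping (assumed layout of the *_count.json file)."""
    with open(path, encoding="utf-8") as fh:
        counts = json.load(fh)
    return {word: int(n) for word, n in counts.items()}


def render_word_cloud(counts, mask_path, out_path):
    """Draw the words inside the non-white area of the mask image."""
    # Third-party imports are kept local so load_counts works without them.
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    # wordcloud treats pure-white mask pixels as "do not draw here",
    # which is why the mask needs a white background.
    mask = np.array(Image.open(mask_path))
    cloud = WordCloud(background_color="white", mask=mask)
    cloud.generate_from_frequencies(counts)
    cloud.to_file(out_path)


# Example usage (file names are placeholders):
# counts = load_counts("data/aladdin_count.json")
# render_word_cloud(counts, "your_mask.jpg", "aladdin_cloud.png")
```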

The result is something like this:

1.c Character Analysis

We want to extract from the reviews which characters and actors are talked about most often. We start by using Named Entity Recognition to extract "Person" entities from the reviews. Then, for each sentence in which a "Person" entity is found, we apply sentiment analysis to understand the sentiment about that character. In other words, we extract frequently mentioned characters together with how positively those characters are regarded.

To do this, I use pre-trained BERT models for both Named Entity Recognition and Sentiment Analysis.
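The flow described above can be sketched as follows. This is an illustration of the aggregation logic only: the two toy functions are stand-ins for the pre-trained BERT models and are not part of the project, and all names in the block are hypothetical.

```python
from collections import defaultdict


def character_sentiment(sentences, find_persons, score_sentiment):
    """Aggregate mention counts and mean sentiment per person.

    find_persons(sentence) -> iterable of person names (the NER step)
    score_sentiment(sentence) -> float in [-1, 1] (the sentiment step)
    """
    scores = defaultdict(list)
    for sentence in sentences:
        persons = set(find_persons(sentence))
        if not persons:
            continue  # no "Person" entity, so nothing to score
        sentiment = score_sentiment(sentence)
        for person in persons:
            scores[person].append(sentiment)
    return {
        person: {"mentions": len(vals), "mean_sentiment": sum(vals) / len(vals)}
        for person, vals in scores.items()
    }


# Toy stand-ins, NOT the BERT models used by the project.
KNOWN_NAMES = {"Genie", "Jafar"}
POSITIVE = {"hilarious", "great"}
NEGATIVE = {"boring", "weak"}


def toy_find_persons(sentence):
    words = {w.strip(".,!?") for w in sentence.split()}
    return words & KNOWN_NAMES


def toy_score_sentiment(sentence):
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


reviews = [
    "Genie is hilarious.",
    "Jafar felt weak as a villain.",
    "The songs are great.",
    "Genie and Jafar share a great scene.",
]
result = character_sentiment(reviews, toy_find_persons, toy_score_sentiment)
print(result)
```

Keeping the NER and sentiment steps as plain callables means real models can be swapped in without changing the aggregation.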

After having scraped the review data, run the following from the command line:

python char.py --movie Aladdin --extract True --fast True --prefix disney --rpath disney_reviews.json

The result will be the following visualization:

2. Class-based TF-IDF

Back to ToC

This project uses a custom TF-IDF to find words that are interesting because of the differences between classes. In other words, a word is only important if it is mentioned often in one class but not in the other classes. I call it a class-based TF-IDF (c-TF-IDF):
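Written out from the description in this section, and assuming the usual TF-IDF logarithm on the second factor, the score of word t in class i can be sketched as:

```latex
\text{c-TF-IDF}_{t,i} = \frac{t_i}{w_i} \times \log\frac{m}{\sum_{j=1}^{n} t_j}
```

Here t_i is the frequency of word t in class i, w_i is the total number of words in class i, m is the total number of unjoined documents, and n (a symbol introduced for this sketch) is the number of classes.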

The above formula can best be explained as a TF-IDF formula adapted for multiple classes by joining all documents per class. Thus, each class is converted to a single document instead of a set of documents. Then, the frequency of each word t is extracted for each class i and divided by the total number of words w in that class.

Next, the total, unjoined, number of documents across all classes m is divided by the total frequency of word t across all classes.
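The two steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the function name c_tf_idf, the whitespace tokenization, and the natural logarithm on the second factor are assumptions.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_class):
    """Score words per class from {class_name: [document, ...]}."""
    # m: total number of unjoined documents across all classes.
    m = sum(len(docs) for docs in docs_per_class.values())

    # Join all documents of a class into one and count its words.
    counts = {
        name: Counter(" ".join(docs).lower().split())
        for name, docs in docs_per_class.items()
    }

    # Total frequency of each word across all classes.
    totals = Counter()
    for class_counts in counts.values():
        totals.update(class_counts)

    scores = {}
    for name, class_counts in counts.items():
        w = sum(class_counts.values())  # total number of words in this class
        scores[name] = {
            word: (t / w) * math.log(m / totals[word])
            for word, t in class_counts.items()
        }
    return scores


docs = {
    "aladdin": ["genie movie", "genie lamp"],
    "frozen": ["snow movie", "snow queen"],
}
scores = c_tf_idf(docs)
print(scores["aladdin"])
```

In this toy input, "genie" scores higher than "movie" for the aladdin class because "movie" is spread across both classes. Note that in this sketch a word that occurs more often than there are documents gets a negative score, since the logarithm's argument drops below 1.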

3. Disney

Back to ToC

Initially, I meant this project to be used only for Disney and Pixar movies (as I enjoy those very much), but eventually I generalized the code so that it can, in principle, be used for all movies.

I also analyzed the most popular characters (by relative frequency in reviews) and created the visualization below:

You will also find some Disney-specific snippets here and there. I purposefully did not remove them, as they contain manual fixes needed to get the visualizations working and the pipeline running. See notebooks/Overview.ipynb for more information on how to run that code.

4. Sources

Back to ToC

An overview of all sources used in this package (mainly images for masks). All rights to these images belong to Disney, Pixar, and Marvel.

Mask Images
Disney Popularity Images