NLP-Yelp

This project uses TF-IDF analysis and cosine similarity to perform NLP processing of Yelp and OpenTable reviews. Its goal is to determine how similar 'gilded' reviews (Yelp Elite reviews and VIP OpenTable reviews) are to their normal counterparts.

Primary Language: Python

Text Similarity Analysis

Authors: Dr. Ann Kronrod, Dr. Bart Yakov, Marty Vo


Implementation Details

The cosine_similarity.py script works with the dataset provided by Yelp to calculate the textual similarity between reviews. It parses and lemmatizes the reviews in the sample.txt file and then performs a TF-IDF analysis on them. TF-IDF stands for Term Frequency (TF) - Inverse Document Frequency (IDF): a term's weight grows with how often it appears in a review and shrinks with how many reviews contain it. We then use the weighted values of the terms to compare reviews with cosine similarity. As we continue to analyze larger data sets, modifications will be added to the script as necessary.

This script parses the sample review data and lemmatizes the individual reviews by creating tokens for each term. Stopwords, such as "to" and "the", are removed from the lemmatized list of reviews. Punctuation is also removed, as it is not useful in determining the relevance of a term in a document. Parsing and lemmatization are handled by the nltk library, which can be found here.
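The cleaning steps above can be sketched with the standard library alone. This is only an illustration, not the script's code: the stopword list is a tiny made-up sample, `preprocess` is a hypothetical helper name, and the `lemmatize` argument is a placeholder hook (identity by default) where a real lemmatizer such as nltk's could be plugged in.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the script itself relies on nltk for this step.
STOPWORDS = {"to", "the", "a", "an", "and", "is", "was", "of", "in", "it"}


def preprocess(review, lemmatize=lambda token: token):
    """Lowercase a review, strip punctuation, drop stopwords, lemmatize tokens.

    `lemmatize` is a placeholder hook; by default it returns tokens unchanged.
    """
    # Replace every punctuation character with a space so words stay separated.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = review.lower().translate(table).split()
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]


tokens = preprocess("The pasta was amazing, and the service was quick!")
print(tokens)  # ['pasta', 'amazing', 'service', 'quick']
```

Keeping the lemmatizer as a parameter means the same cleaning code can be tried with or without nltk installed.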

The term frequency, inverse document frequency, cosine similarity, and all intermediate matrix transformations are calculated using the scikit-learn library. You can find more information about the calculations and methods here.
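As a rough sketch of how these calculations can be done with scikit-learn, the snippet below builds a TF-IDF matrix with `TfidfVectorizer` and compares its rows with `cosine_similarity`. The three reviews are made up for illustration and the default vectorizer settings are an assumption; the actual script may configure these steps differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up reviews for illustration; not taken from the Yelp dataset.
reviews = [
    "great pasta and friendly service",
    "friendly service and great pasta",
    "parking lot was full",
]

# Each row of the sparse matrix is one review; each column is one term,
# weighted by TF-IDF.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(reviews)

# Pairwise cosine similarity: entry [i, j] compares review i with review j.
similarity = cosine_similarity(tfidf)
print(similarity.round(3))
```

The first two reviews contain the same words in a different order, so their similarity is 1, while the third shares no terms with them and scores 0. Word order does not affect the result because TF-IDF treats each review as a bag of words.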

This script writes the resulting cosine similarity values to an output file. Three methods were used to decide which reviews to compare against each other. The first method analyzed a focal review against the ten previous chronological reviews. The second method analyzed a focal review against the ten previous chronological reviews with the same number of stars. The third method analyzed a focal review against the ten previous chronological reviews with the same number of accolades.
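The three selection methods above differ only in which earlier reviews are eligible, so they can be sketched with one helper. This is an assumption-laden illustration: `previous_reviews` and the field names `id`, `stars`, and `accolades` are hypothetical and may not match the script or the dataset.

```python
def previous_reviews(reviews, focal_index, same_key=None, window=10):
    """Return up to `window` reviews that precede the focal review.

    `reviews` must be sorted oldest to newest. If `same_key` is given
    (for example "stars"), only earlier reviews whose value for that key
    matches the focal review are kept. `window` should be positive.
    """
    focal = reviews[focal_index]
    earlier = reviews[:focal_index]
    if same_key is not None:
        earlier = [r for r in earlier if r[same_key] == focal[same_key]]
    return earlier[-window:]  # the most recent `window` eligible reviews


# Made-up data: 15 reviews in chronological order with alternating star ratings.
reviews = [
    {"id": i, "stars": 5 if i % 2 == 0 else 3, "accolades": i % 3}
    for i in range(15)
]

any_previous = previous_reviews(reviews, 14)                         # method 1
same_stars = previous_reviews(reviews, 14, same_key="stars")         # method 2
same_accolades = previous_reviews(reviews, 14, same_key="accolades")  # method 3

print([r["id"] for r in any_previous])
print([r["id"] for r in same_stars])
print([r["id"] for r in same_accolades])
```

When fewer than ten earlier reviews match, the helper returns all of them; how the script handles that case is not stated here.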

Support for OpenTable reviews can be found in the open_table.py script, which analyzes reviews from OpenTable in a similar fashion.


Research

Research regarding methods to determine text similarity can be found here.


Citation

The basis for this script was taken from the article below:

Other resources used to create this script: