Welcome to my brief Yelp dataset project.

The goal of this project is to perform some data analyses on the Yelp dataset, available at https://www.yelp.com/dataset.

A short description of each script follows.

filtering_database.py

Uses command line arguments via argparse.
Currently outputs a CSV of business_id, name, address, city, state, rating, review_count, and business attributes.
These fields can be changed by editing the "desiredFields" variable.
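
As a rough illustration of the flow described above, here is a minimal sketch, not the script itself. The --city and --state flags and the helper names are hypothetical, and the field list assumes the Yelp business records store the rating under the key "stars".

```python
import argparse
import csv
import io
import json

# Named after the README's "desiredFields" variable. The exact contents are
# an assumption: the rating is assumed to be stored under the key "stars".
desiredFields = ["business_id", "name", "address", "city", "state",
                 "stars", "review_count", "attributes"]


def parse_args(argv=None):
    # Hypothetical flags, for illustration only.
    parser = argparse.ArgumentParser(description="Filter Yelp businesses")
    parser.add_argument("--city", help="keep only businesses in this city")
    parser.add_argument("--state", help="keep only businesses in this state")
    return parser.parse_args(argv)


def filter_businesses(lines, args):
    """Yield the desired fields of each JSON record that matches the filters."""
    for line in lines:
        business = json.loads(line)
        if args.city and business.get("city") != args.city:
            continue
        if args.state and business.get("state") != args.state:
            continue
        # Missing fields become None so every row has the same columns.
        yield {field: business.get(field) for field in desiredFields}


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=desiredFields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Changing the output columns then only requires editing the list at the top.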

tabulate_attributes.py

Uses command line arguments the same way as filtering_database.py to query the data.
Then creates columns out of the attributes for use in a pivot table.
Outputs only the name, city, and attributes.
The attributes dictionaries are unpacked to generate purely tabular data.
Python truthiness is used to convert the True/False into 1/0.
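
The unpacking and truthiness conversion might look like the sketch below. The function name and the "parent_child" column naming are hypothetical, and the sketch assumes the attribute values have already been parsed into Python booleans and dictionaries. If they are still strings, a value such as "False" would be truthy and come out as 1, and any non-boolean attribute would collapse to 1/0 as well.

```python
def flatten_business(business):
    """Turn one business record into a flat row of name, city, and 1/0 attributes."""
    row = {"name": business.get("name"), "city": business.get("city")}
    # Some records may have no attributes at all; treat that as empty.
    attributes = business.get("attributes") or {}
    for key, value in attributes.items():
        if isinstance(value, dict):
            # Nested dictionaries become one column per inner key.
            for sub_key, sub_value in value.items():
                row[f"{key}_{sub_key}"] = int(bool(sub_value))
        else:
            # Python truthiness maps True/False onto 1/0.
            row[key] = int(bool(value))
    return row
```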

calculate_sentiment.py

Reads the reviews dataset. It is too large to fit in my RAM, so I read items in batches.
Assigns a sentiment score to every single review in the dataset.
For the sentiment analysis, I use both VADER and textblob.
Data cleaning consists of lowercasing, punctuation removal, and stopword removal.
Note: there could be some value in not cleaning the text, which I have not tested extensively. On the whole, VADER performs well with punctuation and emojis. In addition, some textblob sentiment calculations looked more accurate on inspection when no cleaning was done.
The scores from VADER and textblob are combined by averaging.
In order to compare with the standard Yelp 5-star rating, the average is normalized to the interval [0,5] to give a composite rating score.
Note: I also decided not to take into account the subjectivity score from textblob or other aspects of the review (such as the review stars and reactions).
Future additions include taking these extra factors into account. The subjectivity could be used to weight the combination of VADER and textblob.
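
The batching, cleaning, and normalization steps described above can be sketched with the standard library alone. This is an illustration under assumptions, not the script's actual code: the helper names and the tiny stopword list are hypothetical, the input is assumed to hold one JSON review per line, and both sentiment scores are assumed to lie in [-1, 1] (as VADER's compound score and textblob's polarity do) and to be mapped linearly onto [0,5].

```python
import json
import string
from itertools import islice

# Tiny illustrative stopword list; a real run would likely use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "it", "to", "of"}


def iter_batches(lines, batch_size):
    """Yield lists of parsed reviews, at most batch_size at a time."""
    iterator = iter(lines)
    while True:
        batch = [json.loads(line) for line in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


def clean_text(text):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def composite_rating(vader_score, textblob_score):
    """Average two scores in [-1, 1] and rescale the result to [0, 5]."""
    average = (vader_score + textblob_score) / 2
    return (average + 1) / 2 * 5
```

Under this mapping a neutral average of 0 corresponds to 2.5 stars. Because iter_batches accepts any iterable of lines, an open file handle can be passed in directly, so only one batch is held in memory at a time.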

sentiment_divergence.py

Takes the composite sentiment and groups by business_id, and computes an aggregate business sentiment across all reviews by taking a simple average.
Note: The review stars and reactions could be used to weight the contribution of each individual review to the aggregate sentiment.
Joins the business data from the business dataset with the aggregate sentiment on the business_id and outputs the sentiment divergence, the difference between the actual star rating and the star-rating-normalized aggregate sentiment.
One might imagine then making a cut on absolute divergence to determine which businesses need further review.
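
A minimal sketch of the aggregation, join, and cut follows, again using only the standard library. The function names, the "composite" key, and the threshold are hypothetical, and the business rating is assumed to be stored under "stars". Divergence is taken here as the star rating minus the aggregate sentiment.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sentiment(reviews):
    """Average the composite score of all reviews for each business_id."""
    by_business = defaultdict(list)
    for review in reviews:
        by_business[review["business_id"]].append(review["composite"])
    return {bid: mean(scores) for bid, scores in by_business.items()}


def sentiment_divergence(businesses, aggregate):
    """Join on business_id and return star rating minus aggregate sentiment."""
    return {
        b["business_id"]: b["stars"] - aggregate[b["business_id"]]
        for b in businesses
        if b["business_id"] in aggregate  # skip businesses with no reviews
    }


def flag_for_review(divergence, threshold):
    """Return the business_ids whose absolute divergence meets the threshold."""
    return sorted(bid for bid, d in divergence.items() if abs(d) >= threshold)
```

A positive divergence means the star rating is higher than the review text suggests, and a negative one means the opposite.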