This project was completed in fulfillment of the Second Capstone Project requirement of the Springboard Data Science Career Track Bootcamp. The work on the project was mentored by Alex Chao.
The project objective is to build a hybrid recommendation engine that mixes recommendations from content-based filtering, collaborative filtering, and the user's friends network.
Recommendation systems have become an integral part of many businesses. They produce individualized recommendations as output or have the effect of guiding the user in a personalized way to interesting objects in a larger space of possible options.
In this project, we want to apply machine learning algorithms to develop predictive models and thereby build a hybrid restaurant recommendation system that suggests the most suitable restaurants for users based on their preferences. We also plan to add recommendations from each user's friends for additional personalization.
This project uses the Yelp dataset available at https://www.yelp.com/dataset
The dataset contains:
- 4,700,000 reviews on 156,000 businesses in 12 metropolitan areas
- 1,000,000 tips by 1,100,000 users
- Over 1.2 million business attributes, such as hours, parking, availability, and ambience
The data files are supplied in two flavours: JSON and SQL (MySQL, Postgres). This project utilizes the JSON version, which consists of the following files:
- business.json: Contains business data including location data, attributes, and categories.
- review.json: Contains full review text data including the user_id that wrote the review and the business_id the review is written for.
- user.json: User data including the user's friend mapping and all the metadata associated with the user.
- checkin.json: Checkins on a business.
- tip.json: Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.
- photos: From the photos auxiliary file; formatted as a JSON list of objects.
Each file is composed of a single object type, with one JSON object per line. The documentation is available at https://www.yelp.com/dataset/documentation/json
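Since each file stores one JSON object per line rather than a single JSON document, it must be parsed line by line. A minimal sketch using only the standard library (the function name is our own, not from the Yelp tooling):

```python
import json

def load_json_lines(path):
    """Read a Yelp-style file containing one JSON object per line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                records.append(json.loads(line))
    return records
```

For the full dataset, a generator (`yield json.loads(line)`) would avoid holding every record in memory at once.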
As the focus of this project is on building a recommendation engine, the core files to be used are business.json, review.json, and user.json. Additionally, taking into consideration the limited hardware resources available, only data for the city of Toronto will be considered for building the recommendation engine.
Both content-based filtering and collaborative filtering have their strengths and weaknesses. We plan to mix both recommendation methods, in addition to friends' ratings, to provide a better personalized recommendation system.
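To illustrate the collaborative-filtering side, a user's unknown rating for a restaurant can be predicted as a similarity-weighted average of the ratings given by other users. A minimal user-based sketch in pure Python (all names and the toy data shapes are our own illustration, not the project's final implementation):

```python
from math import sqrt

def rating_cosine(u, v):
    """Cosine similarity between two sparse rating dicts {restaurant: rating}."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(r * r for r in u.values())) * sqrt(sum(r * r for r in v.values()))
    return num / den if den else 0.0

def predict_rating(ratings, target_user, restaurant):
    """Predict target_user's rating of a restaurant as a similarity-weighted
    average over the other users who rated it."""
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == target_user or restaurant not in their_ratings:
            continue
        sim = rating_cosine(ratings[target_user], their_ratings)
        num += sim * their_ratings[restaurant]
        den += abs(sim)
    return num / den if den else None  # None: no overlapping raters
```

In the actual project, Spark's ALS-style matrix factorization would scale far better than this pairwise approach, but the weighted-average idea is the same.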
Yelp users give ratings and write reviews about businesses and services on Yelp. These reviews and ratings help other Yelp users to evaluate a business or a service and make a choice. While ratings are useful to convey the overall experience, they do not convey the context which led a reviewer to that experience. We plan to apply various Natural Language Processing (NLP) and Text Analytics techniques on the review text to build features for the content filtering recommendations. The ratings will be used in the collaborative filtering recommendations.
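One standard way to turn review text into content features is TF-IDF weighting: terms frequent in one restaurant's reviews but rare across all restaurants get high weight, and restaurants with similar weighted vocabularies score as similar. A pure-Python sketch of this idea (the project itself would more likely use a library implementation such as scikit-learn or Spark ML):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a TF-IDF vector (dict term -> weight) for each token list in docs."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def text_cosine(u, v):
    """Cosine similarity between two sparse TF-IDF dicts."""
    num = sum(u[t] * v[t] for t in set(u) & set(v))
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0
```

Given a user's liked restaurants, the content-based recommender would then rank unseen restaurants by their similarity to those the user already rated highly.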
Users’ reviews of restaurants receive tags from other users marking the review as useful, cool, or funny. Additionally, the user data includes each user's list of friends. These two facts will be combined for the friends' recommendations: for each user, we obtain a list of restaurants the user has not reviewed before, based on the ratings of the friends whose reviews received the most useful tags.
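The friends-based component described above can be sketched as follows: score each restaurant a user has not yet reviewed by the friends' ratings, weighting each friend by the number of "useful" tags they have received (function name, data shapes, and the `1 + useful` weighting are our own illustrative choices):

```python
def friend_recommendations(user, friends, ratings, useful_counts, top_n=3):
    """Rank restaurants the user has not rated, scored by friends' ratings
    weighted by each friend's accumulated 'useful' tag count."""
    seen = set(ratings.get(user, {}))
    scores, weights = {}, {}
    for friend in friends.get(user, []):
        w = 1 + useful_counts.get(friend, 0)   # +1 so untagged friends still count
        for biz, r in ratings.get(friend, {}).items():
            if biz in seen:
                continue
            scores[biz] = scores.get(biz, 0.0) + w * r
            weights[biz] = weights.get(biz, 0.0) + w
    ranked = sorted(((scores[b] / weights[b], b) for b in scores), reverse=True)
    return [b for _, b in ranked[:top_n]]
```

The weighting scheme is a design choice: using useful-tag counts as weights lets the most trusted reviewers among a user's friends dominate the ranking.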
A hybrid recommendation engine will be built to mix the results of the three recommendation methods above. Moreover, keyword-search recommendations will also be developed, based on the featurization built in the content-filtering part, to accommodate the cold-start problem, i.e. new users with no history in the system.
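The simplest way to mix the three components is a weighted blend of their per-restaurant scores, renormalizing the weights when a component has no score for a given restaurant. A sketch under that assumption (the weights and function name are illustrative placeholders; the final blend would be tuned on held-out data):

```python
def hybrid_scores(content, collaborative, friends, weights=(0.4, 0.4, 0.2)):
    """Blend three per-restaurant score dicts (each on a common scale,
    e.g. 1-5 stars) into one ranking. Missing scores are skipped and
    the weights renormalized per restaurant."""
    all_biz = set(content) | set(collaborative) | set(friends)
    blended = {}
    for b in all_biz:
        num = den = 0.0
        for score_map, w in zip((content, collaborative, friends), weights):
            if b in score_map:
                num += w * score_map[b]
                den += w
        blended[b] = num / den
    return blended
```

For a cold-start user, the content and collaborative dicts would simply be empty, so the blend gracefully falls back to whichever signals (e.g. keyword search) are available.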
Apache Drill (https://drill.apache.org/), a SQL query platform that provides a query interface to most non-relational datastores, will be used for preliminary analysis of the JSON files, and then to extract the required data into Apache Parquet files, a columnar storage format.
Apache Spark (https://spark.apache.org/) is a fast, general-purpose engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. The Parquet files generated by Apache Drill will be loaded into Spark DataFrames for further analysis and machine learning modeling.
The figure below describes the methodology, data processing pipeline, and technologies that will be utilized in this hybrid recommendation system.