zomato-restaurant-clustering-and-sentiment-analysis: A Jupyter Notebook repository from connect-midhunr

Link to deployed model: http://sentimenent-analysis-zomato-review.ap-south-1.elasticbeanstalk.com/

In this project, I analyze the metadata and reviews of popular restaurants in Hyderabad and build machine learning models that cluster the restaurants into segments based on cuisines and analyze the sentiment of customer reviews.

💾 Project Files Description

This project contains an executable IPython notebook, a presentation, and the source data, as follows:

Executable Files:

Zomato_Restaurant_Clustering_and_Sentiment_Analysis.ipynb - Google Colab notebook containing data summary, exploration, visualisations, feature engineering, text processing, modelling, performance evaluation and conclusion.

Documentation:

Presentation PDF - Unsupervised Machine Learning - Zomato Restaurant Clustering and Sentiment Analysis - Capstone Project.pdf - Presentation slideshow of the project.

Source Directory:

Data & Resources.zip - Includes metadata and review data of restaurants listed by Zomato in Hyderabad.

📖 Problem Statement

Zomato is an Indian restaurant aggregator and food delivery start-up founded by Deepinder Goyal and Pankaj Chaddah in 2008. Zomato provides information, menus, and user reviews of restaurants, and also offers food delivery from partner restaurants in select cities. The main objective is to understand the existing data and analyze its trends and patterns, so that two machine learning models can be built: one for segmenting restaurants and another for sentiment analysis of reviews.

📖 Approach

Understanding the business task.
Reading data from files given and providing a summary.
Data cleaning, which involves removing irregularities in the data.
Exploratory data analysis, to find trends and patterns in the restaurant metadata and reviews.
Feature engineering, to prepare data for modelling.
Text Processing, to convert text to numeric data for modelling.
Modelling data (for both clustering and sentiment analysis) and comparing the models to find the most suitable one for each task.
Conclusion.
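
As an illustration of the text processing step, here is a minimal sketch of cleaning and tokenizing a review using only the Python standard library. The stop-word list and the `clean_review` and `bag_of_words` helpers are hypothetical and are not taken from the notebook, which may use a different pipeline (for example NLTK or scikit-learn vectorizers).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the notebook's list).
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "were", "it", "to", "of", "in"}


def clean_review(text: str) -> list[str]:
    """Lowercase the text, strip non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def bag_of_words(tokens: list[str]) -> Counter:
    """Count how often each token appears (a simple numeric representation)."""
    return Counter(tokens)


tokens = clean_review("The biryani was AMAZING, and the service was quick!!")
counts = bag_of_words(tokens)
print(tokens)
print(counts)
```

Token counts like these are one simple way to turn text into numbers; weighted schemes such as TF-IDF follow the same idea.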

📖 Exploratory Data Analysis

The following insights were gained from EDA:

Collage - Hyatt Hyderabad Gachibowli is the most expensive restaurant, while Mohammedia Shawarma and Amul are the most affordable ones.

North Indian cuisine is the most popular cuisine.

Anvesh Chowdary is the most experienced reviewer while Satwinder Singh is the most popular one.

AB's - Absolute Barbecues is the highest rated restaurant.

Some linear relationship exists between the average rating of restaurants and the cost of food.
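
One way to check for such a linear relationship is the Pearson correlation coefficient. The sketch below computes it by hand; the `cost` and `rating` values are made up for illustration and are not figures from the Zomato dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: cost for two people and average rating per restaurant.
cost = [400, 600, 800, 1200, 1500, 2500]
rating = [3.4, 3.6, 3.5, 4.0, 3.9, 4.3]

r = pearson_r(cost, rating)
print(f"Pearson r = {r:.2f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none.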

📖 Modelling

🖨️ Restaurant Clustering Based on Cuisines

💹 Clusters by K Means Algorithm

💹 Clusters by DBSCAN Algorithm
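
Clustering on cuisines first requires turning each restaurant's cuisine list into numbers. The following is a minimal sketch using scikit-learn, assuming one binary column per cuisine and K-means with an arbitrarily chosen `n_clusters=3`. The restaurants are made up, and the notebook's actual encoding and cluster count may differ.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up cuisine lists, one per restaurant, for illustration only.
cuisines = [
    ["North Indian", "Mughlai"],
    ["North Indian", "Biryani"],
    ["Chinese", "Asian"],
    ["Asian", "Thai"],
    ["Desserts", "Bakery"],
    ["Bakery", "Cafe"],
]

# One binary column per cuisine: 1 if the restaurant serves it, else 0.
encoder = MultiLabelBinarizer()
X = encoder.fit_transform(cuisines)

# n_clusters=3 is an arbitrary choice for this toy data, not a tuned value.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

for cuisine_list, label in zip(cuisines, labels):
    print(label, cuisine_list)
```

The same encoded matrix `X` could be passed to DBSCAN instead, which does not need the number of clusters up front.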

🖨️ Restaurant Clustering Based on Cost and Rating

💹 Clusters by K Means Algorithm

💹 Clusters by DBSCAN Algorithm
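
Cost and rating sit on very different scales, so a distance-based method usually needs them standardized first. The sketch below runs DBSCAN on made-up (cost, rating) pairs; the `eps` and `min_samples` values are illustrative guesses, not the settings used in the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Made-up (cost for two, average rating) pairs, for illustration only.
X = np.array([
    [400, 3.5], [450, 3.6], [500, 3.4],     # budget group
    [1500, 4.1], [1600, 4.2], [1550, 4.0],  # premium group
    [2800, 2.0],                            # isolated point
])

# Standardize so that cost does not dominate the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# eps and min_samples are assumptions chosen for this toy data.
db = DBSCAN(eps=0.5, min_samples=2)
labels = db.fit_predict(X_scaled)

# DBSCAN marks points that belong to no cluster with the label -1.
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Unlike K-means, DBSCAN leaves isolated points unassigned, which is the outlier behaviour to weigh when choosing between the two.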

🖨️ Modelling for Sentiment Analysis

💹 Comparison of Models

💹 Performance of Model after Hyperparameter Tuning
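
A comparison of this kind can be sketched as fitting several classifiers on the same vectorized reviews and scoring each on held-out data. The reviews, labels, and hyperparameters below are made up for illustration, and TF-IDF is an assumed vectorizer; the notebook's actual features, models, and scores are not reproduced here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Made-up training reviews; 1 = positive, 0 = negative.
train_reviews = [
    "great food and friendly staff",
    "loved the biryani, amazing taste",
    "excellent ambience and quick service",
    "delicious starters, will visit again",
    "terrible service and cold food",
    "worst experience, very rude staff",
    "food was stale and overpriced",
    "bad taste and long waiting time",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Made-up held-out reviews used only for scoring.
test_reviews = ["amazing food and great service", "rude staff and cold food"]
test_labels = [1, 0]

# Each pipeline vectorizes the text with TF-IDF, then fits a classifier.
models = {
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": make_pipeline(
        TfidfVectorizer(), RandomForestClassifier(n_estimators=100, random_state=42)
    ),
}

scores = {}
for name, model in models.items():
    model.fit(train_reviews, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(test_reviews))

print(scores)
```

With a dataset this small the scores are not meaningful; on real data, a larger held-out set or cross-validation gives a fairer comparison.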

📖 Deployment

A web application, built with a combination of HTML, CSS, and JavaScript, demonstrates the trained machine learning model.

The sentiment of a review is predicted by the trained ML model, which is served via a Flask API.
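
A Flask endpoint that serves sentiment predictions might look like the sketch below. The `/predict` route, the `review` JSON field, and the `predict_sentiment` function are all hypothetical; in particular, `predict_sentiment` is a keyword-based stand-in for loading and calling the trained model, which this sketch does not do.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_sentiment(review: str) -> str:
    """Hypothetical stand-in for the trained model (simple keyword check)."""
    negative_words = {"bad", "terrible", "cold", "rude", "worst"}
    tokens = set(review.lower().split())
    return "negative" if tokens & negative_words else "positive"


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"review": "some text"}.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    review = payload.get("review", "")
    if not isinstance(review, str) or not review.strip():
        return jsonify({"error": "field 'review' is required"}), 400
    return jsonify({"review": review, "sentiment": predict_sentiment(review)})
```

Assuming the file is saved as `app.py`, it can be started locally with `flask --app app run` and called by sending a POST request with a JSON body to `/predict`.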

This web application is deployed on AWS Elastic Beanstalk using a CI/CD pipeline.

Link to deployed model: http://sentimenent-analysis-zomato-review.ap-south-1.elasticbeanstalk.com/

📘 Conclusion

The following conclusions were drawn from Modelling:

Either of the two models, trained using the K means algorithm or the DBSCAN algorithm, can be chosen for clustering the restaurant dataset based on cuisines, depending on the number of clusters preferred and whether outliers should be included.

The model built using K means algorithm is selected for clustering based on cost and ratings.

For sentiment analysis, the model built using the random forest algorithm was chosen over the others.

If model interpretability is more important than accuracy, the model built using logistic regression should be chosen. Since the accuracies of these two models differ by less than 7%, the results should not differ much.