The project is Web Scraping and Sentiment Analysis For Hotels Reviews of www.yelp.com. This project have 2 sub parts in which work was carried out and learning was done accordingly:
WEB SCRAPING: This part deals with the creation of an autonomous system that can scrap the large amount of data that is Reviews in this case. The system uses some automation tools for automating the tasks that are required without the intervention of humans. Without this program it would take days to save the data collected by humans and doing this process.
SENTIMENT ANALYSIS This part deals with the analysis of the reviews collected by the user and classify them as positive or a negative sentiment. This is generally done to check the sentiment of a product by the terms of feedback provided by the user. In this system a model is created that uses some classification algorithm and the accuracy of prediction is done. This Analysis helps the companies to test the long set of data about their product and can then check whether its going good or bad.
By some estimations it is calculated that 80% of the data is unstructured and present in the form of text, surveys, emails etc. These Data are generated each day and are very difficult to analyse , understand and to process such a huge data it is very expensive and time consuming. Understanding customer opinions and feedback is a very crucial part to grow any business since by analysing their feedback it is very easy to get back to customers with a proper service and needs that are required by them. So sentiment analysis is the process or a machine learning technique used to detect the polarity from the text data into positive or negative. it can be a best way to analyse huge data of any document or set of documents. it helps businesses to have a meaning out of that textual feedback which could help them in knowing more about the product and services. The title of the project is Web Scraping and Sentiment analysis for hotels of www.yelp.com. yelp.com is a website which publishes reviews about different businesses such as spa, restaurants etc. The project deals with creating a machine learning model which predicts the user reviews as positive or negative since we know that in any business it is a very crucial part to track their feedback of the product they serve to users by the user’s review and by this product they will be able to check the sentiment of user towards their product by means of reviews in form of text by using some text analysis techniques. Workflow of the project: choosing of training dataset from kaggle scraping data from yelp.com importing datasets data preparation and cleaning bag of words and count vectorization (stemming and stopwords removal ) calculation of tf-idf values datasets splitting training classification models using training set checking accuracy for different model
The Project involved the above tasks that needed to be done for the sentiment analysis. These steps had a lot of work in each and the explanation of all task carried are as follows:
Data collection : This is the first Task of the project which included searching of a perfect dataset from kaggle used for training the model this dataset included a lot of meta data collected together apart of this the data was collected from www.yelp.com by the data scraping program which uses selenium web driver. This data was for the testing purpose of the model.
Text Preparation : This Task included the preparation of the data which needs to be done . the following were the processes involved in this task: removing unused columns filling the null values concatenating the dataframes removing punctuations removing emojis removing stopwords tokenization stemming calculating tf-idf values
Sentiment Detection : This task deals with the creation for labels according to the star rating given by the user on their feedback. This uses a comparison for creating labels for the sentiment where 1 is for positive and -1 for negative sentiment .
Sentiment Classification : This task involved creating a machine learning model which uses some classification algorithm in their background. They classify them by the features matrix feeded to them which contains the numerical conversion of documents that contain all the textual feedback. In this task I have used 3 different models which I compare at the end to check which one is better. The models used are Support vector machine, logistic regression and naives bayes.
Presentation of output : The output presentation has been done by plotting some charts to show data such as pie charts and the predicted result have been shown by plotting confusion matrix for each model moreover a classification report and accuracy score is printed for each model.