Using Amazon review data to predict whether a review is negative or positive
- If rating <= 3, the review is negative (label 0)
- If rating >= 4, the review is positive (label 1)
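A minimal sketch of this labeling rule in pandas (the DataFrame here is a toy stand-in for the real data):

```python
import pandas as pd

# Toy frame standing in for the real dataset; "overall" is the 1-5 star rating
df = pd.DataFrame({"overall": [1, 3, 4, 5]})

# Ratings <= 3 -> 0 (negative), ratings >= 4 -> 1 (positive)
df["target"] = (df["overall"] >= 4).astype(int)
print(df)
```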
This part uses the users' reviews to predict the rating scores; the label (target) here is the rating score from 1 to 5, making this a multi-class classification problem.
- helpful - helpfulness rating of the review, stored as a pair such as [2, 3] (i.e. 2/3), where 2 is the numerator and 3 is the denominator. Numerator: number of readers who found the review helpful. Denominator: number of readers who indicated whether or not they found the review helpful.
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
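A small sketch of loading these fields with pandas, assuming the reviews are stored as line-delimited JSON (the file name is a placeholder):

```python
import pandas as pd

# Each line of the file is one JSON review record (the path is a placeholder)
df = pd.read_json("reviews_Sports_and_Outdoors.json", lines=True)

# Inspect the fields described above
print(df[["helpful", "reviewText", "overall", "summary", "unixReviewTime"]].head())
```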
- Bigrams show that the top 5 words for positive reviews are all positive, whereas the top 5 words for negative reviews are either neutral or negative (see the bigram-extraction sketch after this list).
- The review-counts graph shows that the numbers of positive and negative reviews are imbalanced, therefore an oversampling method was applied.
- On the other hand, we can see that the mean length of negative reviews is greater than that of positive reviews. That makes sense: when a person complains about a product they do not like, they tend to write a lot.
- In 2013 Amazon sold over 200 million products in the USA, categorised into 35 departments, with almost 20 million products in Sports & Outdoors.
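A minimal sketch of extracting top bigrams with scikit-learn (the helper name `top_bigrams` is ours; `df`, `combined_text`, and `target` refer to the loading sketch above and the preprocessing steps below):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_bigrams(texts, n=5):
    """Return the n most frequent bigrams in a collection of texts."""
    vec = CountVectorizer(ngram_range=(2, 2), stop_words="english")
    counts = vec.fit_transform(texts).sum(axis=0).A1
    vocab = vec.get_feature_names_out()
    return sorted(zip(vocab, counts), key=lambda t: -t[1])[:n]

# Compare the most frequent bigrams in each class
print(top_bigrams(df.loc[df["target"] == 1, "combined_text"]))  # positive reviews
print(top_bigrams(df.loc[df["target"] == 0, "combined_text"]))  # negative reviews
```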
1. Combined the summary column and review column as "combined_text"
2. Created a target column based on the rating column
3. Created two columns: helpfulness_numerator and helpfulness_denominator
4. Tokenization, punctuation removal, stemming, lemmatization
5. After step 4, created a review_len column, which is the length of the review text
6. Data resampling (oversampling method); a condensed code sketch of these steps follows
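A condensed sketch of steps 1-6 using pandas, NLTK, and imbalanced-learn; `df` is the frame from the loading sketch above, and any column name not listed in the data description is an assumption:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from imblearn.over_sampling import RandomOverSampler

nltk.download("punkt")
nltk.download("wordnet")

# Step 1: combine summary and review text
df["combined_text"] = df["summary"].fillna("") + " " + df["reviewText"].fillna("")

# Step 2: binary target from the rating (see the rule above)
df["target"] = (df["overall"] >= 4).astype(int)

# Step 3: split the helpful pair, e.g. [2, 3] -> 2 and 3
df["helpfulness_numerator"] = df["helpful"].str[0]
df["helpfulness_denominator"] = df["helpful"].str[1]

# Step 4: tokenize, drop punctuation, stem, lemmatize
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

def clean(text):
    tokens = [t for t in word_tokenize(text.lower()) if t.isalpha()]
    return " ".join(lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens)

df["combined_text"] = df["combined_text"].apply(clean)

# Step 5: review length after cleaning
df["review_len"] = df["combined_text"].str.len()

# Step 6: oversample the minority class to balance the labels
X, y = df.drop(columns="target"), df["target"]
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)
```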
There are two parts at the first level: text-data analysis and non-text-data analysis. For the text data, a Naive Bayes classifier, a neural network, and logistic regression were applied, and each model's predictions were obtained (on both train and test data). For the non-text data, random forest, a neural network, and logistic regression were applied to generate predictions. At the second level, all six train-data predictions and all six test-data predictions were combined as new features, forming a new x_train and a new x_test. XGBoost and a neural network were then applied to predict again. The result shows that model stacking gives a slightly higher F1 score, from 0.9059 to 0.9063. Even though model stacking may deliver a better result, it is difficult to interpret (a code sketch of the stacking follows).
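A minimal sketch of this two-level stacking, assuming `X_text` (vectorized text features), `X_num` (non-text features), and `y` already exist; out-of-fold predictions stand in for the train-data predictions described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Level 1: three models on text features, three on non-text features
text_models = [MultinomialNB(), MLPClassifier(max_iter=300), LogisticRegression(max_iter=1000)]
num_models = [RandomForestClassifier(), MLPClassifier(max_iter=300), LogisticRegression(max_iter=1000)]

def level1_features(models, X, y):
    # Out-of-fold predictions so the meta-model never sees leaked labels
    return np.column_stack([cross_val_predict(m, X, y, cv=5) for m in models])

# Six prediction columns become the new feature matrix
meta_X = np.hstack([level1_features(text_models, X_text, y),
                    level1_features(num_models, X_num, y)])

# Level 2: XGBoost on the stacked predictions
X_tr, X_te, y_tr, y_te = train_test_split(meta_X, y, test_size=0.2, random_state=42)
meta = XGBClassifier(eval_metric="logloss").fit(X_tr, y_tr)
print("stacked F1:", f1_score(y_te, meta.predict(X_te)))
```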
Similarly, this part contains two stages: in the first modeling stage, a Naive Bayes classifier, a neural network, and random forest were applied; the second stage is model stacking with an XGBoost model and a neural network. In the first stage, the F1 scores of the Naive Bayes classifier, the neural network, and random forest are 0.6067, 0.6085, and 0.6186 respectively, so random forest clearly performed best with an F1 score of 0.6186. In the second stage, after model stacking, the F1 scores of XGBoost and the neural network are 0.5893 and 0.6189. After model stacking the best F1 score increased slightly, by 0.0003.
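For the multi-class case the same stacking idea applies, except each level-1 model can contribute five class-probability columns instead of one hard prediction. A minimal sketch with one base model, assuming `X` is a feature matrix and `y5` holds the star labels encoded 0-4; the macro averaging for F1 is our assumption, since the report does not state which average was used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split
from xgboost import XGBClassifier

# Out-of-fold class probabilities: five columns per base model instead of one label
proba = cross_val_predict(RandomForestClassifier(), X, y5, cv=5, method="predict_proba")

# Level 2: XGBoost on the stacked probabilities (multi-class is detected automatically)
X_tr, X_te, y_tr, y_te = train_test_split(proba, y5, test_size=0.2, random_state=42)
meta = XGBClassifier().fit(X_tr, y_tr)
print("macro F1:", f1_score(y_te, meta.predict(X_te), average="macro"))
```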