Performed Exploratory Data Analysis, Data Cleaning, Data Visualization, and Text Featurization (BoW, TF-IDF, Word2Vec). Built several ML models such as KNN, Naive Bayes, Logistic Regression, SVM, Random Forest, etc.
Given a text review, determine the sentiment of the review, i.e., whether it is positive or negative.
Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews
Notebooks in a more readable form on Jupyter Nbviewer: https://nbviewer.jupyter.org/github/cyanamous/Amazon-Food-Reviews-Analysis-and-Modelling/tree/master/
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10
Attribute Information:
- Id
- ProductId - unique identifier for the product
- UserId - unique identifier for the user
- ProfileName
- HelpfulnessNumerator - number of users who found the review helpful
- HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
- Score - rating between 1 and 5
- Time - timestamp for the review
- Summary - brief summary of the review
- Text - text of the review
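Before any of the steps below, the raw data has to be loaded and labelled. A minimal sketch in pandas, assuming the Kaggle download's Reviews.csv and the common convention of treating scores above 3 as positive, below 3 as negative, and dropping neutral 3-star reviews; the deduplication keys are likewise assumptions, not taken verbatim from the notebooks:

```python
import pandas as pd

# Load the raw data (Reviews.csv as shipped in the Kaggle download)
df = pd.read_csv("Reviews.csv")

# Drop neutral 3-star reviews, then binarize Score into a sentiment label
# (assumed convention: Score > 3 -> positive, Score < 3 -> negative)
df = df[df["Score"] != 3]
df["Sentiment"] = (df["Score"] > 3).map({True: "positive", False: "negative"})

# Drop duplicate reviews (the same user posting the same text at the same
# time under several ProductIds) and rows with inconsistent helpfulness counts
df = df.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"])
df = df[df["HelpfulnessNumerator"] <= df["HelpfulnessDenominator"]]
```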
- Defined Problem Statement
- Performed Exploratory Data Analysis (EDA) on the Amazon Fine Food Reviews dataset; plotted word clouds, distplots, histograms, etc.
- Performed data cleaning & preprocessing by removing unnecessary and duplicate rows; for the review text, removed HTML tags, punctuation, and stopwords, and stemmed the words using the Porter Stemmer (a minimal sketch of this cleaning pipeline follows this list)
- Documented the concepts clearly
- Plotted t-SNE plots for different featurizations of the data, viz. BoW (uni-gram, bi-gram), TF-IDF, Avg-Word2Vec (using a Word2Vec model pretrained on Google News), and TF-IDF-weighted Word2Vec
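A minimal sketch of the cleaning pipeline from the list above (HTML-tag and punctuation removal, stopword removal, Porter stemming) together with the Avg-Word2Vec featurization; the regexes, the NLTK stopword list, and the gensim loading call are assumptions about the implementation:

```python
import re
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_review(text: str) -> str:
    """Strip HTML tags and punctuation, drop stopwords, stem the rest."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^a-zA-Z]+", " ", text)   # keep letters only
    return " ".join(stemmer.stem(w) for w in text.lower().split()
                    if w not in STOPWORDS and len(w) > 1)

# Avg-Word2Vec: a review's vector is the mean of its words' vectors.
# Loading the pretrained Google News vectors with gensim would look like:
#   from gensim.models import KeyedVectors
#   w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
def avg_word2vec(review: str, w2v, dim: int = 300) -> np.ndarray:
    vecs = [w2v[word] for word in review.split() if word in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(clean_review("<br/>This tea is GREAT -- not bitter at all!"))  # -> "tea great bitter"
```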
- Applied K-Nearest Neighbours on different featurizations of the data, viz. BoW (uni-gram, bi-gram), TF-IDF, Avg-Word2Vec (using a Word2Vec model pretrained on Google News), and TF-IDF-weighted Word2Vec
- Used both the brute-force & kd-tree implementations of KNN (see the sketch after this list)
- Evaluated the models on the test data with various performance metrics such as accuracy, F1-score, precision, and recall; also plotted the confusion matrix using seaborn
- The best accuracy of 85.107% was achieved with the Avg-Word2Vec featurization
- The kd-tree and brute-force implementations of KNN give similar results
- KNN is a very slow algorithm compared to the others; it takes a lot of time, since as a lazy learner most of its cost is paid at query time
- KNN did not fare well in terms of precision and F1-score; overall, KNN was not a good fit for this dataset
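A minimal sketch of the KNN comparison with scikit-learn, shown here on bi-gram BoW; the toy reviews, k, and vectorizer settings are illustrative stand-ins, not the notebooks' actual values:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

texts = ["great tasty product love it", "awful stale waste of money",
         "best coffee ever", "terrible flavor never again",
         "delicious snack would buy again", "disgusting smell threw it away",
         "good value and tasty", "bad quality very disappointed"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # toy stand-ins for the cleaned corpus

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

for algorithm in ("brute", "kd_tree"):
    # kd-trees need dense arrays; BoW matrices come out sparse by default
    dense = algorithm == "kd_tree"
    knn = KNeighborsClassifier(n_neighbors=3, algorithm=algorithm)
    knn.fit(X_tr.toarray() if dense else X_tr, y_tr)
    pred = knn.predict(X_te.toarray() if dense else X_te)
    print(algorithm, accuracy_score(y_te, pred), f1_score(y_te, pred, zero_division=0))
```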
- Applied Naive Bayes using BernoulliNB and MultinomialNB on different featurizations of the data, viz. BoW (uni-gram, bi-gram), TF-IDF, Avg-Word2Vec (using a Word2Vec model pretrained on Google News), and TF-IDF-weighted Word2Vec (a minimal sketch follows this list)
- Evaluated the models on the test data with various performance metrics such as accuracy, F1-score, precision, and recall; also plotted the confusion matrix using seaborn
- Printed the top 25 most important features for both negative and positive reviews
- The best thing about Naive Bayes is that it is much quicker than the other algorithms, with amazingly fast training times
- The best model was the bi-gram featurization, with an accuracy of 89.53% and a precision of 0.594
- Multinomial Naive Bayes does not work with negative feature values
- Naive Bayes fails badly with the Word2Vec and TF-IDF-weighted Word2Vec featurizations, as Word2Vec dimensions are strongly dependent on one another, while Naive Bayes rests on the assumption of feature independence
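A minimal sketch of MultinomialNB on BoW counts, including how the top features per class can be read off from `feature_log_prob_`; the toy data and alpha are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great tasty product", "awful stale snack", "best coffee ever",
         "terrible flavor", "delicious would buy again", "disgusting waste",
         "good value tasty", "bad quality disappointed"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # toy stand-ins for the cleaned corpus

vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)  # raw counts are non-negative, as MultinomialNB requires
nb = MultinomialNB(alpha=1.0).fit(X, labels)

# Most important features per class = largest log P(word | class)
features = vec.get_feature_names_out()
for i, cls in enumerate(nb.classes_):
    top = np.argsort(nb.feature_log_prob_[i])[::-1][:5]  # use [:25] for the top 25
    print("negative" if cls == 0 else "positive", [features[j] for j in top])
```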
- Applied Logistic Regression on different featurizations of the data, viz. BoW (uni-gram, bi-gram), TF-IDF, Avg-Word2Vec (using a Word2Vec model pretrained on Google News), and TF-IDF-weighted Word2Vec
- Used both Grid Search & Randomized Search Cross Validation
- Evaluated the models on the test data with various performance metrics such as accuracy, F1-score, precision, and recall; also plotted the confusion matrix using seaborn
- Showed how sparsity increases as we increase lambda (i.e., decrease C) when the L1 regularizer is used, for each featurization (demonstrated in the sketch after this list)
- Performed a perturbation test to check whether the features are multicollinear
- The features are multicollinear, i.e., they are correlated with one another
- The bi-gram featurization performed best, with an accuracy of 93.704% and an F1-score of 0.808
- Sparsity increases as we increase lambda (decrease C) when the L1 regularizer is used
- Algorithms like SVM & Logistic Regression performed best on this data
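The sparsity effect of L1 regularization mentioned above is easy to demonstrate; a minimal sketch, where the C grid, solver, and toy data are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great tasty product", "awful stale snack", "best coffee ever",
         "terrible flavor", "delicious would buy again", "disgusting waste",
         "good value tasty", "bad quality disappointed"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # toy stand-ins for the cleaned corpus

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# As C decreases (lambda = 1/C increases), the L1 penalty drives more
# coefficients to exactly zero, so the learned weight vector gets sparser
for C in (10.0, 1.0, 0.1, 0.01):
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, labels)
    print(f"C={C}: {np.mean(clf.coef_ == 0.0):.0%} of weights are exactly zero")
```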
- Applied SVM with the RBF (radial basis function) kernel on different featurizations of the data, viz. BoW (uni-gram, bi-gram), TF-IDF, Avg-Word2Vec (using a Word2Vec model pretrained on Google News), and TF-IDF-weighted Word2Vec
- Used both Grid Search & Randomized Search Cross Validation
- Evaluated the models on the test data with various performance metrics such as accuracy, F1-score, precision, and recall; also plotted the confusion matrix using seaborn
- Also evaluated SGDClassifier on the best-performing featurization (see the sketch after this list)
- Support Vector Machine (SVM) gave the best results, better than the other algorithms and close to Logistic Regression
- The TF-IDF featurization (C=1000, gamma=0.005) gave the best results, with an accuracy of 91.667% and an F1-score of 0.733
- With an RBF-kernel SVM, the separating plane exists in another space, a result of the kernel transformation of the original space; its coefficients are not directly related to the input space, hence we cannot obtain feature importances
- Also tried SGDClassifier on the best featurization (TF-IDF): it was very quick, giving around the same score in just seconds, with an accuracy of 91.04% and an F1-score of 0.734 using (alpha=1e-05, penalty='l1')
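A minimal sketch of the RBF-kernel SVM with the hyperparameters reported above, next to the much faster linear SGDClassifier; the hinge loss, which makes SGDClassifier a linear SVM, is its scikit-learn default and an assumption here, as is the toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

texts = ["great tasty product", "awful stale snack", "best coffee ever",
         "terrible flavor", "delicious would buy again", "disgusting waste",
         "good value tasty", "bad quality disappointed"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # toy stand-ins for the cleaned corpus

X = TfidfVectorizer().fit_transform(texts)

# RBF-kernel SVM with the best hyperparameters found above; the decision
# boundary lives in the kernel-induced space, so there are no per-feature weights
svm = SVC(kernel="rbf", C=1000, gamma=0.005).fit(X, labels)

# Linear alternative: hinge-loss SGD trains in seconds even on the full corpus
sgd = SGDClassifier(loss="hinge", penalty="l1", alpha=1e-05).fit(X, labels)
print(svm.predict(X[:2]), sgd.predict(X[:2]))
```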
- Applied Decision Trees on different featurizations of the data, viz. BoW (uni-gram, bi-gram), TF-IDF, Avg-Word2Vec (using a Word2Vec model pretrained on Google News), and TF-IDF-weighted Word2Vec
- Used grid search over 30 random points to find the best max_depth (see the sketch after this list)
- Evaluated the models on the test data with various performance metrics such as accuracy, F1-score, precision, and recall; also plotted the confusion matrix using seaborn
- Plotted the feature importances obtained from the decision tree classifier
- Decision Trees on the uni-gram, bi-gram, and TF-IDF featurizations would have taken forever if grown over all dimensions, as the feature space is huge; hence max_depth was capped at 300
- The bi-gram featurization (max_depth=73) gave the best results, with an accuracy of 85.11% and an F1-score of 0.513
- Plotted feature importances for the uni-gram, bi-gram, and TF-IDF featurizations, but not for Avg-Word2Vec and TF-IDF-weighted Word2Vec, as the Word2Vec dimensions are highly correlated and hence feature importances cannot be read off directly
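A minimal sketch of the decision-tree experiment: randomized search over max_depth (30 random points, as above) followed by reading off `feature_importances_`; the depth range, cv folds, and toy data are assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

texts = ["great tasty product", "awful stale snack", "best coffee ever",
         "terrible flavor", "delicious would buy again", "disgusting waste",
         "good value tasty", "bad quality disappointed"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # toy stand-ins for the cleaned corpus

vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)

# 30 random max_depth values out of 1..300 (capped, as noted above), scored by F1
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                            {"max_depth": list(range(1, 301))},
                            n_iter=30, cv=2, scoring="f1", random_state=0)
search.fit(X, labels)

# Feature importances are directly interpretable for BoW/TF-IDF dimensions
features = vec.get_feature_names_out()
top = np.argsort(search.best_estimator_.feature_importances_)[::-1][:5]
print(search.best_params_, [features[j] for j in top])
```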