This is a simple NLP project based on the NLP section of the A-Z Machine Learning course on Udemy.
The objective of this exercise is to identify the best model for classifying restaurant review comments. We clean the dataset and convert each review into a vector using the bag-of-words model.
| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
The dataset contains the review string followed by a binary flag indicating whether the user liked the restaurant (1) or not (0).
- Removing punctuation and symbols
- Removing stop words
- Stemming the remaining words, then tokenizing
- Building the vectors from the individual reviews
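
The cleaning steps listed above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the stop word list is a small hand-picked assumption, `crude_stem` is a hypothetical stand-in for a real stemmer, and the helper names `clean_review` and `build_vectors` are invented for this example.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "this", "of", "on", "so", "were", "just"}


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_review(text):
    """Remove punctuation and symbols, drop stop words, and stem what remains."""
    # Replace every non-letter with a space, then lowercase.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    return [crude_stem(w) for w in letters_only.split() if w not in STOP_WORDS]


def build_vectors(reviews):
    """Return the sorted vocabulary and one word-count vector per review."""
    cleaned = [clean_review(r) for r in reviews]
    vocabulary = sorted({token for tokens in cleaned for token in tokens})
    vectors = []
    for tokens in cleaned:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


reviews = ["Wow... Loved this place.", "Crust is not good."]
vocabulary, vectors = build_vectors(reviews)
print(vocabulary)
print(vectors)
```

Each vector has one entry per vocabulary term, so every review maps to a row of the same length regardless of how many words it contains.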
$$\begin{bmatrix} 55 & 42 \\ 12 & 91 \end{bmatrix}$$
$\text{accuracy} = 0.73$
$$\begin{bmatrix} 74 & 23 \\ 35 & 68 \end{bmatrix}$$
$\text{accuracy} = 0.71$
$$\begin{bmatrix} 87 & 10 \\ 46 & 57 \end{bmatrix}$$
$\text{accuracy} = 0.72$
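
As a quick check, the accuracy of each model can be recomputed from its confusion matrix as the sum of the diagonal (correct predictions) divided by the total number of test reviews. The helper name `accuracy` below is chosen for this sketch only.

```python
def accuracy(confusion_matrix):
    """Fraction of correct predictions: diagonal sum divided by the total count."""
    correct = sum(confusion_matrix[i][i] for i in range(len(confusion_matrix)))
    total = sum(sum(row) for row in confusion_matrix)
    return correct / total


# The three confusion matrices reported above, in the same order.
matrices = [
    [[55, 42], [12, 91]],
    [[74, 23], [35, 68]],
    [[87, 10], [46, 57]],
]

scores = [accuracy(m) for m in matrices]
print(scores)
```

For the first matrix this gives (55 + 91) / 200 = 0.73, matching the reported value.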
A sample predictor was created for use in our Django app. It classifies the comment with all three models that we tried and averages their results to produce the final prediction. The predictor takes its input as a string.
$$\begin{bmatrix} 85 & 15 \\ 32 & 71 \end{bmatrix}$$
$\text{accuracy} = 0.765$
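
The averaging logic of the predictor might look roughly like the sketch below. It rests on several assumptions: each model exposes a `predict` method that returns 0/1 labels, the 0.5 threshold on the average is a guess at how the average becomes a final label, and `ConstantModel`, `predict_review`, and the placeholder `vectorize` function are hypothetical names used only to make the example runnable.

```python
class ConstantModel:
    """Hypothetical stand-in for a trained classifier with a predict() method."""

    def __init__(self, label):
        self.label = label

    def predict(self, rows):
        # Always answers with the same label, one per input row.
        return [self.label for _ in rows]


def predict_review(text, vectorize, models):
    """Classify one review string by averaging the 0/1 votes of all models."""
    features = vectorize(text)
    votes = [model.predict([features])[0] for model in models]
    average = sum(votes) / len(votes)
    # Assumption: an average of at least 0.5 counts as "liked".
    return 1 if average >= 0.5 else 0


def vectorize(text):
    # Placeholder feature extraction; a real app would reuse the bag-of-words step.
    return [len(text)]


mostly_positive = [ConstantModel(1), ConstantModel(0), ConstantModel(1)]
mostly_negative = [ConstantModel(0), ConstantModel(0), ConstantModel(1)]
print(predict_review("Loved this place", vectorize, mostly_positive))
print(predict_review("Crust is not good", vectorize, mostly_negative))
```

With three binary classifiers, thresholding the average at 0.5 is equivalent to a majority vote.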
The three trained models were pickled using Python's `pickle` library and then loaded inside the Django project.
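
A minimal sketch of the save-and-load round trip with `pickle` follows. A plain dictionary stands in for a trained model here, and the file name `model.pkl` is a hypothetical choice; the real project may organize its files differently.

```python
import os
import pickle
import tempfile

# Assumption: a plain dictionary stands in for a trained model object.
model = {"name": "stand-in model", "weights": [0.2, -0.5, 1.3]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.pkl")  # hypothetical file name

    # Training side: serialize the model to disk.
    with open(path, "wb") as handle:
        pickle.dump(model, handle)

    # Application side: load the model back before making predictions.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == model)
```

Because unpickling can execute arbitrary code, a web app should only load pickle files that it produced itself.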
In conclusion, none of these methods classifies the reviews perfectly. However, the best single-model result was obtained with the Random Forest classifier, and an even better result was obtained from the predictor function, which aggregates the three classifiers. Another factor to consider is that the models were built on a very limited dataset and have their limitations. Altogether, we are able to get fairly good results for a basic implementation on a web