An entry to the "Bag of Words Meets Bags of Popcorn" competition using word2vec in R
To get the competition data, click here
####Packages needed:
- rword2vec
- Rcpp and RcppArmadillo
- rpart and randomForest
- tm
####Code Explanation:
- Word vectors are obtained using the rword2vec package.
- The binary output file is converted into a text file for further processing.
- To build a training dataset for sentiment classification of the reviews from the word vectors obtained above, two popular methods can be used:
    - Vector Averaging
    - Clustering
- In the first method, the word vectors are averaged for each row of the labeled and test datasets. There are many ways to do this, but I have done this part using Rcpp and RcppArmadillo (R interface to C++) to speed up these compute-intensive operations.
- In clustering, a bag of centroids is used instead of a bag of words. This part is also done using Rcpp and RcppArmadillo to optimize speed.
- Finally, classification is done using random forest.
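
To make the vector averaging step concrete, here is a minimal sketch in base R. It is not the Rcpp/RcppArmadillo code used in this repository; the toy `word_vectors` matrix, the helper `average_review` and the sample reviews are hypothetical and only illustrate the idea of turning each review into the mean of its word vectors.

```r
# Toy word-vector table: one row per vocabulary word, one column per dimension.
# (Hypothetical values; the real vectors come from the word2vec text output.)
word_vectors <- rbind(
  good  = c( 1.0, 0.0),
  great = c( 0.8, 0.2),
  bad   = c(-1.0, 0.0),
  movie = c( 0.0, 1.0)
)

# Average the vectors of all in-vocabulary words of one review.
# Reviews without any known word get a zero vector.
average_review <- function(review, vectors) {
  words <- strsplit(tolower(review), "[^a-z]+")[[1]]
  words <- words[words %in% rownames(vectors)]
  if (length(words) == 0) return(rep(0, ncol(vectors)))
  colMeans(vectors[words, , drop = FALSE])
}

reviews <- c("Good movie", "bad bad movie", "zzz")

# One row per review, one column per vector dimension.
features <- t(sapply(reviews, average_review, vectors = word_vectors))
print(features)
```

The resulting `features` matrix has a fixed number of columns regardless of review length, which is what lets it be passed to a classifier such as random forest.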
####Note: I'd recommend reading this Python tutorial series first for a better understanding of vector averaging and clustering.
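
For readers who want a quick feel for the bag of centroids idea before going through the tutorial, the following base R sketch clusters a few toy word vectors with `kmeans` and counts, per review, how many words fall into each cluster. The vectors, the helper `bag_of_centroids`, the sample reviews and the choice of `k = 2` are assumptions made for illustration; the repository itself does this step with Rcpp and RcppArmadillo.

```r
set.seed(42)  # k-means starts from random centers

# Toy word vectors (hypothetical values): two "positive" and two "negative" words.
word_vectors <- rbind(
  good  = c( 1.0, 0.1),
  great = c( 0.9, 0.0),
  bad   = c(-1.0, 0.1),
  awful = c(-0.9, 0.0)
)

k <- 2
# Named integer vector: word -> cluster id (cluster numbering is arbitrary).
clusters <- kmeans(word_vectors, centers = k, nstart = 10)$cluster

# Count how many words of a review fall into each cluster.
bag_of_centroids <- function(review, cluster_of, k) {
  words <- strsplit(tolower(review), "[^a-z]+")[[1]]
  ids <- cluster_of[words[words %in% names(cluster_of)]]
  tabulate(ids, nbins = k)
}

reviews <- c("good great good", "awful bad", "good bad")

# One row per review, one column per cluster.
features <- t(sapply(reviews, bag_of_centroids, cluster_of = clusters, k = k))
print(features)
```

Each review becomes a vector of length `k`, in the same way a bag of words yields one count per vocabulary word, only with clusters of similar words taking the place of individual words.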
####Test dataset results:
Classification using Vector Averaging
Classification using Clustering
####Results:
- The accuracy obtained with vector averaging and with bag of centroids exceeds the respective thresholds, but it is still quite low.
- Accuracy can be improved by using other machine learning algorithms such as GBM, xgboost, or neural networks, and by using techniques such as stacking, blending, and bagging.