Kaggle Challenge - Amazon Reviews: A task to identify the polarity of mobile phone reviews posted on Amazon.
Comparison of a custom-trained Gensim embedding model with pre-trained word embedding models (Google word2vec and Stanford GloVe).
Tokenization, followed by removing stopwords, special characters and abbreviations.
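The cleaning step can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the `STOPWORDS` and `ABBREVIATIONS` sets and the `preprocess` helper are hypothetical, and a real pipeline would likely use a fuller stopword list and abbreviation table.

```python
import re
from typing import List

# Illustrative subsets only (assumptions, not taken from the project).
STOPWORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "of", "i"}
ABBREVIATIONS = {"gr8", "btw", "thx", "imo"}


def preprocess(review: str) -> List[str]:
    """Tokenize a review and drop special characters, stopwords and abbreviations."""
    text = review.lower()
    # Replace everything except letters, digits and whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Whitespace tokenization is enough once punctuation has been removed.
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS and t not in ABBREVIATIONS]


tokens = preprocess("This phone is gr8, btw!!! Battery lasts 2 days.")
print(tokens)
```

Removing special characters before splitting keeps the tokenizer trivial; a library tokenizer could be swapped in without changing the rest of the function.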
Using Common Bag of Words and tf-idf values to determine feature points.
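As a rough sketch of this feature step, the snippet below computes raw term counts and tf-idf weights in plain Python. It assumes the bag-of-words representation means per-review term counts and uses the basic variant tf × log(N / df); the project may have used a library implementation with different smoothing or normalization. The `bag_of_words` and `tf_idf` helpers and the toy reviews are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(doc):
    """Raw term counts for one tokenized review."""
    return Counter(doc)


def tf_idf(docs):
    """Per-review tf-idf weights using tf * log(N / df), with no smoothing."""
    n_docs = len(docs)
    # Document frequency: in how many reviews does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = bag_of_words(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy tokenized reviews (assumed data, for illustration only).
docs = [
    ["great", "phone", "great", "battery"],
    ["bad", "battery"],
    ["great", "screen"],
]
weights = tf_idf(docs)
print(weights[1])
```

Terms that occur in few reviews ("bad") receive a higher weight than terms shared across reviews ("battery"), which is what makes them useful as feature points.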
Building three models: a custom-trained Gensim Word2Vec model, Google's pre-trained Word2Vec and Stanford's pre-trained GloVe. Words represented as one-hot vectors are mapped to dense numeric vectors of 300 dimensions. Similarity between two words is calculated as the cosine of the angle between their vectors: similarity(a, b) = (a · b) / (‖a‖ ‖b‖).
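The cosine similarity between two word vectors can be sketched in a few lines. The `cosine_similarity` helper and the short example vectors are illustrative assumptions; real vectors would have 300 components and come from one of the embedding models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Similarity is undefined for a zero vector; return 0.0 by convention here.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for 300-dimensional embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Because the measure depends only on direction, two words whose vectors differ in length but point the same way are treated as maximally similar.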
Each processed review is represented by the mean of its word vectors and plotted in the word-vector space.
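Averaging word vectors into a single review vector might look like the following sketch. The `review_vector` helper, the dictionary-based embedding lookup and the choice to skip out-of-vocabulary words are assumptions made for illustration; the toy vectors have 3 dimensions instead of 300.

```python
import numpy as np


def review_vector(tokens, embeddings, dim):
    """Mean of the vectors of all tokens found in the embedding lookup."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector (an assumed convention).
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy embedding table (assumed data).
toy_embeddings = {
    "great": np.array([1.0, 0.0, 2.0]),
    "phone": np.array([3.0, 2.0, 0.0]),
}
vec = review_vector(["great", "phone", "unknownword"], toy_embeddings, dim=3)
print(vec)
```

Every review ends up with the same fixed length regardless of how many words it contains, which is what allows it to be fed to a standard classifier.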
The reviews are then classified with RandomForestClassifier as good (1), neutral (0) or bad (-1). The predictions are tested against the actual class values provided in the dataset, with the following results for each embedding model:
- Google Word2Vec : 80.35 %
- Stanford GloVe : 80.03 %
- Gensim Word2Vec : 83.26 %
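The classification and evaluation step can be sketched with scikit-learn as below. The data here is synthetic (random clusters standing in for 300-dimensional review vectors), and the split ratio, number of trees and random seeds are assumptions, so the score it prints has no relation to the figures above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for mean review vectors: 100 reviews per class,
# with each class shifted so that the clusters are separable.
labels = np.repeat([-1, 0, 1], 100)
features = rng.normal(size=(300, 300)) + labels[:, None] * 2.0

# Hold out 20% of the reviews for testing (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Compare predictions with the held-out class values.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"accuracy: {accuracy:.2%}")
```

To compare embedding models, the same split and classifier settings would be reused while only the `features` matrix changes.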