Kaggle Challenge : Bag of Words Meets Bags of Popcorn

Min Gu Jo

Use TFIDF Vectorizer (sklearn) and Google's word2vec (gensim) to perform sentiment analysis on movie reviews

Overview

Google's Word2Vec is a deep-learning-inspired method that learns vector representations of words, capturing their meaning and the semantic relationships among them. It works in a way similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This repo follows, to some degree, the word2vec tutorial provided by the competition.

Goal

Increase the accuracy of IMDb movie review sentiment prediction (positive/negative)

Methodology

Google's word2vec

Construct and save a 'Bag of Words' model using 50,000 IMDb movie reviews (.tsv) with Google's word2vec from the 'gensim' library
Create clusters of semantically related words (bag of centroids) using K-means clustering
Train a random forest
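
The bag-of-centroids step above can be sketched as follows. This is a minimal illustration and not the repo's actual code: the function name `bag_of_centroids` and the toy `word_to_cluster` mapping are hypothetical. In practice the mapping would come from running K-means over the word vectors learned by word2vec, and the resulting count vectors would be the features fed to the random forest.

```python
from collections import Counter


def bag_of_centroids(words, word_to_cluster, num_clusters):
    """Count how many words of one review fall into each word cluster."""
    counts = Counter(
        word_to_cluster[w] for w in words if w in word_to_cluster
    )
    # One feature per cluster; clusters with no matching word get 0.
    return [counts.get(c, 0) for c in range(num_clusters)]


# Hypothetical cluster assignments (assumed to come from K-means on word vectors).
word_to_cluster = {
    "great": 0, "wonderful": 0,   # positive words
    "awful": 1, "boring": 1,      # negative words
    "movie": 2, "film": 2,        # topic words
}

review = "a great film and a wonderful movie".split()
features = bag_of_centroids(review, word_to_cluster, num_clusters=3)
print(features)  # [2, 0, 2]
```

Words missing from the mapping (here "a" and "and") are simply skipped, so every review maps to a fixed-length vector regardless of its length.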

TFIDF Vectorizer

Choose the hyperparameters and construct a TFIDF vectorizer from both the labeled and unlabeled training sets
Select the top ten percent of features (feature reduction)
Train an ensemble (Multinomial Naive Bayes & SGDClassifier)
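
The three steps above might look roughly like the sketch below, using scikit-learn. Several details are assumptions rather than the repo's actual settings: the vectorizer hyperparameters (`ngram_range`, `sublinear_tf`), the chi-squared scoring function, the `modified_huber` loss, and combining the two models by averaging their predicted probabilities. The tiny review lists are invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the Kaggle data (illustrative only).
labeled_reviews = [
    "a great and wonderful film",
    "wonderful acting and a great story",
    "an awful and boring movie",
    "boring plot and awful acting",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative
unlabeled_reviews = ["a great movie", "such a boring film"]
test_reviews = ["a wonderful story", "awful and boring"]

# Fit the vocabulary and IDF weights on labeled + unlabeled text.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
vectorizer.fit(labeled_reviews + unlabeled_reviews)
X_train = vectorizer.transform(labeled_reviews)
X_test = vectorizer.transform(test_reviews)

# Keep only the top 10 percent of features (scored here with chi-squared).
selector = SelectPercentile(chi2, percentile=10)
X_train_sel = selector.fit_transform(X_train, labels)
X_test_sel = selector.transform(X_test)

# Train both models on the reduced feature set.
nb = MultinomialNB().fit(X_train_sel, labels)
sgd = SGDClassifier(loss="modified_huber", random_state=0).fit(X_train_sel, labels)

# Ensemble by averaging the positive-class probabilities (an assumption).
proba = (nb.predict_proba(X_test_sel)[:, 1] + sgd.predict_proba(X_test_sel)[:, 1]) / 2
predictions = (proba >= 0.5).astype(int)
print(predictions)
```

The `modified_huber` loss is used here because it lets `SGDClassifier` expose `predict_proba`, which the averaging step needs; a hinge-loss model would require a different way of combining the two classifiers.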

Directions

Clone the repo: `git clone https://github.com/mingujo/BagOfWords-Kaggle-.git`
Create a 'data' folder, download the provided data from https://www.kaggle.com/c/word2vec-nlp-tutorial/data, and unzip it
Install the Python dependencies with pip: pip install -r requirements.txt
Run utils/scripts/word2vec_model_buildingscript.py to build the 'bag of words' model
Run utils/scripts/clustering_script.py to train, predict, and create a submission file (this may take up to 1 hour). To train and predict without clustering, run random_forest_script.py instead
Run TFIDF_ensemble_script.py to build the TFIDF model, train an ensemble model, and predict
Check the submission files under model/

Accuracy

TFIDF vectorizer with an ensemble model : 96.180% accuracy (within the top 60, top 11th percentile, when submitted)
Word2Vec with K-means clustering and random forest : 85% accuracy (within the top 300 when submitted)
