Word2Vec
Word2Vec implementation in Python and using tensorflow, with application to learning word vectors on an IMDB movie-reviews dataset.
Investigates several models, including a bag of words model, a neural network that learns from clusters of word-vectors, and an RNN whose inputs are word-vectors.
Usage
make
: Will run all of the models (as well as output accuracy reports), download the necessary data, and write the preprocessed data to persistent storage so future runs don't have to do preprocessing/redownloading.
make
targets:
-
bag-of-words
: runs the bag of words model and writes an accuracy report to filebag-of-words/bag-of-words-acc.txt
-
word-vectors
: runs a Tensorflow model to learn word vectors from the review data, and outputs a picture of a TSNE word vector visualization to filetf-implementation/TSNE.png
-
cluster-model
: runs a feature engineering algorithm that generates features based on the clustering of a review's word vectors (ie, the frequency of each word vectors in a particular cluster in the review) followed by a Tensorflow neural network model whose inputs are these clusters. An accuracy report is written to filetf-implementation/cluster-acc.txt
. -
rnn-model
: runs a Tensorflow RNN model who's inputs are the learned word vectors made by theword-vectors
target. An accuracy report is written to filetf-implementation/rnn-acc.txt
.