Modelling and sentence classification with convolutional neural networks.
This project is based on the work of Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom, "A Convolutional Neural Network for Modelling Sentences", and on that of Yoon Kim, "Convolutional Neural Networks for Sentence Classification".
Words are represented as vectors produced by the word2vec algorithm. In this project, pre-trained vectors obtained by running word2vec on part of the Google News dataset (about 100 billion words) are loaded. The pre-trained vectors are available here: GoogleNews-vectors-negative300.bin.gz.
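To give an intuition of what these pre-trained vectors provide, here is a toy sketch (the words and 3-dimensional vectors below are made up for illustration; the real GoogleNews vectors have 300 dimensions): words with similar meanings end up with similar vectors, which is what lets the CNN generalise across wordings.

```python
import math

# Toy stand-in for word2vec: each word maps to a dense vector.
# (Illustrative values only; the real model stores 300-d vectors.)
embeddings = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "bad":   [-0.7, 0.5, -0.2],
}

def cosine(u, v):
    # Cosine similarity: close to 1 for similar directions
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Semantically similar words have more similar vectors
print(cosine(embeddings["good"], embeddings["great"]) >
      cosine(embeddings["good"], embeddings["bad"]))  # True
```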
Our model is trained on the Movie Review dataset, which can be downloaded here. This dataset separates movie reviews into two categories: positive and negative.
Since the machine learning algorithm learns from vector representations of words, the dataset must first be converted into a set of vectors. To do so, run pre_processing.py on its own, a single time: it converts the dataset with the word2vec algorithm and stores the result in the files dataset.pkl and labels.pkl. Note that the following paths in pre_processing.py should be adapted to your configuration:
    if __name__ == '__main__':
        # Compute the vector representation of the words and store it on disk
        word2vec_file = "word2vec/GoogleNews-vectors-negative300.bin"
        negative_path = "MR/reviews/neg"
        positive_path = "MR/reviews/pos"
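The conversion itself can be pictured as follows: each review becomes a matrix with one word vector per row. This is only a hypothetical sketch of the idea (the helper `sentence_to_matrix`, the toy 3-d vectors, and the skip-unknown-words policy are illustrative assumptions, not the actual pre_processing.py code):

```python
# Toy embedding table (illustrative 3-d vectors instead of the real 300-d ones)
embeddings = {
    "a": [0.1, 0.2, 0.3],
    "great": [0.8, 0.2, 0.4],
    "movie": [0.5, 0.5, 0.1],
}

def sentence_to_matrix(sentence):
    # Look up a vector for each word; unknown words are simply skipped
    # in this sketch (the real script may handle them differently).
    return [embeddings[w] for w in sentence.lower().split() if w in embeddings]

matrix = sentence_to_matrix("A great movie")
print(len(matrix), len(matrix[0]))  # one row per known word, 3 values each -> 3 3
```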
After that, you just have to run the following code every time you want to load the dataset:
    import pre_processing as pre
    dataset, labels = pre.load_dataset("path/to/folder")
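Internally, a loader like this presumably just unpickles the two files produced by the pre-processing step. The implementation below is a hedged sketch of that idea, not the actual code of pre_processing.py (only the file names dataset.pkl and labels.pkl and the function name come from this README):

```python
import os
import pickle
import tempfile

def load_dataset(folder):
    # Sketch: read back the two pickle files written by pre_processing.py
    with open(os.path.join(folder, "dataset.pkl"), "rb") as f:
        dataset = pickle.load(f)
    with open(os.path.join(folder, "labels.pkl"), "rb") as f:
        labels = pickle.load(f)
    return dataset, labels

# Round-trip demo with dummy data in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "dataset.pkl"), "wb") as f:
        pickle.dump([[0.1, 0.2]], f)
    with open(os.path.join(folder, "labels.pkl"), "wb") as f:
        pickle.dump([1], f)
    dataset, labels = load_dataset(folder)

print(dataset, labels)  # [[0.1, 0.2]] [1]
```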
The script train.py is in charge of training: it imports pre_processing.py and Model.py and trains the CNN model on the training set. Before executing it, make sure you have enough RAM, otherwise you will get a MemoryError on the following lines:
    print("Reshaping data...")
    x_train, y_train = reshape_data(x_shuffle[:1500], y_shuffle[:1500])
If you still get a MemoryError, try freeing unused objects with the garbage collector:
    import gc
    gc.collect()
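If that is still not enough, another common workaround is to process the 1500 examples in smaller slices rather than all at once, collecting garbage between slices. This is a generic chunking sketch under that assumption, not the project's actual reshape_data function:

```python
import gc

def process_in_chunks(x, y, chunk_size=500):
    # Accumulate results slice by slice so only one slice's worth of
    # intermediate buffers is alive at any time.
    out_x, out_y = [], []
    for start in range(0, len(x), chunk_size):
        out_x.extend(x[start:start + chunk_size])
        out_y.extend(y[start:start + chunk_size])
        gc.collect()  # reclaim the intermediates of the finished slice
    return out_x, out_y

x, y = process_in_chunks(list(range(1500)), list(range(1500)))
print(len(x), len(y))  # 1500 1500
```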