Natural-language-processing

Modelling and sentence classification with convolutional neural networks.

This project is based on the work of Nal Kalchbrenner, Edward Grefenstette and Phil Blunsom, A Convolutional Neural Network for Modelling Sentences, and on the work of Yoon Kim, Convolutional Neural Networks for Sentence Classification.

Pre-processing

The vector representation of words is based on the word2vec algorithm. In this project, pre-trained word vectors are loaded from a word2vec model trained on part of the Google News dataset (about 100 billion words). The pre-trained vectors are available here: GoogleNews-vectors-negative300.bin.gz.
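
For reference, the pre-trained binary file can be loaded with gensim; this is only a minimal sketch (gensim is an assumption here, the project's own loading code may differ):

import numpy as np
from gensim.models import KeyedVectors

# binary=True because the GoogleNews file is in the binary word2vec format
vectors = KeyedVectors.load_word2vec_format(
    "word2vec/GoogleNews-vectors-negative300.bin", binary=True)

print(vectors["movie"].shape)  # each word maps to a 300-dimensional vector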

Our model is trained on the Movie Review dataset, which can be downloaded here. This dataset separates movie reviews into two categories: positive and negative reviews.

Since the machine learning algorithm learns from vector representations of words, the dataset must first be converted into a set of vectors. To do that, pre_processing.py should be executed on its own a single time: it converts the dataset using the word2vec vectors and stores the result in the files dataset.pkl and labels.pkl. Note that the following code from pre_processing.py should be adapted to your configuration:

if __name__ == '__main__':

    # Compute vector representation of words and store it on disk
    word2vec_file = "word2vec/GoogleNews-vectors-negative300.bin"

    negative_path = "MR/reviews/neg"
    positive_path = "MR/reviews/pos"
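
For orientation, the conversion step roughly amounts to mapping each word of each review to its word2vec vector and pickling the result. The sketch below illustrates the idea only; the helper name convert_folder, the file encoding and the exact shapes are assumptions, and it presumes vectors is a gensim KeyedVectors object loaded from word2vec_file as shown above:

import os
import pickle
import numpy as np

def convert_folder(path, label, vectors, dataset, labels):
    # Map every word of every review in `path` to its 300-d word2vec vector
    for name in os.listdir(path):
        with open(os.path.join(path, name), encoding="latin-1") as f:
            words = f.read().split()
        review = [vectors[w] for w in words if w in vectors]
        dataset.append(np.array(review))
        labels.append(label)

dataset, labels = [], []
convert_folder(negative_path, 0, vectors, dataset, labels)
convert_folder(positive_path, 1, vectors, dataset, labels)

# Store the converted dataset on disk so this step only has to run once
with open("dataset.pkl", "wb") as f:
    pickle.dump(dataset, f)
with open("labels.pkl", "wb") as f:
    pickle.dump(labels, f)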

After that, you just have to run the following code every time you want to load the converted dataset:

import pre_processing as pre

dataset,labels = pre.load_dataset("path/to/folder")
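
load_dataset presumably reads back the two pickle files produced by the pre-processing step; something along these lines (this is an assumption about its internals, not the actual source of pre_processing.py):

import os
import pickle

def load_dataset(folder):
    # Read back the vectors and labels written by pre_processing.py
    with open(os.path.join(folder, "dataset.pkl"), "rb") as f:
        dataset = pickle.load(f)
    with open(os.path.join(folder, "labels.pkl"), "rb") as f:
        labels = pickle.load(f)
    return dataset, labels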

Training

The script train.py is in charge of training. It imports pre_processing.py and Model.py and trains the CNN model on the training set. Before running it, make sure you have enough RAM, otherwise you will get a MemoryError at the following line:

print("Reshaping data...")
x_train, y_train = reshape_data(x_shuffle[:1500], y_shuffle[:1500])
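
If 1500 examples do not fit in your RAM, you can lower the slice size; the value below is only an example, pick whatever your machine can hold:

print("Reshaping data...")
# Use a smaller subset of the shuffled data to reduce memory usage
x_train, y_train = reshape_data(x_shuffle[:800], y_shuffle[:800])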

If you still get a MemoryError, try forcing garbage collection:

import gc

gc.collect()
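
If that is not enough, explicitly dropping references to the large intermediate arrays before collecting usually helps more than gc.collect() alone. This assumes x_shuffle and y_shuffle are no longer needed once the training data has been reshaped:

import gc

# Drop references to the large shuffled arrays so the collector can reclaim them
del x_shuffle, y_shuffle
gc.collect()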