TextClassification-with-MLP

Text Classification

The objective of this project was to familiarize with text classifiers and especially multilayer perceptron neural network. For this purpose, a dataset with customer reviews for restaurants was chosen. Reviews are classified in 3 classes (negative, neutral and positive), the dataset includes 2573 reviews which are considered a small number for training.

Project Desrciption

Due to the small size of the dataset many candidate hyper-parameters were tuned to get the best results. The following classifiers were selected:

Intializers
Dropout rate
Activation function
Learning rate reduction
Early Stopping
Units Dense layer of hidden layers

Additionally two different text representation were tried:

TF-IDF Vectors
Word Embeddings

Problems Encoutered

Due to the small size of the dataset it is difficult to draw a conclusion regarding the performance of the various models.

Getting Started

Prerequisites

This package needs tensorflow installed on GPU.

other modules:
* keras * sklearn * gensim * numpy * pandas * bs4

File Description

There are two different .py files; one for the TF-IDF Vectors and one for the Word Embeddings
The data files requiered for the train, validation and test datasets can be found here , and are under the Restaurant-SB1 category.

Data Preprocessing

In order to prepare the data for training some preprocessing of the data was necessary. Mainly, to remove punctuation symbols and also use lower case symbols for correct word (string) comparisons. Furthermore, the dataset classes examples were imbalanced. One class (neutral) had very few training examples. The table below shows the particularity of this dataset including validation data after removing na’s.

Class	Class Encoding	Instances
Negative	0	1696
Neutral	1	104
Positive	2	1696

To overcome the above imbalanced set the following techniques were tested:

used the validation and train data as one set and used cross validation method which is suggested in such circumstances
used stratified sampling during the above process to handle the imbalance between the classes
used a data augmentation technique to ensure that the maximum possible information from the neutral class. The same data were copied another 8 times adding white noise to the tf-idf vectors that were produced.

Furthermore, the following features were added: a. One hot encoding of the Categories and Subcategories of the dataset. b. The existence or not of capitalized sentences.

Authors

George Vefeidis
Leon Kalderon
Georgia Sarri

LeonKalderon/TextClassification-with-MLP