Neural models for sentence classification

This repository contains an implementation of Yoon Kim's paper "Convolutional Neural Networks for Sentence Classification" (http://www.people.fas.harvard.edu/~yoonkim/data/sent-cnn.pdf).

We also compare the performance of the CNN with a simple bag-of-words model.
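To make the comparison concrete, here is a minimal pure-Python sketch of how the two models might turn a sentence into features. It is not the code in this repository: the function names, the ReLU nonlinearity, and the toy numbers are assumptions chosen for illustration. The CNN feature slides one filter over windows of consecutive word vectors and keeps the maximum activation (max-over-time pooling), while the bag-of-words feature simply averages the word vectors and ignores word order.

```python
def conv_max_over_time(embeddings, filt, bias=0.0):
    """One CNN feature: filter over windows of h words, ReLU, then max over time."""
    h = len(filt)  # window size in words
    if len(embeddings) < h:
        raise ValueError("sentence shorter than the filter window; pad it first")
    activations = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        total = bias
        # Dot product between the filter and the concatenated window.
        for word_vec, filt_row in zip(window, filt):
            total += sum(x * w for x, w in zip(word_vec, filt_row))
        activations.append(max(0.0, total))  # ReLU (assumed nonlinearity)
    return max(activations)  # max-over-time pooling


def bag_of_words_average(embeddings):
    """Bag-of-words sentence vector: the mean of the word vectors."""
    dim = len(embeddings[0])
    count = len(embeddings)
    return [sum(vec[d] for vec in embeddings) / count for d in range(dim)]


# Toy sentence of three words with 2-dimensional embeddings (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One filter spanning two words (h = 2).
filt = [[1.0, -1.0], [0.5, 0.5]]

cnn_feature = conv_max_over_time(sentence, filt)
bow_vector = bag_of_words_average(sentence)
```

A real CNN uses many such filters with several window sizes and concatenates the pooled values before the final classification layer.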

Dataset

The classification model has been trained on a topic classification dataset in the following format.

TOPIC ||| TEXT
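A file in this format could be read roughly as follows. This is a sketch under stated assumptions, not the repository's loader: it assumes one example per line, whitespace tokenization, and a hypothetical to_lower parameter that mirrors the --to_lower flag used in the commands below. The function names and the "Sports" label are invented for the example.

```python
def parse_line(line, to_lower=False):
    """Split one 'TOPIC ||| TEXT' line into (topic, tokens)."""
    topic, sep, text = line.rstrip("\n").partition("|||")
    if not sep:
        raise ValueError(f"missing '|||' separator: {line!r}")
    text = text.strip()
    if to_lower:
        text = text.lower()  # case-insensitive setting
    # Whitespace tokenization is an assumption of this sketch.
    return topic.strip(), text.split()


def read_dataset(path, to_lower=False):
    """Read every non-empty line of a dataset file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line, to_lower) for line in handle if line.strip()]


# Hypothetical example line; the label is not taken from the real dataset.
topic, tokens = parse_line("Sports ||| The Match ended in a draw\n", to_lower=True)
```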

How to run

Use the following commands to recreate the different training scenarios. train.py is located in the nn folder.

Randomly initialized embeddings with case-sensitive words

python train.py \
--init random \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32  \
--output_path ./random_upper

Randomly initialized embeddings with case-insensitive words

python train.py \
--init random \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32  \
--output_path ./random_lower \
--to_lower

Pretrained word2vec embeddings with case-sensitive words and frozen embeddings

python train.py \
--init word2vec \
--pretrained_file ../word2vec_upper.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32  \
--output_path ./word2vec_upper_frozen \
--freeze_embeddings

Pretrained word2vec embeddings with case-sensitive words and fine-tuned embeddings

python train.py \
--init word2vec \
--pretrained_file ../word2vec_upper.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32  \
--output_path ./word2vec_upper_nofreeze

Pretrained word2vec embeddings with case-insensitive words and frozen embeddings

python train.py \
--init word2vec \
--pretrained_file ../word2vec_lower.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32  \
--output_path ./word2vec_lower_frozen \
--freeze_embeddings \
--to_lower

Pretrained word2vec embeddings with case-insensitive words and fine-tuned embeddings

python train.py \
--init word2vec \
--pretrained_file ../word2vec_lower.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32  \
--output_path ./word2vec_lower_nofreeze \
--to_lower

CBOW model

Pretrained embeddings with case-sensitive words

python train.py \
--init word2vec \
--pretrained_file ../word2vec_upper.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32  \
--output_path ./word2vec_upper_cbow \
--cbow

Pretrained embeddings with case-insensitive words

python train.py \
--init word2vec \
--pretrained_file ../word2vec_lower.json \
--train ../topicclass/topicclass_train.txt \
--test ../topicclass/topicclass_test.txt \
--valid ../topicclass/topicclass_valid.txt \
--batch_size 32  \
--output_path ./word2vec_lower_cbow \
--to_lower --cbow
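For orientation, the flags used in the commands above could be declared with argparse along these lines. Only the flag names come from the commands; the defaults, the required markers, the choices list, and the help texts are assumptions, so train.py itself may define them differently.

```python
import argparse


def build_parser():
    """Declare the flags seen in the commands (defaults and help are assumed)."""
    parser = argparse.ArgumentParser(description="Train a sentence classifier")
    parser.add_argument("--init", choices=["random", "word2vec"], default="random",
                        help="how to initialize the word embeddings")
    parser.add_argument("--pretrained_file",
                        help="JSON file with pretrained embeddings (for --init word2vec)")
    parser.add_argument("--train", required=True, help="training data file")
    parser.add_argument("--test", required=True, help="test data file")
    parser.add_argument("--valid", required=True, help="validation data file")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--output_path", required=True,
                        help="directory for the outputs of the run")
    parser.add_argument("--to_lower", action="store_true",
                        help="lowercase the text (case-insensitive words)")
    parser.add_argument("--freeze_embeddings", action="store_true",
                        help="keep the embeddings fixed during training")
    parser.add_argument("--cbow", action="store_true",
                        help="train the bag-of-words model instead of the CNN")
    return parser


# Parse the last command from the list above, passed as an explicit argument list.
args = build_parser().parse_args([
    "--init", "word2vec",
    "--pretrained_file", "../word2vec_lower.json",
    "--train", "../topicclass/topicclass_train.txt",
    "--test", "../topicclass/topicclass_test.txt",
    "--valid", "../topicclass/topicclass_valid.txt",
    "--batch_size", "32",
    "--output_path", "./word2vec_lower_cbow",
    "--to_lower", "--cbow",
])
```

Boolean switches such as --to_lower, --freeze_embeddings, and --cbow take no value, which is why they appear bare at the end of the commands.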

word2vec_lower.json and word2vec_upper.json contain pretrained word2vec embeddings extracted from the embeddings trained on the Google News dataset. They can be downloaded from the following links:

word2vec_lower.json - https://drive.google.com/file/d/1YpCbB0GoLf5iK_Pprr5xsM04atkxO3IT/view?usp=sharing

word2vec_upper.json - https://drive.google.com/file/d/1H_9C3guLZZIaJG_2XRNkw_hs2OCwpT9v/view?usp=sharing

These files were created using the word2vec_extract_pretrained_embeddings function in utils.py.
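The layout of these JSON files is not described here. Assuming each file maps a word to a list of floats (an assumption, not something the README confirms), an embedding matrix for a vocabulary could be built roughly like this. The helper names are hypothetical, and the fallback of small random vectors for words missing from the file is a common convention rather than the repository's documented behaviour.

```python
import json
import random


def load_pretrained(path):
    """Load a JSON file assumed to map each word to a list of floats."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one vector per vocabulary word, in vocabulary order."""
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        vector = pretrained.get(word)
        if vector is None:
            # Word missing from the pretrained file: small random vector (assumed).
            vector = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
        matrix.append(list(vector))
    return matrix


# Toy stand-in for a loaded file; real vectors would have many more dimensions.
toy_pretrained = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
matrix = build_embedding_matrix(["cat", "fish"], toy_pretrained, dim=2)
```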

punitkoura/sentence-classification