Text Classification Experiments

1. Clone the project.
2. Run "mvn install".
3. Enter the module folder with "cd classifiers".
4. Run the desired Java class using one of the following commands:

4a. To run Cross Fold using LSTM (class LSTM), see the example below:

MAVEN_OPTS=-Xmx16g mvn exec:java -Dexec.mainClass="org.insightcentre.classifiers.dl4j.text.classification.traintest.LSTM" -Dexec.args="-d src/main/resources/data/data.csv -wv src/main/resources/embeddings/Composes/EN-wform.w.5.cbow.neg10.400.subsmpl.txt -nf 10 -ne 10 -in 400 -hi 80 -trb 5 -lr 0.002 -ditClass org.insightcentre.classifiers.dl4j.text.classification.data.iterator.WV_DataIterator -composes"

4b. To run Train Test using CNN (class CNN), see the example below:

MAVEN_OPTS=-Xmx16g mvn exec:java -Dexec.mainClass="org.insightcentre.classifiers.dl4j.text.classification.traintest.CNN" -Dexec.args="-trd src/main/resources/data/training.csv -ted src/main/resources/test.csv -wv src/main/resources/embeddings/Composes/EN-wform.w.5.cbow.neg10.400.subsmpl.txt -ne 7 -in 400 -hi 100 -trb 10 -lr 0.004 -ditClass org.insightcentre.classifiers.dl4j.text.classification.data.iterator.WV_CNN_DataIterator -evalEveryN 2 -modelToSavePath src/main/resources/models/sampleCNN.model -composes -noOfFilters 50 -typeOfFilter 2"

Word Embeddings

For Testing:

	[GloVe](http://nlp.stanford.edu/projects/glove/), specifically Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): glove.6B.zip. [Download here](http://nlp.stanford.edu/data/glove.6B.zip).

For Evaluation (current plan):

	[Composes](http://clic.cimec.unitn.it/composes/semantic-vectors.html), specifically the "Best predict vectors" listed on that page (5-word context window, 10 negative samples, subsampling, 400 dimensions). [Download here](http://clic.cimec.unitn.it/composes/materials/EN-wform.w.5.cbow.neg10.400.subsmpl.txt.gz).
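
The example commands above expect the uncompressed Composes vectors under src/main/resources/embeddings/Composes/. A minimal sketch of fetching them, assuming it is run from the classifiers folder and that curl and gunzip are installed; the variable and function names are illustrative and not part of the project:

```shell
#!/bin/sh
# Target directory assumed from the -wv path used in the example commands.
EMB_DIR="src/main/resources/embeddings/Composes"
# Download link taken from the Composes entry above.
EMB_URL="http://clic.cimec.unitn.it/composes/materials/EN-wform.w.5.cbow.neg10.400.subsmpl.txt.gz"
EMB_GZ="$EMB_DIR/$(basename "$EMB_URL")"
# Name of the file once decompressed (the .gz suffix stripped).
EMB_TXT="${EMB_GZ%.gz}"

# Hypothetical helper: download and unpack the vectors unless already present.
fetch_composes() {
    mkdir -p "$EMB_DIR" || return 1
    if [ ! -f "$EMB_TXT" ]; then
        curl -fL -o "$EMB_GZ" "$EMB_URL" || return 1
        gunzip "$EMB_GZ" || return 1
    fi
}
```

Calling fetch_composes once before the first run should leave the vectors at the path passed to -wv; later calls skip the download.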

Arguments for Train Test LSTM:

Option             Description

--composes         Use Composes word embeddings
--ditClass         Data Iterator class name with path to be used
--evalEveryN       Evaluation every n epochs
--hi               Hidden layer size
--in               Input layer size
--lr               Learning rate
--modelToSavePath  Path at which to save the learnt model
--ne               Number of epochs
--ted              The CSV Data File containing the test data
--trb              Train batch size
--trd              The CSV Data File containing the training data
--wv               Word embeddings

Arguments for Train Test CNN:

Option             Description

--composes         Pass this flag (as -composes) when using Composes embeddings; omit it otherwise
--ditClass         Data Iterator class name with path to be used
--evalEveryN       Evaluation every n epochs
--hi               Hidden layer size
--in               Input layer size
--lr               Learning rate
--modelToSavePath  Path at which to save the learnt model
--ne               Number of epochs
--noOfFilters      Number of filters at the first conv layer
--ted              The CSV Data File containing the test data
--trb              Train batch size
--trd              The CSV Data File containing the training data
--truncateLength   Max tokens allowed, i.e. truncate the text after this many tokens
--typeOfFilter     Type of filter (bigram, trigram, etc.), e.g. 2 for bigram and 3 for trigram
--wv               Word embeddings
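
Since the Train Test CNN command is long, it can help to assemble it from named pieces. The sketch below rebuilds the same argument string as the 4b example, in the same order; the variable names (PKG, RES, CNN_ARGS) and the run_cnn function are illustrative assumptions, not part of the project:

```shell
#!/bin/sh
# Base package and resource folder, copied from the 4b example command.
PKG="org.insightcentre.classifiers.dl4j.text.classification"
RES="src/main/resources"

# Training and test data.
CNN_ARGS="-trd ${RES}/data/training.csv -ted ${RES}/test.csv"
# Word embeddings file.
CNN_ARGS="${CNN_ARGS} -wv ${RES}/embeddings/Composes/EN-wform.w.5.cbow.neg10.400.subsmpl.txt"
# Epochs, input size, hidden size, train batch size, learning rate.
CNN_ARGS="${CNN_ARGS} -ne 7 -in 400 -hi 100 -trb 10 -lr 0.004"
# Data iterator class.
CNN_ARGS="${CNN_ARGS} -ditClass ${PKG}.data.iterator.WV_CNN_DataIterator"
# Evaluation interval and where to save the model.
CNN_ARGS="${CNN_ARGS} -evalEveryN 2 -modelToSavePath ${RES}/models/sampleCNN.model"
# Composes flag and convolution filter settings.
CNN_ARGS="${CNN_ARGS} -composes -noOfFilters 50 -typeOfFilter 2"

# Hypothetical wrapper: run the CNN class with the arguments built above.
run_cnn() {
    MAVEN_OPTS=-Xmx16g mvn exec:java \
        -Dexec.mainClass="${PKG}.traintest.CNN" \
        -Dexec.args="${CNN_ARGS}"
}
```

Run run_cnn from the classifiers folder; changing a single setting then means editing one short line rather than the full command.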

Arguments for Train Test CNN-LSTM:

Option Description
