DCNN: A Python repository from cosmmb

This is the code for our ACL 2015 paper "Dependency-based Convolutional Neural Networks for Sentence Embedding".

Thanks to Yoon Kim for sharing his code and for his suggestions on this project. Our project extends the code for his paper "Convolutional Neural Networks for Sentence Classification" (EMNLP 2014). We post this project with Yoon's permission. You are welcome to adapt and optimize our project, but please do not use our code for commercial purposes.

This code runs on Python 2.66 and Theano 0.7.

Our model is based purely on words; it includes no POS tag information. There are many ways to improve performance by including tag information. The simplest is to treat tags as words and include them in the convolution. Another is to use different convolution filters (w in the paper) for words with different tags. You are welcome to discuss or collaborate on these extensions with us.

The paper can be found here: http://people.oregonstate.edu/~mam/pdf/papers/DCNN.pdf

This version contains only the tree+sib model, which can easily be extended to the tree+sib+seq model.
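To give a rough idea of what the tree and sib parts look at, the sketch below builds, for each word, the chain of its ancestors in the dependency tree and the list of words that share its parent. It is only an illustration and not the code in this repository: the function names, the `heads` representation (parent index per word, -1 for the root), the padding convention, and the example parse are all assumptions. It is written for a recent Python rather than the version listed above.

```python
def ancestor_path(heads, i, depth):
    """Return [i, parent(i), grandparent(i), ...] with `depth` entries.

    `heads[i]` is the index of the parent of word i, or -1 for the root.
    Positions above the root are filled with -1 (a padding position).
    """
    path = []
    node = i
    for _ in range(depth):
        path.append(node)
        if node != -1:
            node = heads[node]
    return path


def siblings(heads, i):
    """Return the indices of the other words that share word i's parent."""
    if heads[i] == -1:
        return []
    return [j for j, h in enumerate(heads) if h == heads[i] and j != i]


# Hand-written (hypothetical) parse of a TREC-style question.
words = ["What", "is", "the", "capital", "of", "France"]
heads = [1, -1, 3, 1, 3, 4]

for i, word in enumerate(words):
    path = [words[k] if k != -1 else "<pad>" for k in ancestor_path(heads, i, 3)]
    sibs = [words[k] for k in siblings(heads, i)]
    print(word, path, sibs)
```

In a model of this kind, the embeddings of the words on each path (or in each sibling group) would be concatenated and passed through a convolution filter, in the same way a sequential CNN treats a window of neighbouring words.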

File description:

1. The folder "TREC" contains the TREC dataset with 6 categories. The data is from here: http://cogcomp.cs.illinois.edu/Data/QA/QC/ . "TREC_all.txt" is the original data. "TREC_all_parsed.txt" is the result of parsing the TREC data set with the Stanford parser. "label_all.txt" contains the label for each sentence in "TREC_all.txt".

2. "preindex.py" converts the sentences in the parse file into a tree format.

3. "process_TREC.py" is the file for text processing.

4. "conv_net_classes.py" contains some basic functions for the CNN.

5. "conv_sib_gpu.py" is our main script.

6. The folder "data" is where you should put the word2vec binary file so that "process_TREC.py" can run. You can find the file here: https://code.google.com/p/word2vec/

7. "log_170.txt" lists the accuracy on the training, dev and test sets in each epoch. These results were generated on a GPU. 170 means that the batch size was 170. The other training settings can be found in "conv_sib_gpu.py".
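For readers unfamiliar with the word2vec binary file mentioned in item 6, the sketch below shows how such a file can be read. It is an illustration, not the loader used in "process_TREC.py": the function name and the in-memory demo data are assumptions, and the layout described in the docstring is the one written by the original word2vec tool as far as we know. It is written for a recent Python rather than the version listed above.

```python
import io
import struct


def read_word2vec_binary(stream, vocab=None):
    """Read word vectors from a binary stream in the word2vec layout.

    Assumed layout: an ASCII header "<vocab_size> <dim>\n", then for each
    word its bytes terminated by a space, followed by <dim> little-endian
    float32 values. Only words in `vocab` are kept (all if vocab is None).
    """
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # entries may be separated by a newline
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="replace")
        raw = stream.read(4 * dim)
        if vocab is None or word in vocab:
            vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors


# Build a tiny in-memory file with two 3-dimensional vectors and read it back.
demo = io.BytesIO()
demo.write(b"2 3\n")
demo.write(b"what " + struct.pack("<3f", 0.5, 1.0, -2.0) + b"\n")
demo.write(b"city " + struct.pack("<3f", 0.25, 0.0, 4.0) + b"\n")
demo.seek(0)

print(read_word2vec_binary(demo, vocab={"city"}))
```

Passing the data set's vocabulary as `vocab` keeps only the vectors that are actually needed, which matters because pretrained files are usually much larger than the vocabulary of a small data set such as TREC.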

Instructions:

First step: download the word2vec file and save it in the "data" folder.

Second step:
run "python process_TREC.py" ("preindex.py" is run from within this file).

Third step:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python conv_sib_gpu.py 170

This code was run on a GPU (Tesla K80), but it also works on a CPU. To test on a CPU, change device=gpu to device=cpu and floatX=float32 to floatX=float64 in the command above. Because of the precision difference between GPU and CPU, the results will differ slightly in some cases. Compared with the other hyperparameters, the performance of the model is relatively sensitive to batch_size and lr_decay. I suggest tuning these two hyperparameters first.

In our implementation, we use 10% of the training data as the dev set. We do not recycle the dev set to train the model again. Some people do this, and I believe it will improve performance.
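A 10% hold-out of this kind can be sketched as below. This is a minimal illustration, not the exact splitting code in "conv_sib_gpu.py": the function name, the fixed seed, and the stand-in data are assumptions.

```python
import random


def split_train_dev(examples, dev_fraction=0.1, seed=0):
    """Shuffle `examples` and hold out `dev_fraction` of them as a dev set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


examples = list(range(100))  # stand-in for (sentence, label) pairs
train, dev = split_train_dev(examples)
print(len(train), len(dev))  # 90 10
```

Recycling the dev set would mean retraining on `train + dev` once the hyperparameters have been chosen, which this sketch does not do.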

Mingbo Ma

cosmmb@gmail.com

EECS

Oregon State University

Sep 25 2015
