Originally forked from https://github.com/white127/insuranceQA-cnn-lstm
- Fixed some minor bugs and removed redundant code from the author's original implementation.
- Upgraded to TensorFlow 1.2 (requires >= 1.0).
- The Python version of the dataset comes from https://github.com/codekansas/insurance_qa_python
Before running the code, convert the original dataset to the author's proposed format:
cd insurance_qa_python
python3 generate_dataset_for_insuranceQA.py
To run the TensorFlow CNN model, install TensorFlow >= 1.0 and then:
cd ../../insuranceQA-cnn-lstm
PYTHONPATH=. python3 cnn/tensorflow/insqa_train.py
To run the TensorFlow LSTM-CNN model (this code lives on the "cleaner" branch), install TensorFlow >= 1.0 and then:
cd ../../insuranceQA-cnn-lstm
PYTHONPATH=. python3 lstm_cnn/tensorflow/insqa_train.py
My Accuracy:
Tool | Method | Top-1 Accuracy |
---|---|---|
Tensorflow | CNN | 0.58 |
Theano | CNN | - |
Tensorflow | LSTM-CNN | - |
Theano | LSTM-CNN | - |
A note on why my results differ from the original author's: one major reason is that negative samples are drawn at random, which easily produces many completely unrelated negatives. The model only learns useful patterns when the negatives are close enough to the positives; if they are too random, accuracy fluctuates from run to run.
One way to address this is to generate candidate answers with tf-idf, which is also what the original insuranceQA authors did in V2: https://github.com/shuzi/insuranceQA
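For illustration, here is a minimal sketch of tf-idf based candidate selection using scikit-learn (not a dependency of this repo); the function `build_hard_negatives` and its parameters are hypothetical:

```python
# Sketch: pick negatives that are lexically close to the question instead of
# sampling uniformly from the whole answer pool. Names are hypothetical.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_hard_negatives(questions, answer_pool, gold_ids, n_candidates=50):
    """questions/answer_pool: lists of strings; gold_ids[i]: set of correct answer indices."""
    vectorizer = TfidfVectorizer()
    answer_vecs = vectorizer.fit_transform(answer_pool)        # (n_answers, vocab)
    question_vecs = vectorizer.transform(questions)            # (n_questions, vocab)
    sims = cosine_similarity(question_vecs, answer_vecs)       # (n_questions, n_answers)

    negatives = []
    for i, scores in enumerate(sims):
        ranked = scores.argsort()[::-1]                        # most similar answers first
        # keep top candidates that are not the gold answer(s), then sample one
        cands = [j for j in ranked[:n_candidates] if j not in gold_ids[i]]
        negatives.append(random.choice(cands))
    return negatives
```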
-------------from Original Author-----------------------------------
See the theano and tensorflow folders.
This is a CNN/LSTM model for QA (question answering), with both Theano and TensorFlow implementations.
The Theano and TensorFlow versions share the same network structure: word embeddings + CNN + max pooling + cosine similarity.
On the insuranceQA test1 set, top-1 accuracy reaches about 62%, consistent with the paper.
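For reference, a minimal TensorFlow 1.x sketch of that structure could look like the following; the hyperparameters and variable names are illustrative and are not taken from the repo's code:

```python
# Sketch of: word embeddings -> convolution -> max pooling -> cosine similarity,
# trained with a pairwise hinge loss. Illustrative only, not the repo's exact code.
import tensorflow as tf

SEQ_LEN, VOCAB, EMB, FILTER_SIZE, NUM_FILTERS, MARGIN = 200, 20000, 100, 3, 500, 0.05

question = tf.placeholder(tf.int32, [None, SEQ_LEN])
pos_answer = tf.placeholder(tf.int32, [None, SEQ_LEN])
neg_answer = tf.placeholder(tf.int32, [None, SEQ_LEN])

embedding = tf.get_variable("embedding", [VOCAB, EMB])
conv_w = tf.get_variable("conv_w", [FILTER_SIZE, EMB, 1, NUM_FILTERS])
conv_b = tf.get_variable("conv_b", [NUM_FILTERS])

def encode(ids):
    """Embed a token-id sequence and reduce it to a single vector."""
    emb = tf.nn.embedding_lookup(embedding, ids)              # (batch, seq, emb)
    emb = tf.expand_dims(emb, -1)                             # (batch, seq, emb, 1)
    conv = tf.nn.conv2d(emb, conv_w, [1, 1, 1, 1], "VALID")   # (batch, seq-f+1, 1, filters)
    act = tf.nn.relu(tf.nn.bias_add(conv, conv_b))
    pooled = tf.reduce_max(act, axis=1)                       # max pooling over time
    return tf.reshape(pooled, [-1, NUM_FILTERS])

def cosine(a, b):
    a_norm = tf.sqrt(tf.reduce_sum(tf.square(a), 1))
    b_norm = tf.sqrt(tf.reduce_sum(tf.square(b), 1))
    return tf.reduce_sum(a * b, 1) / (a_norm * b_norm + 1e-6)

q_vec = encode(question)
pos_score = cosine(q_vec, encode(pos_answer))
neg_score = cosine(q_vec, encode(neg_answer))

# pairwise hinge loss: push the positive answer above the negative by a margin
loss = tf.reduce_mean(tf.maximum(0.0, MARGIN - pos_score + neg_score))
```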
Only the CNN code is provided here. I later tested LSTM and LSTM+CNN; LSTM+CNN performs better than either CNN or LSTM alone, improving test1 accuracy by a further 5%-6%.
The LSTM+CNN approach reaches 68% top-1 accuracy on insuranceQA test1.
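One common way to combine the two, sketched below, is to run a bidirectional LSTM over the word embeddings and apply the convolution and max pooling to its outputs. This assumes the TF 1.x `tf.contrib.rnn` API and is not necessarily the exact variant on the "cleaner" branch:

```python
# Sketch of an LSTM+CNN encoder: bidirectional LSTM over embeddings, then
# convolution + max pooling over the LSTM outputs. Illustrative only.
import tensorflow as tf

HIDDEN = 150

def lstm_cnn_encode(ids, embedding, conv_w, conv_b, num_filters):
    """conv_w is expected to have shape (filter_size, 2*HIDDEN, 1, num_filters).
    To share weights between question and answer encoders, wrap calls in a
    tf.variable_scope and set reuse=True on subsequent calls."""
    emb = tf.nn.embedding_lookup(embedding, ids)                     # (batch, seq, emb)
    fw = tf.contrib.rnn.BasicLSTMCell(HIDDEN)
    bw = tf.contrib.rnn.BasicLSTMCell(HIDDEN)
    outputs, _ = tf.nn.bidirectional_dynamic_rnn(fw, bw, emb, dtype=tf.float32)
    rnn_out = tf.concat(outputs, axis=2)                             # (batch, seq, 2*HIDDEN)
    rnn_out = tf.expand_dims(rnn_out, -1)                            # add channel dim
    conv = tf.nn.conv2d(rnn_out, conv_w, [1, 1, 1, 1], "VALID")
    act = tf.nn.relu(tf.nn.bias_add(conv, conv_b))
    pooled = tf.reduce_max(act, axis=1)                              # max pooling over time
    return tf.reshape(pooled, [-1, num_filters])
```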