

Primary Language: Python


Quora question pairs from Kaggle


Build a neural network that identifies duplicate questions among pairs of Quora questions.


The training set has 6 columns: pair id, id of the first question, id of the second question, first question, second question, and the boolean variable "isDuplicate". The test set has only the pair id, first question and second question. The "isDuplicate" column needs to be predicted for the test set.

https://yadi.sk/d/v8ras2A93GcJzV link to the train set

https://yadi.sk/d/KkgteUcs3GcK47 link to the test set
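To make the column layout concrete, here is a minimal sketch that reads question pairs and checks the header. The column names (pair_id, qid1, qid2, question1, question2) and the two sample rows are assumptions made for illustration; the actual CSV headers in the linked files may differ.

```python
import csv
import io

# Hypothetical column names; the real CSV headers may differ.
TRAIN_COLUMNS = ["pair_id", "qid1", "qid2", "question1", "question2", "isDuplicate"]
TEST_COLUMNS = ["pair_id", "question1", "question2"]

# Tiny in-memory stand-in for the training file (made-up rows).
sample_train_csv = io.StringIO(
    "pair_id,qid1,qid2,question1,question2,isDuplicate\n"
    "0,1,2,How do I learn Python?,What is the best way to learn Python?,1\n"
    "1,3,4,How do I learn Python?,What is the capital of France?,0\n"
)


def read_pairs(handle, expected_columns):
    """Read question pairs and check that the header matches the expected layout."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != expected_columns:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


train_rows = read_pairs(sample_train_csv, TRAIN_COLUMNS)
# The label is stored as text in the CSV, so convert it to 0/1.
labels = [int(row["isDuplicate"]) for row in train_rows]
```

The same function could be called with TEST_COLUMNS on the test file, where no label column is expected.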


I am going to use GloVe word embeddings (or maybe some other embeddings) to construct word vectors. The loss function is abs(Ytrue - Ypredicted), where Ytrue is either 0 or 1, depending on the "isDuplicate" variable.
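A rough sketch of these two pieces follows. It assumes the embeddings come in the plain-text GloVe layout (one word per line, followed by its space-separated vector components) and that the per-pair loss is averaged over a batch; the averaging and the function names load_embeddings and absolute_error are choices made here for illustration. The three-dimensional vectors are made up.

```python
import io


def load_embeddings(handle):
    """Parse GloVe-style text: a word followed by its vector components on each line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def absolute_error(y_true, y_pred):
    """Mean of |y_true - y_pred| over a batch (the averaging is an assumption)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy vectors standing in for a real embeddings file.
toy_file = io.StringIO("cat 0.1 0.2 0.3\ndog 0.2 0.1 0.4\n")
embeddings = load_embeddings(toy_file)

# Two pairs: a duplicate predicted at 0.75 and a non-duplicate predicted at 0.25.
loss = absolute_error([1, 0], [0.75, 0.25])
```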


For each question pair: 1) delete all duplicate words; 2) calculate the average word vector of the remaining words in the first question and in the second question; 3) subtract one average from the other; 4) train a simple neural network with one hidden layer of 100 nodes and a sigmoid activation function to produce the required predictions. I should also note that the set of question pairs should be artificially augmented: for each question we should create many new, previously unseen question pairs in order to get more non-duplicate pairs.
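The four steps above can be sketched with NumPy as follows. Several things here are assumptions rather than part of the plan: "duplicate words" is read as words that occur in both questions, the embeddings are random stand-ins for GloVe, the tokenizer is deliberately naive, and the training loop is plain full-batch gradient descent on the mean absolute error. The toy pairs are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # assumed embedding size

# Hypothetical stand-in for GloVe: random vectors for a toy vocabulary.
vocab = "how do i learn python what is the best way to capital of france".split()
embeddings = {w: rng.normal(size=DIM) for w in vocab}


def tokenize(question):
    """Naive tokenizer: lowercase, drop question marks, split on whitespace."""
    return question.lower().replace("?", "").split()


def pair_features(q1, q2):
    """Steps 1-3: drop shared words, average the rest, subtract the averages."""
    w1, w2 = tokenize(q1), tokenize(q2)
    shared = set(w1) & set(w2)  # assumed meaning of "duplicate words"

    def mean_vec(words):
        vecs = [embeddings[w] for w in words if w not in shared and w in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

    return mean_vec(w1) - mean_vec(w2)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(X, y, hidden=100, epochs=200, lr=0.1):
    """Step 4: one hidden layer of `hidden` sigmoid nodes, sigmoid output."""
    n, dim = X.shape
    W1 = rng.normal(scale=0.1, size=(dim, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, 1))
    b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    for _ in range(epochs):
        h = sigmoid(X @ W1 + b1)
        p = sigmoid(h @ W2 + b2)
        # Gradient of mean |y - p| through the output sigmoid.
        dz2 = np.sign(p - y) * p * (1 - p) / n
        # Backpropagate to the hidden layer before W2 is updated.
        dz1 = (dz2 @ W2.T) * h * (1 - h)
        W2 -= lr * (h.T @ dz2)
        b2 -= lr * dz2.sum(axis=0)
        W1 -= lr * (X.T @ dz1)
        b1 -= lr * dz1.sum(axis=0)
    return W1, b1, W2, b2


def predict(params, X):
    W1, b1, W2, b2 = params
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel()


# Made-up toy pairs: (first question, second question, isDuplicate).
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How do I learn Python?", "What is the capital of France?", 0),
]
X = np.stack([pair_features(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs], dtype=float)

params = train(X, y)
predictions = predict(params, X)  # one value in (0, 1) per pair
```

Under this reading of step 1, two questions with exactly the same words produce an all-zero feature vector, so the network sees only the words that differ between the two questions.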
