Classifying Duplicate Questions with TensorFlow

The following are brief descriptions of the files you may find useful when analysing the code.

qqp_BaselineModels.py

This is the base script that performs EDA and text feature generation. Because the Quora dataset is large, feature generation is done in chunks and saved in HDF5 file format.
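The chunked pattern can be sketched as follows. This is a minimal illustration, not the script's actual code: the column names `question1`/`question2` match the Quora dataset, but `basic_text_features`, the features themselves, and the chunk size are assumptions, and the HDF5 write is shown only as a comment:

```python
import pandas as pd

def basic_text_features(chunk: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical per-row features; the real script generates many more.
    out = pd.DataFrame(index=chunk.index)
    out["q1_len"] = chunk["question1"].str.len()
    out["q2_len"] = chunk["question2"].str.len()
    out["word_share"] = [
        len(set(a.lower().split()) & set(b.lower().split()))
        / max(len(set(a.lower().split()) | set(b.lower().split())), 1)
        for a, b in zip(chunk["question1"], chunk["question2"])
    ]
    return out

def generate_in_chunks(df: pd.DataFrame, chunk_size: int = 2) -> pd.DataFrame:
    parts = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        parts.append(basic_text_features(chunk))
        # In the real script each chunk would be appended to an HDF5 store,
        # e.g.: parts[-1].to_hdf("df_all_train.h5", key="features",
        #                        append=True, format="table")
    return pd.concat(parts)

df = pd.DataFrame({
    "question1": ["How do I learn Python", "What is AI", "Why is sky blue"],
    "question2": ["How can I learn Python", "What is AI exactly", "Why is the sky blue"],
})
features = generate_in_chunks(df)
```

Processing chunk by chunk keeps peak memory bounded by the chunk size rather than the full dataset.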

Please note that because generating the semantic_similarity and word_order_similarity scores takes a long time, I have generated the data separately into an HDF5 file (df_all_train.wordnet.h5). This file contains the semantic_similarity and word_order_similarity scores for each of the training and test questions from Quora. A join by index is then used to tag these scores back onto their respective questions.
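The index join can be sketched like this. The inline frames are illustrative stand-ins (the real scores would be loaded with something like `pd.read_hdf("df_all_train.wordnet.h5")`), but the join itself is the same operation:

```python
import pandas as pd

# Main feature frame, indexed by question-pair id (values are made up).
df_all = pd.DataFrame({"word_share": [0.8, 0.3, 0.5]}, index=[0, 1, 2])

# Stand-in for the pre-computed WordNet scores from df_all_train.wordnet.h5.
wordnet = pd.DataFrame(
    {"semantic_similarity": [0.9, 0.2, 0.6],
     "word_order_similarity": [0.7, 0.1, 0.4]},
    index=[0, 1, 2],
)

# Joining by index tags the scores back onto their question pairs.
df_all = df_all.join(wordnet)
```

`DataFrame.join` aligns on the index by default, so each row picks up the scores computed for the same question pair.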

Next, we use word2vec to generate image features and, due to memory constraints, append these features to the dataframe in chunks, saving it into another HDF5 file (df_all_train.h5).

Once the df_all_train.h5 file is created, it becomes the core set of text features used to train the models.

Also included in this file is code for GridSearchCV using SVM and XGBoost. Please note that these take a long time to train.
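A minimal GridSearchCV sketch of the SVM side, on synthetic data so it runs in seconds (the actual parameter grids, feature matrix, and the XGBoost search in the script are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Tiny synthetic stand-in for the text-feature matrix.
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# Illustrative grid; the real search is far larger and correspondingly slow.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```

GridSearchCV fits one model per parameter combination per fold (here 3 × 2 × 3 = 18 fits), which is why the full-size search over the Quora features is slow.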

global_settings.py

Settings file used by qqp_BaselineModels.py.

img_feat_gen.py

Generates the image features based on the word2vec similarity scores.
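The idea of an "image" built from word2vec similarities can be sketched as a word-by-word cosine-similarity matrix padded to a fixed size. The toy embeddings and the `similarity_image` helper below are assumptions for illustration; the real script would look vectors up in a trained word2vec model:

```python
import numpy as np

# Toy 2-d embeddings standing in for a trained word2vec model (assumption).
EMB = {
    "learn":  np.array([1.0, 0.0]),
    "python": np.array([0.0, 1.0]),
    "study":  np.array([0.9, 0.1]),
    "coding": np.array([0.1, 0.9]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_image(q1_words, q2_words, size=4):
    # Pairwise word similarities form a 2-D grid; padding/cropping to a
    # fixed size gives every question pair the same "image" shape.
    img = np.zeros((size, size))
    for i, w1 in enumerate(q1_words[:size]):
        for j, w2 in enumerate(q2_words[:size]):
            if w1 in EMB and w2 in EMB:
                img[i, j] = cosine(EMB[w1], EMB[w2])
    return img

img = similarity_image(["learn", "python"], ["study", "coding"])
```

A fixed-size grid like this is the kind of input a CNN can consume directly.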

wordnetutils.py

Generates WordNet-based semantic_similarity and word_order_similarity scores. You probably don't need to run this code for the exercise; just use the df_all_train_pres.h5 provided in the link below. However, if you want to use this code in your own work, please remember to attribute it to Sujit Pal and this site.

parallelproc.py

Implementation of parallel apply for Python on Mac. Unfortunately this code does not yet work on Windows, as it relies on the UNIX fork system call to spawn worker processes.
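A minimal fork-based parallel apply can be sketched with `multiprocessing` using the "fork" start method (not the repo's actual implementation; the function names here are illustrative). With fork, each worker inherits the parent's memory, which is exactly what is unavailable on Windows:

```python
import multiprocessing as mp
import pandas as pd

def _square(x):
    return x * x

def parallel_apply(series: pd.Series, func, processes: int = 2) -> pd.Series:
    # "fork" start method: workers are forked copies of the parent process
    # (UNIX only, hence the Windows limitation noted above).
    ctx = mp.get_context("fork")
    with ctx.Pool(processes) as pool:
        results = pool.map(func, series.tolist())
    return pd.Series(results, index=series.index)

s = pd.Series([1, 2, 3, 4])
out = parallel_apply(s, _square)
```

Windows only supports the "spawn" start method, which starts fresh interpreters instead of forking, so fork-dependent code fails there.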

qqp_TensorFlowCNN_Model.py

Code as shown in presentation and Jupyter notebook.

AllowedNumbers.csv & AllowedStopwords.csv

Lists of numbers and stopwords that will not be removed during pre-processing.
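The allow-list idea can be sketched as follows. The inline sets are illustrative stand-ins for the contents of AllowedStopwords.csv and AllowedNumbers.csv, and `preprocess` is a hypothetical helper, not the script's actual function:

```python
# Ordinary stopword list (illustrative subset).
STOPWORDS = {"the", "is", "a", "not", "what"}
# Tokens exempted from removal, e.g. loaded from the CSVs in the repo.
ALLOWED_STOPWORDS = {"not", "what"}   # stand-in for AllowedStopwords.csv
ALLOWED_NUMBERS = {"2", "64"}         # stand-in for AllowedNumbers.csv

def preprocess(text: str) -> list[str]:
    kept = []
    for tok in text.lower().split():
        if tok in STOPWORDS and tok not in ALLOWED_STOPWORDS:
            continue  # ordinary stopword: drop it
        if tok.isdigit() and tok not in ALLOWED_NUMBERS:
            continue  # ordinary number: drop it
        kept.append(tok)
    return kept

tokens = preprocess("What is a 64 bit OS and what is 99 not")
```

Exempting tokens like "not" or "64" matters for duplicate detection, since negations and specific numbers can flip whether two questions mean the same thing.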

Links to pre-built data files

You need to download the pre-built training data set before you can run the examples. Please note that these are large files.

df_all_temp_pres.csv - Approx 92 MB

df_all_train_pres.h5 - Approx 479 MB