- tensorflow==1.3.0
- keras==2.1.2
- pandas==0.20.3
- numpy==1.13.3
- tqdm
- nltk==3.2.4
- re (Python standard library; no installation needed)
- train_test_split.ipynb (splits the dataset into train (90%) and test (10%))
- Basemodel3.ipynb - 80% accuracy
- Data_processing_base_model6.ipynb - 81% accuracy
- Bidirectional_lstm_with_attention-base_model5.ipynb - 81.2% accuracy (Final Model)
- test.py (For inference)
```
python test.py --model_weights "models/base_model5" --input_data_loc "data/model_test.csv"
```
- Create a folder named `models`
- Download the model weights from https://drive.google.com/file/d/17MK7Gj4j-Udb1w1ygQLkJJK7loszltWU/view?usp=sharing into your models folder
- Download the embedding_matrix and tokenizer used for the model from https://drive.google.com/file/d/1Evw1XmyLZCZSruqA60aruuKc2xdIh_hW/view?usp=sharing and save them in your data folder [default]
- Run the above command, specifying the proper locations
- https://gist.github.com/prats226/4ba1856a91664671dd7ef9bf9e821ff9
- https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question/blob/master/deepnet.py
- https://github.com/bradleypallen/keras-quora-question-pairs/blob/master/keras-quora-question-pairs.py
- https://github.com/facebookresearch/poincare-embeddings
- https://github.com/facebookresearch/fastText
- https://pdfs.semanticscholar.org/b31e/447edb0af6ab5ddd4fc0ce3d4a8c6c70882e.pdf?_ga=2.53550959.631894776.1517046815-1803697881.1517046815
- https://github.com/bradleypallen/keras-quora-question-pairs
- The models are built on embedding layers (word embeddings) and LSTMs (which are good at retaining context over long sequences); some models use a bidirectional LSTM with an attention mechanism (http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/).
- GloVe embeddings (https://nlp.stanford.edu/projects/glove/) were used to initialize the embedding weights. Words not present in GloVe were represented with zeros. The embedding layers were mostly kept frozen during training, since fine-tuning them led to overfitting on the dataset.
- BatchNormalization and dropout were used between the dense layers to regularize the network, avoid overfitting, and speed up training.
- ReLU activations are used throughout the network; the output layer uses a sigmoid.
- Binary cross-entropy is used as the loss, with the Adam optimizer computing the gradient updates.
- Basic text preprocessing follows the code and comments in this Kaggle kernel (https://www.kaggle.com/currie32/the-importance-of-cleaning-text), which improved model performance.
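The GloVe initialization described above can be sketched as follows. This is a minimal sketch: the function name, the file path, and the vocabulary are illustrative, and the loop assumes the standard GloVe text format (one word followed by its vector per line). Words missing from GloVe keep an all-zero row, matching the approach above.

```python
import numpy as np

def build_embedding_matrix(word_index, glove_path, embedding_dim=300):
    """Build an embedding matrix from GloVe vectors.

    word_index maps each word to a positive integer index (row 0 is
    reserved for padding). Rows for words absent from GloVe stay
    all-zero, as described in the notes above.
    """
    # Load GloVe vectors into a dict: word -> vector
    embeddings = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
    for word, idx in word_index.items():
        vec = embeddings.get(word)
        if vec is not None:
            matrix[idx] = vec  # known word: copy its GloVe vector
        # unknown word: the row stays zero
    return matrix
```

The resulting matrix would then initialize a Keras `Embedding` layer with `weights=[matrix], trainable=False`, keeping it frozen during training as noted above.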
- Bilateral Multi-Perspective Matching for Natural Language Sentences (https://arxiv.org/pdf/1702.03814.pdf)
  - This paper achieves an accuracy of 87% with an intuitive architecture:
  - Given two sentences P and Q, the model first encodes them with a BiLSTM encoder.
  - Next, the two encoded sentences are matched in both directions: P against Q and Q against P.
  - In each matching direction, every time step of one sentence is matched against all time steps of the other sentence from multiple perspectives.
  - Another BiLSTM layer aggregates the matching results into a fixed-length matching vector.
  - Based on the matching vector, a decision is made through a fully connected layer.
- keras Implementation: https://github.com/ijinmao/BiMPM_keras
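The multi-perspective matching operation at the core of BiMPM can be sketched in plain NumPy. Here the perspective weight matrix `W` would be learned during training; the values used below are purely illustrative, and the function name is my own:

```python
import numpy as np

def multi_perspective_match(v1, v2, W):
    """Multi-perspective cosine matching between two hidden vectors.

    v1, v2: hidden vectors of shape (d,)
    W:      perspective weights of shape (l, d), one row per perspective
    Returns m of shape (l,), where m[k] is the cosine similarity of
    v1 and v2 after elementwise reweighting by perspective W[k].
    """
    a = W * v1  # shape (l, d): each row is one reweighted view of v1
    b = W * v2
    eps = 1e-8  # guard against division by zero for all-zero vectors
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    )
```

In the paper's full-matching strategy this is applied between each time step of P and the final state of Q (and vice versa); the other strategies (maxpooling, attentive, max-attentive) differ only in which vectors of the other sentence are compared.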