Quora-Question-Pairs

Uses scikit-learn and Keras to compare Random Forest and Siamese Manhattan-distance LSTM classifiers for deciding whether a pair of Quora questions is a duplicate.
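A Siamese Manhattan LSTM encodes both questions with the same LSTM and is commonly scored with exp(-L1 distance) between the two final hidden states. The sketch below shows only that scoring step in plain Python; the function name and the example vectors are hypothetical, and whether the notebooks in this repository use exactly this form is an assumption.

```python
import math


def manhattan_similarity(h_left, h_right):
    """Return exp(-L1 distance) between two equal-length vectors.

    The result lies in (0, 1]; identical vectors score exactly 1.0.
    """
    if len(h_left) != len(h_right):
        raise ValueError("encodings must have the same length")
    l1_distance = sum(abs(a - b) for a, b in zip(h_left, h_right))
    return math.exp(-l1_distance)


# Hypothetical LSTM encodings, made up for illustration only.
same = manhattan_similarity([0.2, -0.5, 0.1], [0.2, -0.5, 0.1])
different = manhattan_similarity([0.2, -0.5, 0.1], [-0.9, 0.7, 0.4])
```

Because the score is bounded between 0 and 1, it can be compared directly against the 0/1 duplicate label during training.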

Capstone Document: https://github.com/azhenxuan/Quora-Question-Pairs/blob/master/Capstone%20Project.pdf
Python 2.7 (using Anaconda's distribution)
Data obtained from:
- Both train.csv and test.csv: https://www.kaggle.com/c/quora-question-pairs/data
- Pre-trained Google News word2vec vectors: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit | mirror: https://github.com/mmihaltz/word2vec-GoogleNews-vectors
Libraries required:
- NLTK: https://anaconda.org/anaconda/nltk
- Keras (TensorFlow backend): https://anaconda.org/anaconda/keras
- python-Levenshtein: https://pypi.python.org/pypi/python-Levenshtein
- scikit-learn: https://anaconda.org/anaconda/scikit-learn
- matplotlib: https://anaconda.org/anaconda/matplotlib
- numpy
- pandas
- gensim: https://anaconda.org/anaconda/gensim
- TensorFlow (CPU or GPU build): https://www.tensorflow.org/install/
Requirements file to replicate environment: https://github.com/azhenxuan/Quora-Question-Pairs/blob/master/requirements.txt
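
python-Levenshtein is listed above, which suggests an edit-distance feature between the two questions, for example as an input to the Random Forest. The following is a minimal pure-Python sketch of such a feature, assuming a length-normalized similarity; the function names are hypothetical, and in practice `Levenshtein.distance` from python-Levenshtein would replace the hand-written distance.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_similarity(q1, q2):
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(q1), len(q2))
    if longest == 0:
        return 1.0
    return 1.0 - float(levenshtein(q1, q2)) / longest
```

Normalizing by the longer question keeps the feature comparable across short and long question pairs.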
