Various models and code for the paraphrase identification task, specifically with the Quora Question Pairs dataset.

Primary language: Python. License: MIT.

paraphrase-id-tensorflow

Various models and code for paraphrase identification, implemented in TensorFlow (1.1.0).

I took great care to document the code and explain what I'm doing at each step of the models; hopefully it will serve as didactic example code for those looking to get started with TensorFlow!

So far, this repo has implemented:

PRs that add more models, or that optimize or patch existing ones, are more than welcome! The bulk of the model code resides in duplicate_questions/models.

A lot of the data processing code is taken from or inspired by allenai/deep_qa; go check them out if you like how this project is structured!

Installation

This project has been tested on Python 3.5, and the package requirements are in requirements.txt.

To install the requirements:

pip install -r requirements.txt

GPU Training and Inference

Note that the requirements.txt file specifies tensorflow as a dependency, which is the CPU-only build of TensorFlow. If you have a GPU, you should uninstall the CPU build and install the GPU build by running:

pip uninstall tensorflow
pip install tensorflow-gpu
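
After swapping the packages, it can be worth confirming that TensorFlow actually sees the GPU. The following is a minimal sketch, not part of this repo: it assumes a TensorFlow 1.x install, where `device_lib.list_local_devices()` is available, and the helper names `gpu_device_names` and `visible_gpus` are made up for illustration.

```python
def gpu_device_names(devices):
    """Return the names of all devices whose type is "GPU"."""
    return [d.name for d in devices if d.device_type == "GPU"]


def visible_gpus():
    """List the GPUs that the installed TensorFlow build can see."""
    # Imported lazily so gpu_device_names() works without TensorFlow.
    # Assumption: TensorFlow 1.x, which ships this module.
    from tensorflow.python.client import device_lib

    return gpu_device_names(device_lib.list_local_devices())
```

Calling `print(visible_gpus())` in a Python shell should print an empty list if only the CPU build is installed or no GPU is visible.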

Getting / Processing The Data

To begin, run the following to generate the auxiliary directories for storing data, trained models, and logs:

make aux_dirs

In addition, if you want to use pretrained GloVe vectors, run:

make glove

which will download pretrained GloVe vectors to data/external/. Extract the files in that same directory.
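
To sanity-check the extracted vectors, you can parse them directly. GloVe text files store one token per line, followed by its space-separated vector components. The sketch below is illustrative only and is not how this repo loads embeddings; the function names and the example file name are assumptions.

```python
def read_glove(lines, vocab=None):
    """Parse GloVe-formatted text lines into a {word: [float, ...]} dict.

    If `vocab` is given, only words contained in it are kept.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        if vocab is not None and word not in vocab:
            continue
        vectors[word] = [float(v) for v in values]
    return vectors


def load_glove(path, vocab=None):
    """Read a GloVe text file from disk (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        return read_glove(f, vocab)
```

For example, `load_glove("data/external/glove.6B.300d.txt", vocab={"what", "is"})` would keep only two words; the file name here is hypothetical and depends on which archive `make glove` downloads.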

Quora Question Pairs

To use the Quora Question Pairs data, download the dataset from Kaggle (may require an account). Place the downloaded zip archives in data/raw/, and extract the files to that same directory.
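
If you would rather script the extraction step than unzip by hand, something along these lines should work. This is a sketch using only the standard library; `extract_archives` is a hypothetical helper and is not provided by this repo.

```python
import glob
import os
import zipfile


def extract_archives(directory):
    """Extract every .zip file in `directory` into that same directory.

    Returns the names of all extracted members.
    """
    extracted = []
    for archive_path in sorted(glob.glob(os.path.join(directory, "*.zip"))):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(directory)
            extracted.extend(archive.namelist())
    return extracted
```

Running `extract_archives("data/raw")` from the project root would unpack the downloaded archives in place.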

Then, run:

make quora_data

to automatically clean and process the data with the scripts in scripts/data/quora.
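
If the processing step fails, a quick look at the raw file can help confirm that the download and extraction went as expected. The sketch below counts the labels in a CSV file; it assumes the Kaggle training file has an `is_duplicate` column, and `label_counts` is a hypothetical helper that is separate from the scripts in scripts/data/quora.

```python
import csv
from collections import Counter


def label_counts(csv_path, label_column="is_duplicate"):
    """Count how often each value of `label_column` occurs in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row[label_column] for row in csv.DictReader(f))
```

For instance, `label_counts("data/raw/train.csv")` would report how many pairs are marked as duplicates; the path assumes the archive extracts to a file named train.csv.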

Running models

To train a model, or to load a trained model and predict with it, run the scripts in scripts/run_model/ with python <script_path>. You can get additional documentation about the parameters they take by running python <script_path> -h.
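
To skim the options of every script at once, you could loop over the directory and call each script with -h. This is a rough sketch: it assumes the runnable scripts are .py files sitting directly in scripts/run_model/, and both helper names are made up for illustration.

```python
import glob
import os
import subprocess
import sys


def list_scripts(directory="scripts/run_model"):
    """Return the .py files found directly inside `directory`, sorted."""
    return sorted(glob.glob(os.path.join(directory, "*.py")))


def script_help(script_path):
    """Run `python <script_path> -h` and return the text it prints."""
    return subprocess.check_output(
        [sys.executable, script_path, "-h"],
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
```

From the project root, `for path in list_scripts(): print(script_help(path))` would then print the help text for each script in turn.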

Contributing

Do you have ideas on how to improve this repo? Have a feature request, bug report, or patch? Feel free to open an issue or PR, as I'm happy to address issues and look at pull requests.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make aux_dirs` or `make quora_data`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- Original immutable data (e.g. Quora Question Pairs).
│
├── logs               <- Logs from training or prediction, including TF model summaries.
│
├── models             <- Serialized models.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment
│
├── duplicate_questions <- Module with source code for models and data.
│   ├── data           <- Methods and classes for manipulating data.
│   │
│   ├── models         <- Methods and classes for training models.
│   │
│   └── util           <- Various helper methods and classes for use in models.
│
├── scripts            <- Scripts for processing data and running models.
│   ├── data           <- Scripts to clean and split data
│   │
│   └── run_model      <- Scripts to train and predict with models.
│
└── tests              <- Directory with unit tests.