This repo implements a set of question-answering networks on the SQuAD dataset. They range from baselines that use just a dense layer or an LSTM as the decoder to models that were state of the art at the time of publication. The models are:
- Base-Model with Dense Layer as Decoder
- LSTM-Model with a BiLSTM Layer plus an implementation of Pointer Net as Decoder
- MatchLSTM
- RNet
- QANet
While they have since been improved upon by transfer-learning approaches such as ELMo, OpenAI GPT or BERT, I hope they still provide a useful reference for interested practitioners.
We used preprocessing and layer implementations from a set of other great repos implementing those models which are:
- https://github.com/HKUST-KnowComp/R-Net
- https://github.com/NLPLearn
- https://github.com/MurtyShikhar/Question-Answering
This repo has been tested using Python 3.6 and the following package versions:
- numpy==1.16.2
- tabulate==0.8.3
- tqdm==4.31.1
- bottle==0.12.16
- spacy==2.0.18
- pyyaml==5.1
- tensorflow-gpu==1.12
First, you need to download the SQuAD dataset and the GloVe word embeddings and store them in the folders datasets/squad and datasets/glove, respectively.
You can do this by running
$ ./download.sh
in your terminal.
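To confirm that the download put the data where the models expect it, a quick check along the following lines can help. This is a minimal sketch, not part of the repo: the function name `missing_data_dirs` is hypothetical, and it assumes it is run from the repo root. Only the two folder names come from the instructions above.

```python
from pathlib import Path

# Folder names taken from the instructions above; everything else is illustrative.
EXPECTED_DIRS = ("datasets/squad", "datasets/glove")


def missing_data_dirs(repo_root="."):
    """Return the expected data folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing data folders:", ", ".join(missing))
    else:
        print("All data folders found.")
```

If a folder is reported as missing, re-running the download script is the first thing to try.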
Next, you need to install the necessary python packages. I have provided a Pipfile for this purpose. To be able to use it, you need to have pipenv installed on your system. To do this, run
$ pip install -U pipenv
The Pipfile includes the path to the TensorFlow wheel with GPU support, which is recommended for training the models. If you want to install TensorFlow without GPU support, you can comment out the relevant line and uncomment the line below it, which is the path to the TensorFlow wheel without GPU support.
Once this is done, you can create a virtual environment with all necessary packages by executing
$ pipenv install
You can access the virtualenv using
$ pipenv shell
Once you're done, exit it by typing
$ exit
If you have CUDA installed on your system, you can run the models within this virtual environment.
You can also try running the models in a docker container, which can be created using the Dockerfile in this repo. Build the Docker Image using
$ docker build -t nlp-tf1.12 .
if you have Nvidia-Docker installed (follow the instructions here and here to install it). Once this is done, start the Docker container using
$ docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -v /path/to/repo/question-answering:/question-answering -it nlp-tf1.12 /bin/bash
Change to the directory of the repo,
$ cd /question-answering
and start training/predicting!
Each model includes a config.yaml file with all parameters. Parameters which have the value "-1" are not relevant for this particular model. Any paths or parameter values can be changed using these files. To train any of the models, run the following command in your docker container:
$ python3 train.py --config [model_name]/config.yaml
This will start training and print the current state on screen as well as into an output file called "phrase_level_qa". If the current epoch improves the score, a checkpoint of the model is saved into out_dir defined in config.yaml.
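The "save only when the score improves" behaviour can be pictured with the sketch below. It is an illustration under assumptions, not the code in train.py: the class `BestScoreTracker` is hypothetical, the scores are made-up example values, and which metric the repo actually compares (EM, F1 or something else) is not stated here.

```python
class BestScoreTracker:
    """Hypothetical helper that remembers the best validation score so far."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_epoch = None

    def update(self, epoch, score):
        """Return True if score beats the best so far, i.e. a checkpoint should be saved."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            return True
        return False


# Made-up per-epoch scores, only to show which epochs would trigger a save.
tracker = BestScoreTracker()
example_scores = [50.1, 55.3, 54.8, 57.0]
saved_epochs = [
    epoch
    for epoch, score in enumerate(example_scores, start=1)
    if tracker.update(epoch, score)
]
print(saved_epochs)
```

With this scheme the third epoch would not produce a checkpoint, because its score is lower than the one from the second epoch.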
To get the results on the test set, you have to specify in the corresponding config file the folder in which the model checkpoint you want to test is stored. The parameter in the config.yaml file is called use_out_dir. For example, if the model checkpoint is stored in folder 20190403083913, specify this in the config file. The parameter checkpoint specifies which checkpoint should be loaded. If it is an empty string, the last checkpoint will be loaded. If you want to load checkpoint model-9, you would specify '9' in config.yaml.
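The way the two parameters combine can be sketched as follows. This is a hedged illustration, not the repo's loading code: the function `resolve_checkpoint` and the example path `runs/qanet` are hypothetical, and it assumes that the folder named by use_out_dir sits directly under out_dir.

```python
import os


def resolve_checkpoint(out_dir, use_out_dir, checkpoint=""):
    """Return (run_dir, checkpoint_prefix); the prefix is None for "load the latest"."""
    run_dir = os.path.join(out_dir, use_out_dir)
    if checkpoint == "":
        # Empty string: the caller picks the most recent checkpoint itself,
        # for example with tf.train.latest_checkpoint(run_dir) in TensorFlow 1.x.
        return run_dir, None
    # Checkpoint '9' maps to the file prefix model-9 inside the run folder.
    return run_dir, os.path.join(run_dir, "model-{}".format(checkpoint))


# "runs/qanet" is an assumed out_dir, used only for illustration.
run_dir, prefix = resolve_checkpoint("runs/qanet", "20190403083913", "9")
print(prefix)
```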
Once this is done run
$ python3 test.py --config [model_name]/config.yaml
This will display the result on screen and in the log file. It will also save the predictions in a json file called answer.json. This file can be used to compute the score using the official SQuAD evaluation script. This can be done by running
$ python3 evaluate-v1.1.py ./data/squad/dev-v1.1.json ./runs/[model_name]/answer.json
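For readers unfamiliar with the two metrics the evaluation script reports, the sketch below shows how exact match (EM) and token-level F1 are computed for a single prediction. It follows the logic of the SQuAD v1.1 evaluation as I understand it (lowercase, strip punctuation and English articles, collapse whitespace), but the function names here are my own and the official script remains the reference. The official script also takes the maximum over several ground-truth answers per question, which is omitted here.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction such as "in Paris France" against the answer "Paris" gets no credit under EM but partial credit under F1, which is why F1 is always at least as high as EM in the table of results.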
Here I've collected some results from the various implementations. The results come from a held-out split of the training set, not the official dev set. All experiments were run on an NVIDIA GTX 1080 Ti GPU.
Model | Training Epochs | Size | EM | F1 | train-time (hrs) |
---|---|---|---|---|---|
base-model | 30 | 150 | 26.65 | 37.27 | ~18 |
lstm | 30 | 150 | 43.68 | 54.12 | ~30 |
match-lstm | 30 | 150 | 60.31 | 70.26 | ~35 |
cudnn-match-lstm | 50 | 150 | 60.43 | 70.32 | ~12 |
QA-Net | 50 | 128 | 67.81 | 77.74 | ~15 |
R-Net | 30 | 32 | 57.04 | 66.99 | ~38 |
cudnn-R-Net | 50 | 32 | 68.84 | 78.14 | ~8 |