Models and code for addressing the Winograd Schema Challenge with training data from the SNLI/MultiNLI corpora.
For convenience, this repository pre-packages some dependency code and data not created by the authors.
- The dataset of Winograd Schemas at ./datasets/winograd/WSCollection.xml is taken from Ernest Davis's (NYU) website.
- We package a modified implementation of the baseline NLI models from the Machine Learning for Language Group at NYU, which is stored in ./model.
First, install Python 3. Then clone this repository:
git clone https://github.com/sgbalogh/nlu-winograd
cd nlu-winograd
Pre-requisites can be installed simply with:
make
Optionally, you can run the test suite with:
make test
In order to run the TensorFlow NLI model implementations, some datasets need to be downloaded first.
Create a data directory in ./model containing an additional nested directory winograd; additionally, create a logs directory within ./model:
mkdir -p ./model/data/winograd
mkdir -p ./model/logs
cd ./model/data
Then download and unzip SNLI, MNLI, and GloVe:
wget https://www.nyu.edu/projects/bowman/multinli/multinli_0.9.zip
wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip
wget http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip '*.zip'
We also need the Stanford Parser, which should be stored in ./apps. Return to the root of the repository first:
cd ../..
mkdir -p ./apps
cd apps
wget https://nlp.stanford.edu/software/stanford-parser-full-2018-02-27.zip
unzip ./*.zip
Now you should be all set.
The repository contains a copy of the XML document provided by Ernest Davis. The local copy is located at datasets/winograd/WSCollection.xml.
From a command line, open a Python 3 shell in the root directory of this repository.
import wnlu
## Initializing a translator class automatically
## parses all of the examples from the XML document:
loader = wnlu.WinogradLoader()
## This loops through the train set instances and prints out
## the original premise content:
for instance in loader.get_train_set():
    print(instance.get_premise())
winograd_example = loader.get_train_set()[0]
print(winograd_example.get_premise())
## Get a list of the two possible translations of the
## schema (i.e., the two ways of replacing the pronoun):
possible_translations = winograd_example.get_candidate_translations()
## To just view the possible answers:
winograd_example.answers
## If we want to see the GOLD label, we can get the index
## of it within the answers list (above) using:
winograd_example.gold_answer_idx
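## The gold answer string itself can then be printed by
## indexing into the answers list with that index:
print(winograd_example.answers[winograd_example.gold_answer_idx])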
## Load from the Rahman/Ng corpus instead of Winograd:
rahman_ng_set = loader.get_rahman_ng_set()
If you're running Windows, you may encounter a problem with setting the JAVA_HOME environment variable, even if this is configured in your PC settings. For a quick fix, add a couple of lines at the start of your code:
import os
java_path = "C:/Program Files/Java/jdk1.7.0_11/bin/java.exe" # change the directory accordingly
os.environ['JAVAHOME'] = java_path
Two scripts provide two different interfaces for converting Winograd translations into the JSONL format required as input to the NLI models:
- convertToJSON.py uses the translation interface specified within the wnlu module to generate dev and test outputs directly.
- convertTextToJSON.py performs a similar function, but reads in from a text file, making it more suitable for experimentation with different translation strategies; it needs to be passed a path to the input text file, followed by a path to the output JSON (an example invocation follows the format listing below). The expected input format is:
<Winograd-ID>
<Premise>
<Hypothesis>
<GOLD label>
<Winograd-ID>
<Premise>
<Hypothesis>
<GOLD label>
...
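For example, an invocation might look like the following (both paths are placeholders; point them at your own files):
python ./convertTextToJSON.py /path/to/input.txt /path/to/output.jsonl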
If you want to create a version of the corpus that can be used as input to convertTextToJSON.py, try this:
import wnlu
loader = wnlu.WinogradLoader()
corpus = loader.get_train_set()
wnlu.SentenceVariants.create_intermediate(corpus, "/path/to/save.txt")
To see paraphrases of the Winograd train, dev, and test sets and the Rahman and Ng set, run the following from the command line:
python ./ParaphrasingStrategies.py
Similarly, to generate truncated versions of the Winograd sets, you can run:
python ./SentenceVariants.py
This will generate four text files: the train, dev, and test sets of the Winograd schemas, plus the entire Rahman and Ng set. These files can be fed into convertTextToJSON.py to generate the JSONL files that can then be fed into the trained NLI models.
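As a rough sketch of the full pipeline (the intermediate filename below is a placeholder; substitute whichever of the four text files SentenceVariants.py actually writes):
python ./SentenceVariants.py
python ./convertTextToJSON.py ./winograd_train_variants.txt ./winograd_train_variants.jsonl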