/nlu-winograd

Code and data for the final project of NYU DS-GA 1012.

Primary language: Python. License: MIT.

nlu-winograd.py

Models and code for addressing the Winograd Schema Challenge with training data from the SNLI/MultiNLI corpora.

Acknowledgements

For convenience, this repository pre-packages some dependency code and data not created by the authors.

  • The dataset of Winograd Schemas at ./datasets/winograd/WSCollection.xml is taken from Ernest Davis's (NYU) website.
  • We package a modified implementation of the baseline NLI models from the Machine Learning for Language Group at NYU, which is stored in ./model.

Overview

General Environment Setup

First, install Python 3. Then clone this repository:

git clone https://github.com/sgbalogh/nlu-winograd
cd nlu-winograd

Install the prerequisites with:

make

Optionally, you can run the test suite with:

make test

Model Training Environment Setup

In order to run the TensorFlow NLI model implementations, some datasets need to be downloaded first.

Create a data directory in ./model with a nested winograd directory inside it, and create a logs directory in ./model as well:

mkdir -p ./model/data/winograd
mkdir -p ./model/logs
cd ./model/data

Then download and unzip SNLI, MNLI, and GloVe:

wget https://www.nyu.edu/projects/bowman/multinli/multinli_0.9.zip
wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip
wget http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip '*.zip'  # quote the glob so that unzip itself processes each archive
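
If `unzip` is not available, the extraction step can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `extract_all_zips` is hypothetical and not part of this repository, and it assumes the archives have already been downloaded into the target directory.

```python
import zipfile
from pathlib import Path


def extract_all_zips(directory):
    """Extract every .zip archive found directly inside `directory`.

    Each archive is unpacked into `directory` itself, one archive at a
    time. Returns the names of the extracted archives in sorted order.
    (Hypothetical helper for illustration, not part of this repository.)
    """
    directory = Path(directory)
    extracted = []
    for archive in sorted(directory.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(directory)
        extracted.append(archive.name)
    return extracted
```

For example, `extract_all_zips("./model/data")` would unpack the three archives downloaded above.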

We also need the Stanford Parser, which should be stored in ./apps:

cd ../..  # return to the repository root from ./model/data
mkdir -p ./apps
cd apps
wget https://nlp.stanford.edu/software/stanford-parser-full-2018-02-27.zip
unzip ./*.zip

Now you should be all set.

Loading Winograd Schema Dev/Test instances

The repository contains a copy of the XML document provided by Ernest Davis. The local copy is located at datasets/winograd/WSCollection.xml.

From a command line, open a Python 3 shell in the root directory of this repository.

import wnlu

## Initializing the loader automatically parses
## all of the examples from the XML document:
loader = wnlu.WinogradLoader()

## This loops through the train set instances and prints out
## the original premise content:
for instance in loader.get_train_set():
  print(instance.get_premise())

winograd_example = loader.get_train_set()[0]
print(winograd_example.get_premise())

## Get a list of the two possible translations of the
## schema (i.e., the two ways of replacing the pronoun):
possible_translations = winograd_example.get_candidate_translations()

## To just view the possible answers:
winograd_example.answers

## If we want to see the GOLD label, we can get the index
## of it within the answers list (above) using:
winograd_example.gold_answer_idx


## Load from the Rahman/Ng corpus instead of Winograd:
rahman_ng_set = loader.get_rahman_ng_set()
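
The candidate translations mentioned above come from replacing the ambiguous pronoun with each possible answer. The sketch below illustrates that idea on a simplified version of the well-known trophy/suitcase schema. It is not the `wnlu` implementation: the function `candidate_translations` and the three-part split of the sentence (text before the pronoun, the pronoun, text after it) are assumptions made for illustration.

```python
def candidate_translations(before, pronoun, after, answers):
    """Return (premise, hypotheses) for one schema.

    The premise is the original sentence; each hypothesis is the same
    sentence with the pronoun replaced by one candidate answer.
    All names here are hypothetical, not part of the wnlu module.
    """
    premise = " ".join([before, pronoun, after])
    hypotheses = [" ".join([before, answer, after]) for answer in answers]
    return premise, hypotheses


premise, hypotheses = candidate_translations(
    before="The trophy doesn't fit in the suitcase because",
    pronoun="it",
    after="is too big.",
    answers=["the trophy", "the suitcase"],
)
gold_answer_idx = 0  # "the trophy" is the correct referent in this example

print(premise)
for idx, hypothesis in enumerate(hypotheses):
    marker = "GOLD" if idx == gold_answer_idx else "    "
    print(marker, hypothesis)
```

Each (premise, hypothesis) pair can then be treated as an NLI example, with the gold index indicating which hypothesis follows from the premise.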

On Windows, you may encounter a problem with the JAVA_HOME environment variable even if it is configured in your system settings. As a quick fix, add the following lines at the start of your code:

import os

java_path = "C:/Program Files/Java/jdk1.7.0_11/bin/java.exe"  # change this to your own Java installation path
os.environ['JAVAHOME'] = java_path
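
A slightly more defensive variant of this quick fix avoids overwriting a value that is already set and only falls back to a hard-coded path when no `java` executable is found on the PATH. This is a sketch under assumptions: `resolve_java` is a hypothetical helper, not part of this repository, and it reuses the `JAVAHOME` variable name from the snippet above.

```python
import os
import shutil


def resolve_java(env, fallback_path):
    """Pick a Java executable path and record it under JAVAHOME in `env`.

    Order of preference: an existing JAVAHOME entry in `env`, a `java`
    binary found on the current PATH, then the hard-coded fallback.
    (Hypothetical helper for illustration, not part of this repository.)
    """
    java_path = env.get("JAVAHOME") or shutil.which("java") or fallback_path
    env["JAVAHOME"] = java_path
    return java_path
```

It could be called as `resolve_java(os.environ, "C:/Program Files/Java/jdk1.7.0_11/bin/java.exe")`, adjusting the fallback path to your own installation.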

Working With Winograd -> NLI Translation

Two scripts convert translated Winograd instances into the JSONL format that the NLI models expect as input. Each offers a different interface:

  • convertToJSON.py uses the translation interface specified within the wnlu module to generate dev and test outputs directly
  • convertTextToJSON.py performs a similar function but reads from a text file, which makes it more suitable for experimenting with different translation strategies. It takes two arguments: the path to the input text file, followed by the path to the output JSON file. The expected input format is:
<Winograd-ID>
<Premise>
<Hypothesis>
<GOLD label>

<Winograd-ID>
<Premise>
<Hypothesis>
<GOLD label>

...
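
To make the block format above concrete, here is a minimal sketch of reading it and emitting one JSON object per line. It is not the code in convertTextToJSON.py: the function names are hypothetical, the field names (`pairID`, `sentence1`, `sentence2`, `gold_label`) are borrowed from the SNLI/MultiNLI JSONL files on the assumption that the NLI models read the same fields, and the label strings in the sample are assumed values.

```python
import json
import re


def parse_blocks(text):
    """Parse blank-line-separated blocks of four lines each:
    Winograd ID, premise, hypothesis, gold label."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue  # tolerate empty input or stray blank blocks
        if len(lines) != 4:
            raise ValueError(f"expected 4 lines per block, got {len(lines)}")
        winograd_id, premise, hypothesis, gold = lines
        records.append({
            "pairID": winograd_id,     # assumed field name
            "sentence1": premise,      # assumed field name
            "sentence2": hypothesis,   # assumed field name
            "gold_label": gold,        # assumed field name
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


# Illustrative input; the IDs and label strings are assumptions.
sample = """\
1
The trophy doesn't fit in the suitcase because it is too big.
The trophy doesn't fit in the suitcase because the trophy is too big.
entailment

1
The trophy doesn't fit in the suitcase because it is too big.
The trophy doesn't fit in the suitcase because the suitcase is too big.
contradiction
"""

records = parse_blocks(sample)
print(to_jsonl(records), end="")
```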

To create a text file in this format from the loaded corpus, suitable as input to convertTextToJSON.py, try this:

import wnlu
loader = wnlu.WinogradLoader()
corpus = loader.get_train_set()
wnlu.SentenceVariants.create_intermediate(corpus, "/path/to/save.txt")

Generating Paraphrases

To see paraphrases of the Winograd train, dev, and test sets and of the Rahman and Ng set, run the following command:

python ./ParaphrasingStrategies.py

Similarly, to generate truncated versions of the Winograd sets you can run:

python ./SentenceVariants.py

This generates four text files: the train, dev, and test sets of the Winograd schemas, and the full Rahman and Ng set. Pass these files to convertTextToJSON.py to produce the JSON files used as input to the trained NLI models.