This repository contains the HANS (Heuristic Analysis for NLI Systems) dataset.
The file heuristics_evaluation_set.txt contains the HANS evaluation set introduced in our paper, Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. This file is formatted similarly to the MNLI release, so if your system is trained on MNLI you may be able to feed this file directly into your system. Otherwise, you may need to reformat the data to fit your system's input format.
The fields in this file are:
- gold_label: The correct label for this sentence pair (either entailment or non-entailment)
- sentence1_binary_parse: A binary parse of the premise, generated using a template based on the Stanford PCFG; this is necessary as input for some tree-based models.
- sentence2_binary_parse: A binary parse of the hypothesis, generated using a template based on the Stanford PCFG; this is necessary as input for some tree-based models.
- sentence1_parse: A parse of the premise, generated using a template based on the Stanford PCFG
- sentence2_parse: A parse of the hypothesis, generated using a template based on the Stanford PCFG
- sentence1: The premise
- sentence2: The hypothesis
- pairID: A unique identifier for this sentence pair
- heuristic: The heuristic that this example is targeting (lexical_overlap, subsequence, or constituent)
- subcase: The subcase of the heuristic that is being targeted; each heuristic has 10 subcases, described in the appendix to the paper
- template: The specific template that was used to generate this pair (most of the subcases have multiple templates; e.g., for subcases depending on relative clauses, there might be one template for relative clauses modifying the subject, and another for relative clauses modifying the direct object). This template ID corresponds to the ID in templates.py.
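As a rough illustration of reading these fields, here is a minimal sketch. It assumes the file is tab-separated with a header row naming the fields above (as in the MNLI release); check this against your copy of the file. The function name read_hans_tsv and the sample row are hypothetical and not part of this repository.

```python
import csv
import io

def read_hans_tsv(handle):
    """Yield one dict per example, keyed by the column names in the header row.

    Assumption: tab-separated columns with a header line.
    """
    # QUOTE_NONE treats quotation marks inside sentences as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row

# Tiny in-memory stand-in for the real file (made-up row, subset of the columns).
sample = io.StringIO(
    "gold_label\tsentence1\tsentence2\tpairID\theuristic\n"
    "non-entailment\tThe doctor saw the lawyer .\tThe lawyer saw the doctor .\tex0\tlexical_overlap\n"
)
examples = list(read_hans_tsv(sample))
print(examples[0]["gold_label"], examples[0]["heuristic"])
```

With the real data, the handle would come from something like open("heuristics_evaluation_set.txt", encoding="utf-8", newline="").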
The file heuristics_train_set.txt contains the set of HANS-like examples that were used for the data augmentation experiments in Section 7 of the HANS paper. This file is set up exactly like heuristics_evaluation_set.txt (i.e., it also contains 1000 examples from each of the 30 HANS subcases), but none of the specific examples that appear in heuristics_evaluation_set.txt appear in heuristics_train_set.txt, so that heuristics_train_set.txt can be used for data augmentation during training while still preserving the validity of heuristics_evaluation_set.txt as an evaluation set (e.g., in the paper we trained models on the union of heuristics_train_set.txt and the MNLI training set, and then evaluated on heuristics_evaluation_set.txt).
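If you want to confirm the separation between the two files yourself, a check along the following lines may help. This is only a sketch: it assumes each example has already been loaded into a dict with sentence1 and sentence2 keys, and the helper name shared_pairs and the sample sentences are made up for illustration.

```python
def shared_pairs(train_examples, eval_examples):
    """Return the (premise, hypothesis) pairs that occur in both collections."""
    train = {(ex["sentence1"], ex["sentence2"]) for ex in train_examples}
    evals = {(ex["sentence1"], ex["sentence2"]) for ex in eval_examples}
    return train & evals

# Made-up examples standing in for rows loaded from the two files.
train_examples = [
    {"sentence1": "The judge saw the actor .", "sentence2": "The actor saw the judge ."},
]
eval_examples = [
    {"sentence1": "The doctor saw the lawyer .", "sentence2": "The lawyer saw the doctor ."},
]
overlap = shared_pairs(train_examples, eval_examples)
print(len(overlap))  # an empty overlap means no pair appears in both sets
```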
The training set and evaluation set are also both included as JSON Lines files, with the .jsonl extension.
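A JSON Lines file holds one JSON object per line, so it can be read with the standard json module. The sketch below assumes the object keys mirror the field names used in the text files (gold_label, pairID, and so on); that is an assumption worth verifying against the actual files. The helper name read_jsonl and the sample record are hypothetical.

```python
import json

def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# Made-up record; the key names are assumed to match the text-file fields.
sample_lines = [
    '{"gold_label": "entailment", "sentence1": "The doctor slept .", '
    '"sentence2": "The doctor slept .", "pairID": "ex0"}\n',
    "\n",
]
records = read_jsonl(sample_lines)
print(records[0]["pairID"])
```

For a real file, pass an open file handle in place of sample_lines, since iterating over a handle yields one line at a time.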
We provide a script for evaluating a model's predictions. These predictions must be formatted in a text file with the following properties:
- The first line should be "pairID,gold_label"
- The rest of the lines should contain the pairID for a premise/hypothesis pair, followed by a comma, followed by the model's prediction for that pair (either entailment or non-entailment; for this purpose, you will need to change both contradiction and neutral into non-entailment).
- This file should have 30,001 lines: 1 line for the header, plus 30,000 more lines for the 30,000 examples in HANS
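The format described above can be produced with a few lines of Python. The following is a minimal sketch, assuming your model emits three-way MNLI labels as strings; the names HANS_LABEL and write_predictions and the sample pair IDs are made up for illustration.

```python
import io

# Collapse three-way MNLI labels into the two labels HANS expects.
HANS_LABEL = {
    "entailment": "entailment",
    "contradiction": "non-entailment",
    "neutral": "non-entailment",
    "non-entailment": "non-entailment",
}

def write_predictions(handle, predictions):
    """Write a header line, then one "pairID,label" line per prediction."""
    handle.write("pairID,gold_label\n")
    for pair_id, label in predictions:
        # A KeyError here signals a label outside the expected set.
        handle.write(f"{pair_id},{HANS_LABEL[label]}\n")

# Made-up pair IDs and model outputs.
buffer = io.StringIO()
write_predictions(buffer, [("ex0", "contradiction"), ("ex1", "entailment"), ("ex2", "neutral")])
print(buffer.getvalue())
```

To write to disk, pass a handle from open("my_preds.txt", "w") instead of the in-memory buffer (my_preds.txt is a placeholder name).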
There are several example files provided here: bert_preds.txt, decomp_attn_heur_preds.txt, spinn_preds_heur.txt, and esim_heur_preds.txt.
To evaluate a file formatted in this way, simply run:
python evaluate_heur_output.py FILENAME
This will give you results broken down at three levels of granularity.
- First, it will give results for the 3 heuristics, showing for each heuristic the model's accuracy on examples where the correct label is entailment and its accuracy on examples where the correct label is non-entailment.
- Second, it will give accuracies for all 30 subcases of the heuristics (e.g. subject/object swap, NP/S, etc.)
- Finally, it will give accuracies for each template
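To make the first level of the breakdown concrete, here is an illustrative sketch of computing accuracy per heuristic and gold label. It is not the code in evaluate_heur_output.py; the function accuracy_by_group and the sample data are hypothetical, and it assumes examples are dicts with pairID, gold_label, and heuristic keys.

```python
from collections import defaultdict

def accuracy_by_group(examples, predictions, key):
    """Map (value of `key`, gold label) to accuracy over the examples in that group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        group = (ex[key], ex["gold_label"])
        total[group] += 1
        # A missing prediction counts as incorrect.
        if predictions.get(ex["pairID"]) == ex["gold_label"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}

# Made-up examples and predictions.
examples = [
    {"pairID": "ex0", "gold_label": "entailment", "heuristic": "lexical_overlap"},
    {"pairID": "ex1", "gold_label": "non-entailment", "heuristic": "lexical_overlap"},
]
predictions = {"ex0": "entailment", "ex1": "entailment"}
results = accuracy_by_group(examples, predictions, "heuristic")
for group, accuracy in sorted(results.items()):
    print(group, accuracy)
```

Passing "subcase" or "template" as the key gives the analogous grouping at the other two levels, provided those fields are present in each example.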
The file mnli_contradicting_examples contains a list of the examples in the MNLI training set that contradict the heuristics targeted by HANS. For the scripts that we used to find these examples, see the folder heuristic_finder_scripts.
This repository is licensed under an MIT License.
If you use data from this repository, please cite our paper (BibTeX here).