
Quantifiers and monotonicity in reasoning tasks

SARN: Inference

Quantifiers and monotonicity (and opposite adjectives) in reasoning tasks


python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt


# Finetuning
python -m sarn.train --output-dir models/bart-mq --log-dir logs/bart-mq facebook/bart-large-mnli data/training.csv
# Inference of two sequences (forwards)
python -m sarn.classify --model models/bart-mq "All dogs jumped over the fence." "Some small dogs jumped over the fence."
# ROC curve (SVG and PDF diagram)
python -m sarn.roc microsoft/deberta-large-mnli data/evaluation.csv
# Model accuracy on dataset
python -m sarn.accuracy models/deberta-mq data/evaluation.csv
# Dataset statistics
python -m sarn.stats data/training.csv
# Language Interpretability Tool
python -m sarn.lit \
  --models "facebook/bart-large-mnli" \
           "microsoft/deberta-large-mnli" \
           "./models/bart-mq" \
           "./models/deberta-mq" \
           "./models/bart-adj" \
           "./models/deberta-adj" \
  --datasets "./data/evaluation.csv" "./data/evaluation-adj.csv"

Any valid Hugging Face model (local or remote) that has been finetuned for sequence classification can be specified as the model, e.g., facebook/bart-large-mnli, microsoft/deberta-large-mnli, or a local path like models/bart-mq.


Download Models from CoLi Servers

# export COLI_USER=<your name>
scp -r ${COLI_USER:?}@last.cl.uni-heidelberg.de:/mnt/semproj/sem_proj20/proj1/models .

Check the datasets for typos in the labels

for i in ./data/*.csv; do
  python3 -m sarn.validate_datasets "$i"
done
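
The kind of check this step performs can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of sarn.validate_datasets: the function name find_invalid_labels is hypothetical, and it assumes that each CSV row is premise, hypothesis, label (label in the third column), that there is no header row, and that the valid labels are contradiction, neutral and entailment.

```python
import csv
import io

# Assumption: these three classes are the only valid labels.
VALID_LABELS = {"contradiction", "neutral", "entailment"}


def find_invalid_labels(rows, label_column=2):
    """Return (row_number, label) pairs for rows whose label is not a known class.

    Hypothetical helper; rows with too few columns are reported with an empty label.
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        label = row[label_column].strip() if len(row) > label_column else ""
        if label not in VALID_LABELS:
            problems.append((row_number, label))
    return problems


# Two made-up rows; the second one contains a typo in its label.
sample = io.StringIO(
    "All dogs bark.,Some dogs bark.,entailment\n"
    "No cat sleeps.,A cat sleeps.,contradicton\n"
)
problems = find_invalid_labels(csv.reader(sample))
print(problems)
```

To check a real file, the same function could be given csv.reader(open(path, newline="")) instead of the in-memory sample.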


See data/README.md for more information.

  • data/training.csv: Training dataset for quantifiers and monotonicity in reasoning tasks
  • data/evaluation.csv: Evaluation dataset for quantifiers and monotonicity in reasoning tasks
  • data/training-adj.csv: Training dataset for quantifiers and monotonicity in reasoning tasks with opposite adjectives
  • data/evaluation-adj.csv: Evaluation dataset for quantifiers and monotonicity in reasoning tasks with opposite adjectives

Creating the datasets

# data/training.csv
wget https://github.com/verypluming/MED/raw/master/MED.tsv
python -m sarn.convert.med
wget https://github.com/verypluming/HELP/raw/master/output_en/pmb_train_v1.0.tsv
python -m sarn.convert.help
cat data/med.csv data/help.csv > data/training.csv

# data/evaluation.csv
wget https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
python -m sarn.convert.fracas
wget -O diagnostic-full.tsv "https://www.dropbox.com/s/ju7d95ifb072q9f/diagnostic-full.tsv?dl=1"
python -m sarn.convert.superglue
cat data/fracas.csv data/superglue.csv > data/evaluation.csv

# data/training-adj.csv
wget https://github.com/verypluming/MED/raw/master/MED.tsv
python -m sarn.convert.med_adjectives
# manual step here (you may modify sentences where it makes sense):
# - label data/med_adjectives_1.csv by hand (third column)
# - label data/med_adjectives_2.csv by hand (third column)
# - remove fourth column in both files
wget https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
python -m sarn.convert.fracas_adjectives
cat data/med_adjectives_1.csv data/med_adjectives_2.csv data/fracas_adjectives.csv > data/training-adj.csv
python -m sarn.validate_datasets data/training-adj.csv

# data/evaluation-adj.csv
python -m sarn.convert.evaluation_adjectives
# manual step here (you may modify sentences where it makes sense):
# - label data/evaluation-adj.csv by hand (third column)
# - remove fourth column
python -m sarn.validate_datasets data/evaluation-adj.csv

Dataset statistics

Character length

Dataset avg median min max
Premises 48.26 41 5 478
Hypotheses 48.93 42 5 478
Premises 79.84 58 26 206
Hypotheses 61.57 50 26 186
Premises 48.93 44 14 212
Hypotheses 50.78 46 18 210
Premises 100.19 83 25 189
Hypotheses 86.62 69 35 189

Word length

Dataset avg median min max
Premises 9.98 9 2 83
Hypotheses 10.10 9 2 83
Premises 13.03 10 5 34
Hypotheses 10.14 9 5 30
Premises 8.89 8 3 29
Hypotheses 9.18 9 3 29
Premises 15.44 12 5 31
Hypotheses 13.47 11 5 31
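
As a rough illustration of how figures like the ones in the two tables above can be obtained, the following sketch computes average, median, minimum and maximum lengths for a list of sentences. It is not the code of sarn.stats: the function name length_stats is hypothetical, and counting words by splitting on whitespace is an assumption about how "word length" is defined.

```python
from statistics import mean, median


def length_stats(sentences, unit="char"):
    """Summarise sentence lengths in characters (unit="char") or words (unit="word").

    Assumption: a "word" is any whitespace-separated token.
    """
    if unit == "word":
        lengths = [len(sentence.split()) for sentence in sentences]
    else:
        lengths = [len(sentence) for sentence in sentences]
    return {
        "avg": round(mean(lengths), 2),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# The two example sentences used with sarn.classify earlier in this README.
premises = [
    "All dogs jumped over the fence.",
    "Some small dogs jumped over the fence.",
]
char_stats = length_stats(premises, "char")
word_stats = length_stats(premises, "word")
print(char_stats)
print(word_stats)
```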


Dataset total contradiction neutral entailment
data/training.csv 41'273 0 (0.00%) 20'699 (50.15%) 20'574 (49.85%)
data/evaluation.csv 118 15 (12.71%) 52 (44.07%) 51 (43.22%)
data/training-adj.csv 1'206 420 (34.83%) 749 (62.11%) 37 (3.07%)
data/evaluation-adj.csv 144 47 (32.64%) 84 (58.33%) 13 (9.03%)
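
The percentages in the table above are each label's share of the dataset total. A minimal sketch of that calculation follows, using the counts listed for data/evaluation.csv as input. The helper names label_distribution and format_count are hypothetical and not part of the sarn package.

```python
from collections import Counter

LABELS = ("contradiction", "neutral", "entailment")


def label_distribution(labels):
    """Map each label to (count, percentage of all rows)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (counts[label], 100 * counts[label] / total) for label in LABELS}


def format_count(n):
    """Format an integer with an apostrophe as thousands separator, e.g. 41'273."""
    return f"{n:,}".replace(",", "'")


# Rebuild a label list from the counts given for data/evaluation.csv.
labels = ["contradiction"] * 15 + ["neutral"] * 52 + ["entailment"] * 51
distribution = label_distribution(labels)
for label, (count, percent) in distribution.items():
    print(f"{label}: {format_count(count)} ({percent:.2f}%)")
```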


  • facebook/bart-large-mnli: Pretrained model of BART finetuned on MultiNLI
  • microsoft/deberta-large-mnli: Pretrained model of DeBERTa finetuned on MultiNLI
  • models/bart-mq: finetuned version of facebook/bart-large-mnli on data/training.csv
  • models/deberta-mq: finetuned version of microsoft/deberta-large-mnli on data/training.csv
  • models/bart-adj: finetuned version of models/bart-mq on data/training-adj.csv
  • models/deberta-adj: finetuned version of models/deberta-mq on data/training-adj.csv

Creating the models

# facebook/bart-large-mnli and microsoft/deberta-large-mnli will automatically
# be downloaded from huggingface.co when used

# models/bart-mq
python -m sarn.train --output-dir "models/bart-mq" --log-dir "logs/bart-mq" "facebook/bart-large-mnli" "data/training.csv"

# models/deberta-mq
python -m sarn.train --output-dir "models/deberta-mq" --log-dir "logs/deberta-mq" "microsoft/deberta-large-mnli" "data/training.csv"

# models/bart-adj
python -m sarn.train --output-dir "models/bart-adj" --log-dir "logs/bart-adj" "models/bart-mq" "data/training-adj.csv"

# models/deberta-adj
python -m sarn.train --output-dir "models/deberta-adj" --log-dir "logs/deberta-adj" "models/deberta-mq" "data/training-adj.csv"

Model statistics


Model data/evaluation.csv data/evaluation-adj.csv
facebook/bart-large-mnli 65.25% 40.97%
microsoft/deberta-large-mnli 71.19% 47.22%
models/bart-mq 57.63% 34.72%
models/deberta-mq 61.86% 34.72%
models/bart-adj 45.76% 58.33%
models/deberta-adj 42.37% 57.64%
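
Assuming the percentages above are plain classification accuracy (the share of examples whose predicted label equals the gold label), the metric can be sketched as follows. The function is a hypothetical stand-in, not the code of sarn.accuracy, and the labels in the example are made up.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up labels: three of the four predictions are correct.
gold = ["entailment", "neutral", "contradiction", "neutral"]
predicted = ["entailment", "neutral", "neutral", "neutral"]
score = accuracy(predicted, gold)
print(f"{score:.2%}")
```

Under this reading, 77 correct predictions out of the 118 examples in data/evaluation.csv would correspond to 65.25%.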

ROC curves

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. [...] AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). [...] AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.

Google Machine Learning Crash Course
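
The ranking interpretation of AUC in the quote can be made concrete with a small sketch that compares every positive example with every negative one. This is an illustration for the binary case only, with made-up scores. It is not the code of sarn.roc, and for a three-class task it would have to be applied per class (one class against the rest), which is an assumption about how the curves are produced.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up model scores: only the pair (0.4, 0.7) is ranked the wrong way round.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
auc = pairwise_auc(positives, negatives)
print(auc)
```

With three positives and three negatives there are nine pairs, and eight of them are ranked correctly, so the AUC here is 8/9. The quadratic loop is fine for illustration; larger datasets are usually handled with a sort-based method.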