NeuBAROCO

Datasets and scripts for the ACL 2024 Findings paper: "Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the NeuBAROCO Dataset".

Contents

  • Datasets
  • Running scripts
  • Citation

Datasets

NLI (Natural Language Inference) Task Format

File

data/NeuBAROCO_NLI.tsv

Description

ID: problem ID
ORIGINAL_ID: original problem ID (internal use)
premises_ja: two premises in Japanese
hypothesis_ja: one hypothesis in Japanese
premises_en: two premises in English
hypothesis_en: one hypothesis in English
gold: correct answer, i.e., the relationship of the hypothesis to the premises (entailment, contradiction, neutral)
mood: the form of each premise and the conclusion (three letters composed of A, E, I, and O)
inference-type: type of logical inference (syllogism, propositional)
content-type: classification based on belief congruency (symbolic, congruent, incongruent)
conversion: whether the problem is associated with a conversion error (yes, no)
atmosphere: whether the problem is associated with the atmosphere effect (yes, no)
  • See our paper for details on content-type, inference-type, conversion, and atmosphere.
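
For quick orientation, here is a minimal sketch of loading and inspecting the NLI file with pandas (assuming pandas is available in your environment; it is not necessarily listed in requirements.txt). The column names are the ones in the table above.

import pandas as pd

# The dataset is tab-separated; each row pairs two premises with one hypothesis.
df = pd.read_csv("data/NeuBAROCO_NLI.tsv", sep="\t")

row = df.iloc[0]
print(row["premises_en"])
print(row["hypothesis_en"])
print(row["gold"])  # entailment, contradiction, or neutral

# Label and problem-type distributions.
print(df["gold"].value_counts())
print(df["content-type"].value_counts())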

Multiple-Choice Task Format

File

data/NeuBAROCO_MC.tsv

Description

ID: problem ID
premises_ja: two premises in Japanese
hypothesis_ja_1: hypothesis 1 in Japanese
hypothesis_ja_2: hypothesis 2 in Japanese
hypothesis_ja_3: hypothesis 3 in Japanese
hypothesis_ja_4: hypothesis 4 in Japanese
hypothesis_ja_5: hypothesis 5 in Japanese
premises_en: two premises in English
hypothesis_en_1: hypothesis 1 in English
hypothesis_en_2: hypothesis 2 in English
hypothesis_en_3: hypothesis 3 in English
hypothesis_en_4: hypothesis 4 in English
hypothesis_en_5: hypothesis 5 in English
gold: correct answer (1-5)
content-type: classification based on belief congruency (symbolic, contentual, congruent, incongruent)
mood: the form of each premise and the conclusion (three letters composed of A, E, I, and O)
figure: code for the order in which each term appears (1-4)
  • NOTE: One of the five hypotheses is "none of them".
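
A minimal sketch of assembling one multiple-choice problem from this file (assuming pandas is available; the prompt layout is illustrative and not the one used by the experiment scripts):

import pandas as pd

df = pd.read_csv("data/NeuBAROCO_MC.tsv", sep="\t")
row = df.iloc[0]

# Collect the five candidate conclusions; one of them is "none of them".
options = [row[f"hypothesis_en_{i}"] for i in range(1, 6)]

# Illustrative layout only, not the prompt used in scripts.experiments.acl2024.
prompt = row["premises_en"] + "\n" + "\n".join(
    f"{i}. {opt}" for i, opt in enumerate(options, start=1)
)
print(prompt)
print("gold:", row["gold"])  # an integer from 1 to 5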

Data used in the NALOMA 2023 experiments

File

data/naloma2023/NeuBAROCO_NALOMA.tsv

Running scripts

Setup

git clone https://github.com/kmineshima/NeuBAROCO
cd NeuBAROCO
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Set API keys

export OPENAI_API_KEY=<YOUR_KEY>  # For OpenAI API
export HUGGINGFACE_API_KEY=<YOUR_KEY>  # For HuggingFace Inference Endpoints API

Evaluation

ACL2024 experiments

Basic usage

python -m scripts.experiments.acl2024 --help

NLI Task

Example:

python -m scripts.experiments.acl2024 nli --test_n=all --lang en ja --model gpt-3.5-turbo-1106 gpt-4-0613

Multiple-Choice Task

Example:

python -m scripts.experiments.acl2024 choice5 --test_n=all --lang en ja --model gpt-3.5-turbo-1106 gpt-4-0613
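
For orientation, the sketch below shows how a single NLI item could be sent to the OpenAI chat API by hand. It is not the repository's evaluation script (that is scripts.experiments.acl2024), and the prompt wording is an illustrative assumption rather than the one used in the paper.

import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
row = pd.read_csv("data/NeuBAROCO_NLI.tsv", sep="\t").iloc[0]

# Illustrative prompt; the actual prompts are defined by the experiment scripts.
prompt = (
    "Premises: " + row["premises_en"] + "\n"
    "Hypothesis: " + row["hypothesis_en"] + "\n"
    "Answer with one word: entailment, contradiction, or neutral."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
print("gold:", row["gold"])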

Citation

If you use this data in any published research, please cite the following:

@inproceedings{ozeki-etal-2024-exploring,
    title = "Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the {N}eu{BAROCO} Dataset",
    author = "Ozeki, Kentaro  and
      Ando, Risako  and
      Morishita, Takanobu  and
      Abe, Hirohiko  and
      Mineshima, Koji  and
      Okada, Mitsuhiro",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.950",
    pages = "16063--16077",
}