NeuBAROCO

Datasets and scripts for the ACL 2024 Findings paper: "Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the NeuBAROCO Dataset".

Contents

  • Datasets
  • Running scripts
  • Citation

Datasets

NLI (Natural Language Inference) Task Format

File

data/NeuBAROCO_NLI.tsv

Description

ID: problem ID
ORIGINAL_ID: original problem ID (internal use)
premises_ja: two premises in Japanese
hypothesis_ja: one hypothesis in Japanese
premises_en: two premises in English
hypothesis_en: one hypothesis in English
gold: correct answer, i.e., the relationship of the hypothesis to the premises (entailment, contradiction, neutral)
mood: the form of each premise and the conclusion (three letters composed of A, E, I, and O)
inference-type: type of logical inference (syllogism, propositional)
content-type: classification based on belief congruency (symbolic, congruent, incongruent)
conversion: whether the problem is associated with a conversion error (yes, no)
atmosphere: whether the problem is associated with the atmosphere effect (yes, no)
  • See our paper for details on content-type, inference-type, conversion, and atmosphere.
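
For quick orientation, here is a minimal sketch of loading and inspecting the NLI file with pandas (assuming pandas is available in your environment; it is not necessarily listed in requirements.txt). The column names are the ones in the table above.

import pandas as pd

# The dataset is tab-separated; each row pairs two premises with one hypothesis.
df = pd.read_csv("data/NeuBAROCO_NLI.tsv", sep="\t")

row = df.iloc[0]
print(row["premises_en"])
print(row["hypothesis_en"])
print(row["gold"])  # entailment, contradiction, or neutral

# Label and problem-type distributions.
print(df["gold"].value_counts())
print(df["content-type"].value_counts())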

Multiple-Choice Task Format

File

data/NeuBAROCO_MC.tsv

Description

ID: problem ID
premises_ja: two premises in Japanese
hypothesis_ja_1: hypothesis 1 in Japanese
hypothesis_ja_2: hypothesis 2 in Japanese
hypothesis_ja_3: hypothesis 3 in Japanese
hypothesis_ja_4: hypothesis 4 in Japanese
hypothesis_ja_5: hypothesis 5 in Japanese
premises_en: two premises in English
hypothesis_en_1: hypothesis 1 in English
hypothesis_en_2: hypothesis 2 in English
hypothesis_en_3: hypothesis 3 in English
hypothesis_en_4: hypothesis 4 in English
hypothesis_en_5: hypothesis 5 in English
gold: correct answer (1-5)
content-type: classification based on belief congruency (symbolic, contentual, congruent, incongruent)
mood: the form of each premise and the conclusion (three letters composed of A, E, I, and O)
figure: code for the order in which each term appears (1-4)
  • NOTE: One of the five hypotheses is "none of them".
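
A minimal sketch of assembling one multiple-choice problem from this file (assuming pandas is available; the prompt layout is illustrative and not the one used by the experiment scripts):

import pandas as pd

df = pd.read_csv("data/NeuBAROCO_MC.tsv", sep="\t")
row = df.iloc[0]

# Collect the five candidate conclusions; one of them is "none of them".
options = [row[f"hypothesis_en_{i}"] for i in range(1, 6)]

# Illustrative layout only, not the prompt used in scripts.experiments.acl2024.
prompt = row["premises_en"] + "\n" + "\n".join(
    f"{i}. {opt}" for i, opt in enumerate(options, start=1)
)
print(prompt)
print("gold:", row["gold"])  # an integer from 1 to 5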

Data used in the NALOMA 2023 experiments

File

data/naloma2023/NeuBAROCO_NALOMA.tsv

Running scripts

Setup

git clone https://github.com/kmineshima/NeuBAROCO
cd NeuBAROCO
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Set API keys

export OPENAI_API_KEY=<YOUR_KEY>  # For OpenAI API
export HUGGINGFACE_API_KEY=<YOUR_KEY>  # For HuggingFace Inference Endpoints API

Evaluation

ACL2024 experiments

Basic usage

python -m scripts.experiments.acl2024 --help

NLI Task

Example:

python -m scripts.experiments.acl2024 nli --test_n=all --lang en ja --model gpt-3.5-turbo-1106 gpt-4-0613

Multiple-Choice Task

Example:

python -m scripts.experiments.acl2024 choice5 --test_n=all --lang en ja --model gpt-3.5-turbo-1106 gpt-4-0613
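
For orientation, the sketch below shows how a single NLI item could be sent to the OpenAI chat API by hand. It is not the repository's evaluation script (that is scripts.experiments.acl2024), and the prompt wording is an illustrative assumption rather than the one used in the paper.

import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
row = pd.read_csv("data/NeuBAROCO_NLI.tsv", sep="\t").iloc[0]

# Illustrative prompt; the actual prompts are defined by the experiment scripts.
prompt = (
    "Premises: " + row["premises_en"] + "\n"
    "Hypothesis: " + row["hypothesis_en"] + "\n"
    "Answer with one word: entailment, contradiction, or neutral."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
print("gold:", row["gold"])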

Citation

If you use this data in any published research, please cite the following:

@inproceedings{ozeki-etal-2024-exploring,
    title = "Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the {N}eu{BAROCO} Dataset",
    author = "Ozeki, Kentaro  and
      Ando, Risako  and
      Morishita, Takanobu  and
      Abe, Hirohiko  and
      Mineshima, Koji  and
      Okada, Mitsuhiro",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.950",
    pages = "16063--16077",
}