🚀 Get the dataset now! `cladder-v1.zip`
- Zip file size: 6.5 MB
- Version: v1
- Date: 2023-05-25
- Hugging Face dataset: https://huggingface.co/datasets/causalnlp/CLadder
This repo contains the full CLadder dataset (and code) for evaluating (formal) causal reasoning in language models. The dataset asks yes/no questions in natural language that generally require statistical and causal inference to answer.
Although there are several different variants, the main dataset (including questions from all variants) is `cladder-v1-balanced.json`, so that is the recommended file for most purposes.
"CLadder: Assessing Causal Reasoning in Language Models" by Zhijing Jin*, Yuen Chen*, Felix Leeb*, Luigi Gresele*, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez, Max Kleiman-Weiner, Mrinmaya Sachan, Bernhard Schölkopf.
Citation:
@inproceedings{jin2023cladder,
    author = {Zhijing Jin and Yuen Chen and Felix Leeb and Luigi Gresele and Ojasv Kamal and Zhiheng Lyu and Kevin Blin and Fernando Gonzalez and Max Kleiman-Weiner and Mrinmaya Sachan and Bernhard Sch{\"{o}}lkopf},
    title = "{CL}adder: {A}ssessing Causal Reasoning in Language Models",
    year = "2023",
    booktitle = "NeurIPS",
    url = "https://openreview.net/forum?id=e2wtjx0Yqu",
}
You can download our data either from Hugging Face (https://huggingface.co/datasets/causalnlp/CLadder) or from `cladder-v1.zip` in this repo.
In our data, each sample represents a single question. Each question has the following fields:
- `question_id`: a unique (within the file) identifier for the question
- `desc_id`: a more descriptive identifier for the question (generally not needed)
- `given_info`: natural-language supplementary information that should be given to the model to answer the question
- `question`: the question itself, in natural language
- `answer`: the answer to the question, one of {yes, no}
- `reasoning`: a step-by-step explanation of the causal reasoning used to answer the question
- `meta`: metadata about the question, including the following fields:
  - `query_type`: the type of question, one of {ATE, marginal, correlation, ETT, NDE, NIE, etc.}
  - `rung`: the rung of the ladder of causation that the question corresponds to
  - `story_id`: the id of the story used to verbalize the question
  - `graph_id`: the id of the causal graph structure used to verbalize the question
  - `model_id`: the id of the underlying model used to generate the question (corresponding to a model in `cladder-v1-meta-models.json`)
  - `groundtruth`: the ground-truth value of what the question is asking about
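For quick inspection, here is a minimal sketch of loading the main dataset with Python, under the assumption that the JSON file is a flat list of question objects with the fields above:

```python
import json

# Assumption: the file is a flat JSON list in which each entry is one
# question object with the fields listed above.
with open("cladder-v1-balanced.json") as f:
    questions = json.load(f)

q = questions[0]
print(q["question_id"], q["meta"]["query_type"], q["meta"]["rung"])
print(q["given_info"])
print(q["question"], "->", q["answer"])
```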
When evaluating a language model, it is recommended that the prompt include 3 components:
- The `background` field of the model corresponding to the question (found in `cladder-v1-meta-models.json` using the `model_id` field of the question's metadata).
- The `given_info` field of the question.
- The `question` field of the question.
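As a sketch, the three components could be concatenated like this. The exact layout of `cladder-v1-meta-models.json` is assumed here to be loadable as a mapping from `model_id` to a record with a `background` field; check it against the file itself:

```python
import json

with open("cladder-v1-balanced.json") as f:
    questions = json.load(f)
with open("cladder-v1-meta-models.json") as f:
    models = json.load(f)  # assumption: a mapping from model_id to model record

q = questions[0]
background = models[q["meta"]["model_id"]]["background"]
prompt = f"{background} {q['given_info']} {q['question']}"
print(prompt)
```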
For example, the prompt corresponding to question 16825 (which asks about the average treatment effect in a simple instrumental-variable setting) in `cladder-v1-balanced.json` could be:
> Imagine a self-contained, hypothetical world with only the following conditions, and without any unmentioned factors or causal relationships: Unobserved confounders has a direct effect on education level and salary. Proximity to a college has a direct effect on education level. Education level has a direct effect on salary. Unobserved confounders is unobserved.
>
> For people living far from a college, the probability of high salary is 35%. For people living close to a college, the probability of high salary is 53%. For people living far from a college, the probability of college degree or higher is 40%. For people living close to a college, the probability of college degree or higher is 73%.
>
> Will college degree or higher decrease the chance of high salary?
Here the correct answer is `no`. The associated reasoning steps, found in the `reasoning` field, are:
Step 0: Let V2 = proximity to a college; V1 = unobserved confounders; X = education level; Y = salary.
Step 1: V1->X,V2->X,V1->Y,X->Y
Step 2: E[Y | do(X = 1)] - E[Y | do(X = 0)]
Step 3: [P(Y=1|V2=1)-P(Y=1|V2=0)]/[P(X=1|V2=1)-P(X=1|V2=0)]
Step 4: P(Y=1 | V2=0) = 0.35; P(Y=1 | V2=1) = 0.53; P(X=1 | V2=0) = 0.40; P(X=1 | V2=1) = 0.73
Step 5: (0.53 - 0.35) / (0.73 - 0.40) = 0.55
Solution: 0.55 > 0
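Step 3 is the standard Wald (instrumental-variable) estimator, and the arithmetic in Steps 4-5 can be checked directly:

```python
# Wald IV estimate from Steps 3-5: the effect of X on Y is the ratio of the
# instrument's effect on the outcome to its effect on the treatment.
p_y = {0: 0.35, 1: 0.53}  # P(Y=1 | V2=v)
p_x = {0: 0.40, 1: 0.73}  # P(X=1 | V2=v)

ate = (p_y[1] - p_y[0]) / (p_x[1] - p_x[0])
print(round(ate, 2))  # 0.55, which is > 0, so the answer is "no"
```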
Note that in addition to the `background` field, the model information found in `cladder-v1-meta-models.json` contains sufficient information to fully reconstruct the underlying causal model used to generate this question (and 59 others).
Here are some basic statistics for the main dataset (`cladder-v1-balanced.json`):
- Number of questions: 10,112
- Answers: {"yes": 5,056, "no": 5,056}
Query Types:
| Query Type | Rung | Code | Number | Percent |
|---|---|---|---|---|
| Correlation | 1 | `correlation` | 1422 | 14.1% |
| Marginal Distribution | 1 | `marginal` | 1580 | 15.6% |
| Explaining Away Effect | 1 | `exp_away` | 158 | 1.6% |
| Average Treatment Effect | 2 | `ate` | 1422 | 14.1% |
| Backdoor Adjustment Set | 2 | `backadj` | 1580 | 15.6% |
| Collider Bias | 2 | `collider_bias` | 158 | 1.6% |
| Effect of the Treatment on the Treated | 3 | `ett` | 1264 | 12.5% |
| Natural Direct Effect | 3 | `nde` | 316 | 3.1% |
| Natural Indirect Effect | 3 | `nie` | 790 | 7.8% |
| Counterfactual (deterministic) | 3 | `det-counterfactual` | 1422 | 14.1% |
Graph Types:
| Graph Type | Number | Percent |
|---|---|---|
| IV | 790 | 7.8% |
| arrowhead | 1264 | 12.5% |
| chain | 1106 | 10.9% |
| collision | 632 | 6.2% |
| confounding | 948 | 9.4% |
| diamond | 1106 | 10.9% |
| diamondcut | 948 | 9.4% |
| fork | 948 | 9.4% |
| frontdoor | 1106 | 10.9% |
| mediation | 1264 | 12.5% |
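These counts can be reproduced from the metadata fields; a minimal sketch, assuming the flat-list layout described above and that `graph_id` holds the graph-type names in the table:

```python
import json
from collections import Counter

with open("cladder-v1-balanced.json") as f:
    questions = json.load(f)

print(Counter(q["meta"]["query_type"] for q in questions))
print(Counter(q["meta"]["graph_id"] for q in questions))
```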
If you want to dig a little deeper into how well language models perform causal reasoning, we also include a few variants of the dataset (each containing about 10k questions; the balanced dataset is made up of an even mix of these variants):
- `cladder-v1-aggregate.json`: a combination of all the variants below, but where each story has approximately the same number of questions (100-200).
- `cladder-v1-q-easy.json`: questions that are easy to answer (i.e., the causal mechanisms generally conform to what you would expect).
- `cladder-v1-q-hard.json`: the structure of the causal graph remains unchanged, but the strengths of the causal mechanisms are generally counterintuitive.
- `cladder-v1-q-commonsense.json`: an even mix of easy and hard questions.
- `cladder-v1-q-anticommonsense.json`: for each causal graph, we replace one of the variables (either treatment or outcome) with a randomly selected one that common sense would say is not related to the other variable at all.
- `cladder-v1-q-nonsense.json`: the graph structure remains unchanged, but all variables are replaced with randomly generated 4-letter words instead of semantically meaningful concepts.
To use the code in this repo, first clone it:
`git clone https://github.com/causalNLP/causalbenchmark`
`cd causalbenchmark`
Then, install the dependencies:
`pip install -r requirements.txt`
Finally, install the package:
`pip install -e .`
Check to make sure everything is set up correctly by running the unit tests:
`pytest`
Generate demo data using `fig generate demo`. Check out the corresponding config file here, and the generation script in `generator.py` (the function `generate_and_store`).
Check the `run_*.py` files in the `eval/` folder to see how to run different LLMs in inference mode on our data.
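Those scripts handle specific model APIs; as a model-agnostic sketch, scoring a model's yes/no answers could look like the following, where `ask_model` is a hypothetical placeholder for whichever LLM client you use and the prompt follows the 3-part recommendation above:

```python
def build_prompt(q, models):
    # Recommended 3-part prompt: background + given_info + question.
    background = models[q["meta"]["model_id"]]["background"]
    return f"{background} {q['given_info']} {q['question']}"

def accuracy(questions, models, ask_model):
    # Fraction of questions whose gold yes/no answer matches the model's
    # reply; `ask_model` is a hypothetical callable: prompt -> reply string.
    hits = sum(
        ask_model(build_prompt(q, models)).strip().lower() == q["answer"]
        for q in questions
    )
    return hits / len(questions)
```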
We saved a copy of all model output files, which you can access here.
Thanks again for your interest in our work! Feel free to open a GitHub issue if you have any questions.