This repository contains the code used for data collection and experimentation.
The benchmark data and the answers given by each model can be found under benchmark/.
The jsonl files contain one JSON object per line, in the following format:
{
  "label_nr": 1|2,                  // The option corresponding to the correct answer.
  "label_name": "de dicto"|"de re", // The class name of the correct answer.
  "messages": ["..."],              // The messages sent to the LLM (one for direct prompting, two for chain-of-thought prompting).
  "entity": "...",                  // The name of the 'main entity'.
  "property": "...",                // The property ascribed to the definite description.
  "prompt_style": "...",            // A key into PROMPT_DICTIONARY in create_fragment.py.

  // The fields above make up the benchmark itself; the fields below record
  // the answers given by the models.
  "responses": ["..."],             // The replies given by the LLM in response to the messages.
  "results": {
    "choice": 1|2,                  // The option chosen by the model.
    "explanation": "..."            // The model's explanation for why it chose that option.
  }
}
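
For illustration, a minimal sketch of how one of these files could be loaded and scored against the labels is shown below. The path benchmark/example.jsonl is a placeholder, not an actual file in this repository.

    import json

    # Hypothetical path: substitute an actual .jsonl file under benchmark/.
    path = "benchmark/example.jsonl"

    correct = 0
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            total += 1
            # A model answer is correct when its chosen option matches the label.
            if record["results"]["choice"] == record["label_nr"]:
                correct += 1

    if total:
        print(f"Accuracy: {correct}/{total} ({correct / total:.1%})")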