
KITMUS

This repository contains the KITMUS test suite dataset and the code used to generate it, as described in the ACL 2023 paper The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources.

If you use the dataset or code in your research, please consider citing the paper.

Content

This repository contains:

  • The generated KITMUS test suite dataset (kitmus/)
  • The code to generate the dataset (generate.py, texts.py, utils.py)
  • The templates and resources to generate the KITMUS test suite dataset (resources/)
  • The train- and test-set predictions from the experiments in the paper (predictions/)
  • The code to evaluate predictions against gold annotations (evaluate.py, utils.py)

Setup

The code runs on Python 3.8. Required packages can be installed with pip install -r requirements.txt.
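For example, one way to set up an environment (assuming a Unix-like shell; the virtual environment is optional):

python3.8 -m venv venv
source venv/bin/activate
pip install -r requirements.txt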

Usage

Main scripts:

  • generate.py
  • evaluate.py

To learn more about a script and its parameters, run python <SCRIPT> -h. If you run into problems with the scripts, please open an issue.

Generating the KITMUS test dataset

To (re-)generate the KITMUS dataset with the default hyperparameters used in the experiments described in the paper, run:

python generate.py

This creates a folder kitmus/, which takes up about 4 GB of disk space in total.

Evaluating a model prediction

To evaluate a jsonlines prediction file (as output by, e.g., C2F or BERT4Coref) or a tsv prediction file (as output by, e.g., PeTra or GREP), run:

python evaluate.py <PATH-TO-GOLD-CONLL-FILE> <PATH-TO-PREDICTION-FILE>

Prediction files for the experiments featured in the paper can be found in predictions/. For a more detailed explanation of the evaluation metrics, see Section 5.3 (Evaluation) of the paper.
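As a purely illustrative invocation (the paths below are hypothetical placeholders; substitute your actual gold and prediction files):

python evaluate.py gold/test.conll predictions/model/test.jsonlines  # hypothetical paths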

Generating a custom dataset

The easiest way to generate a custom dataset is to point generate.py at an alternative resources directory with the command-line argument --resources_dir (an example invocation follows the file listing below). A valid resources directory should have the following file structure:

<RESOURCES-DIR>/
├── locations.csv
├── names.csv
├── noise
├── occupations
│   ├── charfict_charfict.csv
│   ├── charfict_real.csv
│   ├── charfict_wordfict.csv
│   ├── real_charfict.csv
│   ├── real_real.csv
│   └── real_wordfict.csv
├── pronouns.json
├── templates
│   ├── background_knowledge_sentence.txt
│   ├── entity_mention_templates.json
│   ├── entspec_knowledge_sentence.txt
│   ├── meet_sentence.txt
│   └── pronoun_sentence.txt
└── vocab.json

The directory <RESOURCES-DIR>/noise/ is not necessary for generating the background-train-no-noise variant. Similarly, only <RESOURCES-DIR>/occupations/real_real.csv is needed for the background-train-* variants. Take a look at the files provided in resources/ to understand the necessary fields and structure of each kind of file.
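For example, to generate a dataset from a custom resources directory:

python generate.py --resources_dir <RESOURCES-DIR>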

If the custom dataset is in a language whose morphology is similar to English's, modifying the resources should be sufficient. For other languages, it may be necessary to write custom rules in the functions create_knowledge_sents and create_task_sents in texts.py. An example of a custom rule for the English a/an distinction is already present in the code.
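To illustrate, such a rule might look like the following minimal sketch. The helper name indefinite_article and its logic are hypothetical; the actual rule in texts.py may be implemented differently.

def indefinite_article(noun: str) -> str:
    # Naive heuristic: "an" before a vowel letter, "a" otherwise.
    # (Real usage follows sound, not spelling: "an hour", "a unicorn".)
    return "an" if noun[:1].lower() in "aeiou" else "a"

# Usage when filling an occupation into a template sentence:
for occupation in ["engineer", "doctor"]:
    print(f"Robin met {indefinite_article(occupation)} {occupation}.")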

Citation

@inproceedings{arodi-etal-2023-kitmus,
    title = "The {KITMUS} Test: Evaluating Knowledge Integration from Multiple Sources",
    author = {Arodi, Akshatha  and
      P{\"o}msl, Martin  and
      Suleman, Kaheer  and
      Trischler, Adam  and
      Olteanu, Alexandra  and
      Cheung, Jackie Chi Kit},
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.841",
    pages = "15088--15108",
}