EXAMS: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering

EXAMS is a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations. It contains more than 24,000 high-quality high school exam questions in 26 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences, among others. EXAMS offers a fine-grained evaluation framework across multiple languages and subjects, which allows precise analysis and comparison of various models.

This repository contains links to the data, the models, and a set of scripts for preparing the dataset and evaluating new models.

For more details on how the dataset was created, as well as the baseline models tested for multilingual and cross-lingual transfer, please refer to our paper: EXAMS: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering.

Dataset

The data can be downloaded from here: (1) Multilingual, (2) Cross-lingual

The two testbeds are described in the paper (also on arXiv). The files are in jsonl format and follow the ARC Dataset's structure. Each file is named using the following pattern: data/exams/{testbed}/{subset}.jsonl

We also provide the questions with the resolved contexts from Wikipedia articles. These files are in the with_paragraphs folder and are named {subset}_with_para.jsonl.
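
A minimal sketch of reading one of the question files is shown below. It assumes the ARC-style fields mentioned above (a question stem, a list of labeled choices, and an answerKey); verify the exact field names against the downloaded files.

import json

# Minimal sketch: print the first question from a (hypothetical) local copy of
# the multilingual training set. The field names are assumed to follow the
# ARC-style structure described above.
with open("data/exams/multilingual/train.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        print(example["question"]["stem"])
        for choice in example["question"]["choices"]:
            print(f"  ({choice['label']}) {choice['text']}")
        print("Answer:", example.get("answerKey"))
        break  # only the first question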

Multilingual

In this setup, we want to train and evaluate a given model on multiple languages, and thus we need multilingual training, validation, and test sets. To include as many languages as possible, we first split the questions independently for each language L into TrainL, DevL, and TestL with 37.5%, 12.5%, and 50% of the examples, respectively (an illustrative sketch of this split is shown after the table below).

*For languages with fewer than 900 examples, we only have TestL.

Language Train Dev Test
Albanian 565 185 755
Arabic - - 562
Bulgarian 1,100 365 1,472
Croatian 1,003 335 1,541
French - - 318
German - - 577
Hungarian 707 263 1,297
Italian 464 156 636
Lithuanian - - 593
Macedonian 778 265 1,032
Polish 739 246 986
Portuguese 346 115 463
Serbian 596 197 844
Spanish - - 235
Turkish 747 240 977
Vietnamese 916 305 1,222
Combined 7,961 2,672 13,510
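
Below is a small illustrative sketch of the per-language 37.5% / 12.5% / 50% split described above. It is not the repository's split script; the official splits are the ones shipped with the dataset.

import random

def split_language(examples, seed=42):
    """Illustrative 37.5% / 12.5% / 50% split of one language's questions."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_train = int(len(examples) * 0.375)
    n_dev = int(len(examples) * 0.125)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test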

Cross-lingual

In this setting, we want to explore the capability of a model to transfer its knowledge from a single source language Lsrc to a new, unseen target language Ltgt. To ensure a larger training set, we train the model on 80% of Lsrc, validate on the remaining 20% of the same language, and test on a subset of Ltgt.

For this setup, we offer per-language subsets for both the train and dev sets. The file naming pattern is {subset}_{lang}.jsonl, e.g., train_ar.jsonl, train_ar_with_para.jsonl, dev_bg.jsonl, etc.

Finally, in this setup, test.jsonl is the same file as in the Multilingual setup (a short path-construction sketch follows the table below).

Language Train Dev
Albanian 1,194 311
Arabic - -
Bulgarian 2,344 593
Croatian 2,341 538
French - -
German - -
Hungarian 1,731 536
Italian 1,010 246
Lithuanian - -
Macedonian 1,665 410
Polish 1,577 394
Portuguese 740 184
Serbian 1,323 314
Spanish - -
Turkish 1,571 393
Vietnamese 1,955 488
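
The following sketch shows one way to locate the per-language cross-lingual files, following the data/exams/{testbed}/{subset}.jsonl pattern from above. The testbed folder names used here are an assumption; adjust them to match the extracted archives.

from pathlib import Path

# Assumed local layout following data/exams/{testbed}/{subset}.jsonl;
# the "crosslingual" and "multilingual" folder names are illustrative.
DATA_ROOT = Path("data/exams")

def crosslingual_files(src_lang):
    """Return train/dev paths for a source language plus the shared test file."""
    train = DATA_ROOT / "crosslingual" / f"train_{src_lang}.jsonl"
    dev = DATA_ROOT / "crosslingual" / f"dev_{src_lang}.jsonl"
    # test.jsonl is shared with the multilingual setup, as noted above.
    test = DATA_ROOT / "multilingual" / "test.jsonl"
    return train, dev, test

print(crosslingual_files("bg"))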

Parallel Questions

The EXAMS dataset contains 10,000 parallel questions, so we also provide the mappings between questions in jsonl format. Each row of the file maps a question id to a list of parallel question ids in other languages.
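
A minimal sketch for loading the mapping is shown below; the file name and per-row structure (a single {question id: [parallel ids]} object) are assumptions, so inspect one line of the released file to confirm the schema.

import json

# Build an in-memory map: question id -> list of parallel question ids.
# The file name and row structure here are assumptions.
parallel = {}
with open("parallel_questions.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        for question_id, parallel_ids in row.items():
            parallel[question_id] = parallel_ids

print(len(parallel), "questions have parallel counterparts")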

Resolved Hits

We also release the resolved hits from ElasticSearch, including links to the Wikipedia pages, their titles, and the relevance scores returned by the engine. The hits are available in a tar.gz archive containing a jsonl file with the aforementioned fields.
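
The archive can be read directly with Python's tarfile module, as sketched below; the archive name is a placeholder, and the per-hit fields should match the description above (Wikipedia link, title, and relevance score), but verify them against the released file.

import json
import tarfile

# Sketch: stream the jsonl file(s) inside the hits archive without extracting
# it to disk. The archive name is a placeholder.
with tarfile.open("resolved_hits.tar.gz", "r:gz") as archive:
    for member in archive.getmembers():
        if not member.name.endswith(".jsonl"):
            continue
        with archive.extractfile(member) as hits_file:
            for line in hits_file:
                hit = json.loads(line)
                # Expected fields per the description above: Wikipedia page
                # link, title, and the ElasticSearch relevance score.
                print(hit)
                break
        break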

Training and Evaluation

For both scripts, the supported values for the (multilingual) model type ($MODEL_TYPE) are: "bert", "xlm-roberta", "bert-kb", "xlm-roberta-kb".

The paragraph type ($PARA_TYPE) modes are: 'per_choice', 'concat_choices', and 'ignore'.

When using EXAMS with run_multiple_choice, one should pass --task_name exams; for other datasets, use the name suitable for that task, e.g., arc or race.

Training

We use HuggingFace's scripts for training the models, with slight modifications to allow for 3- to 5-way multiple-choice questions. The Python scripts are available under the scripts/experiments folder.

Here is an example:

python ./scripts/experiments/run_multiple_choice.py \
    --model_type $MODEL_TYPE \
    --task_name $TASK_NAME \
    --tb_log_dir runs/${TRAIN_OUTPUT_SUBDIR}/$RUN_SETTING_NAME \
    --model_name_or_path $TRAINED_MODEL_DIR \
    --do_train \
    --do_eval \
    --warmup_proportion ${WARM_UP} \
    --evaluate_during_training \
    --logging_steps ${LOGGING_STEPS} \
    --save_steps ${LOGGING_STEPS} \
    --data_dir $TRAIN_DATA_DIR \
    --learning_rate $LEARNING_RATE \
    --num_train_epochs $MAX_EPOCHS \
    --max_seq_length $MAX_SEQ_LENGTH \
    --output_dir $TRAIN_OUTPUT \
    --weight_decay $WEIGHT_DECAY \
    --overwrite_cache \
    --per_gpu_eval_batch_size=$EVAL_BATCH_SIZE \
    --per_gpu_train_batch_size=$BATCH_SIZE \
    --gradient_accumulation_steps $GRADIENT_ACC_STEPS \
    --overwrite_output 

Evaluation

We provide an evaluation script that allows fine-grained evaluation at both the subject and the language level. The script is available at scripts/evaluation/evaluate_exams.py.

Example usage:

python evaluate_exams.py \
    --predictions_path predictions.json \
    --dataset_path dev.jsonl \
    --granularity all \
    --output_path results.json

The possible granularities that the script supports are: language, subject, subject_and_language, and all (which includes all of the other options).

A sample predictions file can be found here: sample_predictions.jsonl.
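
The exact schema of the predictions file is easiest to check against that sample; the sketch below only illustrates the general idea of pairing each question id with a predicted choice label, and both the ids and the field names here are hypothetical.

import json

# Hypothetical predictions: question id -> predicted choice label.
predictions = {"q-0001": "A", "q-0002": "C"}

# Write one JSON object per line; compare the result against
# sample_predictions.jsonl and adjust the field names if they differ.
with open("predictions.json", "w", encoding="utf-8") as f:
    for question_id, label in predictions.items():
        f.write(json.dumps({"id": question_id, "prediction": label}) + "\n")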

Predictions

The following script can be used to obtain predictions from pre-trained models.

python ./scripts/experiments/run_multiple_choice.py \
    --model_type $MODEL_TYPE \
    --task_name exams \
    --do_test \
    --para_type per_choice \
    --model_name_or_path $TRAINED_MODEL_DIR \
    --data_dir $INPUT_DATA_DIR \
    --max_seq_length $MAX_SEQ_LENGTH \
    --output_dir $OUTPUT_DIR \
    --per_gpu_eval_batch_size=$EVAL_BATCH_SIZE \
    --overwrite_cache \
    --overwrite_output

Contexts

The scripts used for downloading the Wikipedia articles and for context resolution can be found in the scripts/dataset folder.

Baselines

The EXAMS paper presents several baselines for zero-shot and few-shot training using publicly available multiple-choice datasets: RACE, ARC, OpenBookQA, and Regents.

Multilingual

  • The (Full) models are trained on all aforementioned datasets, including EXAMS.
Lang/Set ar bg de es fr hr hu it lt mk pl pt sq sr tr vi All
Random Guess 25.0 25.0 29.4 32.0 29.4 26.7 27.7 26.0 25.0 25.0 25.0 25.0 25.0 26.2 23.1 25.0 25.9
IR (Wikipedia) 31.0 29.6 29.3 27.2 32.1 31.9 29.7 27.6 29.8 32.2 29.2 27.5 25.3 31.8 28.5 27.5 29.5
XLM-R on RACE 39.1 43.9 37.2 40.0 37.4 38.8 39.9 36.9 40.5 45.9 33.9 37.4 42.3 35.6 37.1 35.9 39.1
w/ SciENs 39.1 44.2 35.5 37.9 37.1 38.5 37.9 39.5 41.3 49.8 36.1 39.3 42.5 37.4 37.4 35.9 39.6
then on Eχαμs (Full) 40.7 47.2 39.7 42.1 39.6 41.6 40.2 40.6 40.6 53.1 38.3 38.9 44.6 39.6 40.3 37.5 42.0
XLM-RBase (Full) 34.5 35.7 36.7 38.3 36.5 35.6 33.3 33.3 33.2 41.4 30.8 29.8 33.5 32.3 30.4 32.1 34.1
mBERT (Full) 34.5 39.5 35.3 40.9 34.9 35.3 32.7 36.0 34.4 42.1 30.0 29.8 30.9 34.3 31.8 31.7 34.6

References

Please cite as [1]. There is also an arXiv version.

[1] M. Hardalov, T. Mihaylov, D. Zlatkova, Y. Dinkov, I. Koychev, P. Nakov "EXAMS: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering", Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

@inproceedings{hardalov-etal-2020-exams,
    title = "{EXAMS}: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering",
    author = "Hardalov, Momchil  and
      Mihaylov, Todor  and
      Zlatkova, Dimitrina  and
      Dinkov, Yoan  and
      Koychev, Ivan  and
      Nakov, Preslav",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.438",
    pages = "5427--5444",
    series = "EMNLP~'20"
}

License

The dataset, which contains paragraphs from Wikipedia, is licensed under CC-BY-SA 4.0. The code in this repository is licensed under the Apache 2.0 License.