LM Calibration

This repository contains the code for the paper "How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering".

Install

Our code is based mainly on T5 and mesh-tensorflow and runs on TPUs. Please follow the original T5 repository to set up TPUs properly. To install the required packages, download T5 (version 0.6.4) and mesh-tensorflow (version 0.1.16) and copy their source files into the t5 and mesh_tensorflow folders, respectively. Do not replace files that already exist in these folders: those are the files we modified for calibration.

Fine-tune

Run the following command to fine-tune a UnifiedQA model with either the softmax or the margin objective function. $tpu is the name of the TPU, $model_output is the location where the fine-tuned model is saved, and $objective is the objective function to use.

./finetune.sh $tpu 3B $model_output $objective uq_clean_train_ol_mix train mc
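
This README does not spell out the two objectives, so the following is only a rough sketch of what softmax-style and margin-style losses over candidate answers commonly look like. It assumes each candidate has a scalar score (for example, its log-likelihood under the model). The function names, the default margin value, and the exact form of the margin loss are illustrative assumptions, not the code that finetune.sh runs.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of candidate scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def softmax_loss(scores, gold_index):
    """Negative log of the gold candidate's probability after
    normalizing over all candidates (assumed form)."""
    return -log_softmax(scores)[gold_index]


def margin_loss(scores, gold_index, margin=1.0):
    """Hinge-style loss (assumed form): every incorrect candidate should
    score at least `margin` below the gold candidate."""
    gold = scores[gold_index]
    return sum(
        max(0.0, margin + s - gold)
        for i, s in enumerate(scores)
        if i != gold_index
    )
```

Both losses reach their minimum when the gold candidate clearly outscores the others; the margin variant stops penalizing a candidate once the gap exceeds the margin.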

Evaluate candidate answers

Run the following command to compute the probabilities of candidate answers. $score_output is the location where the output is saved, and 1103000 is the checkpoint to use.

./score.sh $tpu $score_output $model_output 1103000 uq_clean_test dev

Compute ECE

Run the following command to compute the expected calibration error (ECE) from the probabilities of candidate answers.

python cal.py --mix uq_clean_test --split dev --score $score_output
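
For readers unfamiliar with the metric, here is a minimal sketch of how ECE is commonly computed. It assumes equal-width confidence bins and takes as input one confidence per question (the probability of the predicted answer) together with a flag saying whether that prediction was correct. The function name, the default of 10 bins, and the input format are illustrative assumptions; cal.py may bin or read its inputs differently.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width bins over [0, 1]: the weighted average of
    |accuracy - mean confidence| across non-empty bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * num_bins
    acc_sum = [0.0] * num_bins
    count = [0] * num_bins

    for c, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * num_bins), num_bins - 1)
        conf_sum[idx] += c
        acc_sum[idx] += 1.0 if is_correct else 0.0
        count[idx] += 1

    ece = 0.0
    for k in range(num_bins):
        if count[k] == 0:
            continue
        avg_conf = conf_sum[k] / count[k]
        avg_acc = acc_sum[k] / count[k]
        ece += (count[k] / n) * abs(avg_acc - avg_conf)
    return ece
```

A lower value means the model's confidence tracks its accuracy more closely; a model that is always right and always fully confident has an ECE of 0.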