This code provides the backend for the BabyLM Challenge's evaluation pipeline.
We provide support for zero-shot evaluations on BLiMP, as well as scripts for fine-tuning HuggingFace-based models on GLUE tasks.
We also provide a Colab demo of the evaluation pipeline as a demonstration of how to use the code.
If you have questions about or suggestions for this code, please open an issue and consider joining our Slack. We also welcome pull requests!
We adapt this primarily from the BigScience fork of lm-eval-harness, originally by EleutherAI. Support for masked language models was made possible by minicons' implementation of MLM scoring (itself based on code by Salazar et al. (2020)).
To install dependencies, run the following:

```bash
git clone https://github.com/babylm/evaluation-pipeline
cd evaluation-pipeline
pip install -e ".[dev]"
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
```
If your GPU is compatible with CUDA 10.2 rather than CUDA 11.3, replace all instances of `cu113` with `cu102`.
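To confirm that the CUDA-enabled PyTorch build installed correctly, a quick sanity check can help; this command is illustrative and not part of the pipeline's own scripts:

```bash
# Prints the installed torch version and whether a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```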
We provide versions of BLiMP and GLUE which have been filtered according to the vocabulary of the strict-small dataset. We filter for examples where each word has appeared in our training set at least twice.
Unzip the dataset into the root directory of this repository: `unzip filter_data.zip`.
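To check that the data unpacked where you expect, you can list its contents; this assumes the archive extracts to a `filter_data/` directory, which is an assumption based on the archive name rather than something documented here:

```bash
# List a few of the filtered evaluation files (directory name assumed from the zip name)
ls filter_data | head
```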
To evaluate a model on zero-shot tasks like BLiMP:
```bash
python babylm_eval.py 'path/to/model_and_tokenizer' 'model_type'
```
where `model_type` is one of "encoder", "decoder", or "encoder-decoder".
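For example, to score a locally saved decoder-only (autoregressive) model, the invocation might look like the following; the model path is purely illustrative:

```bash
# 'decoder' selects causal-LM scoring; use 'encoder' for masked LMs and 'encoder-decoder' for seq2seq models
python babylm_eval.py 'models/my-opt-125m' 'decoder'
```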
To fine-tune and evaluate a model on tasks that require fine-tuning, like the (Super)GLUE tasks:
```bash
./finetune_all_tasks.sh 'path/to/model_and_tokenizer'
```
This script contains hyperparameter defaults that should work for a variety of model sizes, architectures, and tasks. You may adjust these hyperparameters as you wish, though we ask that you submit the best hyperparameter settings in a README file if you don't use the defaults.
Here are the defaults that we use:
Hyperparameter | Value |
---|---|
Initial learning rate | 5e-5 |
Batch size | 64 |
Maximum epochs | 10 |
Evaluate every (steps) | 200 |
Patience | 10 |
Random seed | 12 |
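If you adapt these defaults to your own fine-tuning code, they map fairly directly onto standard HuggingFace Transformers training arguments. The sketch below uses flag names from the stock `run_glue.py`-style example scripts, not necessarily this repository's own scripts, and patience is typically handled with an `EarlyStoppingCallback` rather than a command-line flag:

```bash
# Illustrative mapping of the defaults above onto HuggingFace example-script flags
# (flag names come from transformers' run_glue.py; this is an assumption, not this repo's exact interface)
python run_glue.py \
  --model_name_or_path path/to/model_and_tokenizer \
  --task_name cola \
  --do_train --do_eval \
  --learning_rate 5e-5 \
  --per_device_train_batch_size 64 \
  --num_train_epochs 10 \
  --evaluation_strategy steps \
  --eval_steps 200 \
  --seed 12 \
  --load_best_model_at_end \
  --output_dir results/cola
```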
We provide a shell script that will collect your results into a single file:
```bash
./collect_results.sh path/to/model_and_tokenizer
```
We will ask you to share your results, model, and tokenizer. We will evaluate on held-out tasks (TBA) as part of the final evaluation.
We provide a series of baseline models trained on our strict and strict-small datasets; these are hosted on HuggingFace. We simply take the hyperparameters used to pre-train the original versions of these models and train them on our data. Aside from reducing the context length and, in some cases, the batch size, the training setups are otherwise minimally modified.
Here are the baseline scores. All numbers are accuracies, except for columns marked (F1), where we report macro-F1. Random-chance accuracy on all BLiMP tasks is 50.
Strict-small Track
BLiMP
Model | Anaphor Agr. | Arg. Structure | Binding | Control/Raising | D-N Agr. | Ellipsis | Filler-Gap | Irregular Forms | Island Effects | NPI Licensing | Quantifiers | S-V Agr. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
OPT-125m | 63.8 | 70.6 | 67.1 | 66.5 | 78.5 | 62 | 63.8 | 67.5 | 48.6 | 46.7 | 59.6 | 56.9 |
RoBERTa-base | 81.5 | 67.1 | 67.3 | 67.9 | 90.8 | 76.4 | 63.5 | 87.4 | 39.9 | 55.9 | 70.5 | 65.4 |
T5-base | 68.9 | 63.8 | 60.4 | 60.9 | 72.2 | 34.4 | 48.2 | 77.6 | 45.6 | 47.8 | 61.2 | 65.0 |
(Super)GLUE
Model | CoLA | SST-2 | MRPC (F1) | QQP (F1) | MNLI | MNLI-mm | QNLI | RTE | BoolQ | MultiRC | WSC |
---|---|---|---|---|---|---|---|---|---|---|---|
Majority label | 69.5 | 50.2 | 82 | 53.1 | 35.7 | 35.7 | 35.4 | 53.1 | 50.5 | 59.9 | 53.2 |
OPT-125m | 64.6 | 81.9 | 72.5 | 60.4 | 57.6 | 60.0 | 61.5 | 60.0 | 63.3 | 55.2 | 60.2 |
RoBERTa-base | 70.8 | 87.0 | 79.2 | 73.7 | 73.2 | 74.0 | 77.0 | 61.6 | 66.3 | 61.4 | 61.4 |
T5-base | 61.2 | 78.1 | 80.5 | 66.2 | 48.0 | 50.3 | 62.0 | 49.4 | 66.0 | 47.1 | 61.4 |
Strict Track
BLiMP
Model | Anaphor Agr. | Arg. Structure | Binding | Control/Raising | D-N Agr. | Ellipsis | Filler-Gap | Irregular Forms | Island Effects | NPI Licensing | Quantifiers | S-V Agr. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
OPT-125m | 94.9 | 73.8 | 73.8 | 72.2 | 93.1 | 80.5 | 73.6 | 80.8 | 57.8 | 51.6 | 74.5 | 77.3 |
RoBERTa-base | 89.5 | 71.3 | 71 | 67.1 | 93.1 | 83.8 | 68.0 | 89.6 | 54.5 | 66.3 | 70.3 | 76.2 |
T5-base | 66.7 | 61.2 | 59.4 | 59.8 | 53.8 | 49.1 | 70.0 | 75.5 | 43.6 | 45.6 | 34.2 | 53.2 |
(Super)GLUE
Model | CoLA | SST-2 | MRPC (F1) | QQP (F1) | MNLI | MNLI-mm | QNLI | RTE | BoolQ | MultiRC | WSC |
---|---|---|---|---|---|---|---|---|---|---|---|
Majority label | 69.5 | 50.2 | 82 | 53.1 | 35.7 | 35.7 | 35.4 | 53.1 | 50.5 | 59.9 | 53.2 |
OPT-125m | 73.7 | 86.6 | 82.1 | 77.8 | 70.1 | 71.9 | 80.1 | 67.7 | 66.0 | 61.1 | 59.0 |
RoBERTa-base | 75.9 | 88.6 | 80.5 | 78.5 | 68.7 | 78.0 | 82.3 | 51.5 | 59.9 | 61.3 | 61.4 |
T5-base | 76.3 | 88.0 | 85.9 | 79.7 | 71.5 | 74.0 | 83.1 | 60.6 | 69.0 | 62.4 | 60.2 |
These are naïve baselines that are meant to provide a starting point for investigation. We look forward to seeing how you will improve upon these!
We provide the code used to train these baselines. We do not recommend using it as-is for your own models: it loads pre-trained tokenizers from HuggingFace instead of training them from scratch on the BabyLM data, and using pre-trained tokenizers does not qualify for any of our tracks. That said, in some quick preliminary experiments we found that simply training tokenizers on the BabyLM data often outperforms these baselines!
If you use the datasets or code from this repository, please cite the BabyLM Call for Papers:
```bibtex
@article{warstadt2023papers,
    title = {Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus},
    author = {Warstadt, Alex and
              Choshen, Leshem and
              Mueller, Aaron and
              Williams, Adina and
              Wilcox, Ethan and
              Zhuang, Chengxu},
    year = {2023},
    journal = {Computing Research Repository},
    volume = {arXiv:2301.11796}
}
```
Please also cite the lm-eval-harness paper:
```bibtex
@software{eval-harness,
    author = {Gao, Leo and
              Tow, Jonathan and
              Biderman, Stella and
              Black, Sid and
              DiPofi, Anthony and
              Foster, Charles and
              Golding, Laurence and
              Hsu, Jeffrey and
              McDonell, Kyle and
              Muennighoff, Niklas and
              Phang, Jason and
              Reynolds, Laria and
              Tang, Eric and
              Thite, Anish and
              Wang, Ben and
              Wang, Kevin and
              Zou, Andy},
    title = {A framework for few-shot language model evaluation},
    month = sep,
    year = 2021,
    publisher = {Zenodo},
    version = {v0.0.1},
    doi = {10.5281/zenodo.5371628},
    url = {https://doi.org/10.5281/zenodo.5371628}
}
```