
Official implementation of "GPT or BERT: why not both?"

Primary language: Python. License: MIT.

GPT or BERT: why not both?


Lucas Georges Gabriel Charpentier and David Samuel

University of Oslo
Language Technology Group


Paper
HuggingFace 100M model
HuggingFace 10M model
100M Dataset
10M Dataset



Abstract


We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack: GPT-BERT can be transparently used like any standard causal or masked language model. We test the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only or causal-only models. We openly release the models, training corpora and code.



This is the official repository for our BabyLM 2024 submission: GPT-BERT.



Warning: This repository is not yet complete.

Completed files/folders:

  • data
  • model_checkpoints
  • tokenizers
  • configs
  • tokenizer_creation
  • pretraining
  • corpus_tokenization

Incomplete files/folders:

  • evaluation

Content of this repository

  • ./tokenizer_creation/: Contains scripts for creating a tokenizer.
  • ./corpus_tokenization/: Contains scripts to tokenize a corpus.
  • ./pretraining/: Contains scripts to pre-train a model, as well as the model file itself, utils, optimizers, and the PyTorch datasets.
  • ./evaluation/: Contains folders for each benchmark evaluated in the paper. Each folder contains scripts to do fine-tuning (when relevant) and inference as well as a data folder containing the data of the benchmark.
  • ./data/: Folder containing the raw, preprocessed, and tokenized data for pretraining.
  • ./tokenizers/: Folder containing the tokenizers created, or needed for pretraining.
  • ./configs/: Folder containing the configuration files for models.
  • ./model_checkpoints/: Folder containing the pre-trained model checkpoints.
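
As a quick sanity check after cloning, the layout listed above can be verified with a few lines of Python. This is only a minimal sketch: the folder names are copied from the list above, while the helper `missing_folders` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names copied from the list above. The helper below is a
# hypothetical convenience and is not shipped with the repository.
EXPECTED_FOLDERS = (
    "tokenizer_creation",
    "corpus_tokenization",
    "pretraining",
    "evaluation",
    "data",
    "tokenizers",
    "configs",
    "model_checkpoints",
)


def missing_folders(repo_root):
    """Return the expected top-level folders that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing if every folder is present.
    for name in missing_folders("."):
        print(f"missing: {name}")
```

Folders such as data or model_checkpoints may legitimately be missing or empty before the tokenization and pretraining steps have been run, so a reported folder is a hint rather than an error.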


Code to pre-train (and evaluate) a model

This is a general guide to pretraining the model. For details on which files to run and what they do, see the README in each subfolder.

  1. (optional) If you do not have a tokenizer, or want to create a custom one, run the script(s) found in tokenizer_creation. The created tokenizers will be saved in tokenizers (unless otherwise specified).
  2. To tokenize the corpus, run the script in corpus_tokenization. The tokenized data will be saved in the folder data (unless otherwise specified). We tokenize before training for efficiency; if you do not want this, you will need to adapt the code in the scripts found in pretraining (specifically the dataset.py file).
  3. Create a config file for your model in the same style as the ones found in the configs folder. Alternatively, choose one of the existing ones.
  4. To pre-train your model, run one of the train_*.py scripts found in the pretraining folder. (More details can be found in the folder itself.)
  5. (optional) If you want to evaluate your model on the benchmarks used in the paper, the tasks and the code to run each evaluation can be found in evaluation. Note: so that each part can be used independently of the others, the model file is also included in each benchmark folder.
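
Since step 4 only names the scripts by the pattern train_*.py, a small helper can list what is available before you pick one. The following is a minimal sketch under that assumption: `find_training_scripts` and `build_command` are hypothetical names, and no command-line flags are shown because the arguments each script accepts are documented in the pretraining folder, not here.

```python
import sys
from pathlib import Path


def find_training_scripts(pretraining_dir="pretraining"):
    """Return the train_*.py scripts in pretraining_dir, sorted by name."""
    return sorted(Path(pretraining_dir).glob("train_*.py"))


def build_command(script, extra_args=()):
    """Build an argument list that runs script with the current interpreter.

    extra_args is a placeholder: consult the README in the pretraining
    folder for the arguments each script actually accepts.
    """
    return [sys.executable, str(script), *extra_args]


if __name__ == "__main__":
    # Run from the repository root to print one candidate command per script.
    for script in find_training_scripts():
        print(" ".join(build_command(script)))
```

The returned list can be passed to `subprocess.run` once the script's real arguments have been filled in.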


Please cite the following publication (arXiv; this will be updated once the BabyLM proceedings are out):

@misc{charpentier2024gptbertboth,
      title={GPT or BERT: why not both?}, 
      author={Lucas Georges Gabriel Charpentier and David Samuel},
      year={2024},
      eprint={2410.24159},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.24159}, 
}