BERT pre-training on The Stack

Exploration of BERT-like models trained on The Stack.

  • Code used to train StarEncoder.

    • StarEncoder was fine-tuned for PII detection to pre-process the data used to train StarCoder.
  • This repo also contains functionality to train encoders with contrastive objectives.

  • More details.

To launch pre-training:

After installing requirements, training can be launched via the example launcher script:

./launcher.sh

Note that:

  • --train_data_name can be used to set the training dataset.

  • Hyperparameters can be changed in exp_configs.py.

    • The tokenizer to be used is treated as a hyperparameter and must also be set in exp_configs.py.
    • alpha weights the BERT losses (NSP+MLM) against the contrastive objective.
      • Setting alpha to 1 corresponds to the standard BERT objective (see the sketch after this list).
    • Token masking probabilities are set as separate hyperparameters, one for MLM and another for the contrastive loss.
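
As a rough illustration of how alpha and the masking probabilities fit together, the sketch below combines the losses linearly, so that alpha = 1 recovers the standard BERT objective. It is not the repo's actual code: the function name and configuration keys (combined_objective, mlm_masking_prob, contrastive_masking_prob, tokenizer) are illustrative assumptions, and the exact fields in exp_configs.py may differ.

import torch

def combined_objective(mlm_loss, nsp_loss, contrastive_loss, alpha=1.0):
    # Weigh the BERT losses (MLM + NSP) against the contrastive objective.
    # alpha = 1.0 corresponds to the standard BERT objective.
    bert_loss = mlm_loss + nsp_loss
    return alpha * bert_loss + (1.0 - alpha) * contrastive_loss

# Illustrative hyperparameters in the spirit of exp_configs.py (key names and values are assumptions):
config = {
    "tokenizer": "path/to/tokenizer",      # the tokenizer is treated as a hyperparameter
    "alpha": 0.5,                          # weight between the BERT losses and the contrastive objective
    "mlm_masking_prob": 0.15,              # masking probability used for the MLM loss
    "contrastive_masking_prob": 0.5,       # separate masking probability used for the contrastive loss
}

loss = combined_objective(
    mlm_loss=torch.tensor(2.1),
    nsp_loss=torch.tensor(0.3),
    contrastive_loss=torch.tensor(1.4),
    alpha=config["alpha"],
)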