Bert-Pretraining

The project is a Python module that facilitates BERT pretraining. The existing open-source solutions for pretraining this model are convoluted; this project simplifies the procedure. The project's goal is to open the code to the wider Machine Learning community so that ML practitioners can train their own BERT models on their own data. The code was created to train the latest iteration of VMware's BERT model (vBERT) and to help Machine Learning and Natural Language Processing researchers within VMware.

The demo notebook is located in the demo folder.


Setup

Env Setup

Set up a Python 3.7 or 3.8 virtual environment and install the requirements using
pip install . from within the root folder

or

pip install git+https://github.com/vmware-labs/bert-pretraining
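After installation, you can optionally check that your GPUs are visible to TensorFlow, which the underlying pretraining code and tfrecord inputs rely on. This is a minimal sketch and assumes TensorFlow is installed as part of the requirements:

```python
# Optional sanity check: list the GPUs TensorFlow can see before starting a
# pretraining run (the num_gpu config parameter should not exceed this count).
import tensorflow as tf

# On TensorFlow 1.x, use tf.config.experimental.list_physical_devices instead.
gpus = tf.config.list_physical_devices("GPU")
print(f"Visible GPUs: {len(gpus)}")
for gpu in gpus:
    print(gpu)
```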

Pretraining data

Create the pretraining data using create_pretraining_data.py from https://github.com/google-research/bert.

You can create a separate eval file if you want to evaluate your model's MLM and NSP accuracies on a separate eval set during training.

You can also split a single file into training and eval sets by using the split_ratio parameter in the config object.
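As an illustration, the data-generation step can be scripted from Python by shelling out to create_pretraining_data.py. This is a sketch only; the file paths are placeholders, and the exact flag set should be checked against the google-research/bert repository:

```python
# Illustrative wrapper around google-research/bert's create_pretraining_data.py.
# All paths below are placeholders; adjust them to your checkout, vocab, and corpus.
import subprocess

subprocess.run(
    [
        "python", "bert/create_pretraining_data.py",      # from google-research/bert
        "--input_file=./corpus/my_corpus.txt",            # one sentence per line, blank line between documents
        "--output_file=./input/demo_MSL128.tfrecord",     # tfrecord consumed by this module
        "--vocab_file=./bert_model/vocab.txt",            # vocab of the BERT checkpoint you start from
        "--do_lower_case=True",
        "--max_seq_length=128",                           # must match max_seq_length in Pretraining_Config
        "--max_predictions_per_seq=20",                   # must match max_predictions_per_seq
        "--masked_lm_prob=0.15",
        "--dupe_factor=5",
    ],
    check=True,
)
```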


Config

The pretraining parameters are handled through the Pretraining_Config class. Please follow Demo.ipynb to run a sample BERT pretraining.
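The snippet below is a minimal, illustrative sketch of how a configuration might be filled in, using the parameter names documented in the table below. The import path and the run_pretraining call are placeholders rather than the package's confirmed API; Demo.ipynb shows the actual usage:

```python
# Illustrative only: the module/function names are assumptions; the parameter
# names mirror the PRETRAINING_CONFIG PARAMS table in this README.
# Refer to Demo.ipynb for the real entry point.
from bert_pretraining import Pretraining_Config, run_pretraining  # hypothetical import path

config = Pretraining_Config()
config.model_name = "DEMOBERT"
config.is_base = True                               # BERT-Base rather than BERT-Large
config.max_seq_length = 128                         # must match the tfrecord file
config.max_predictions_per_seq = 20                 # must match the tfrecord file
config.num_train_steps = 1000
config.num_warmup_steps = 10                        # roughly 1% of training steps
config.learning_rate = 1e-05
config.train_batch_size = 32
config.input_file = "./input/demo_MSL128.tfrecord"
config.eval_file = None                             # or a separate eval tfrecord
config.split_ratio = None                           # e.g. 0.1 to carve an eval split from input_file
config.output_dir = "./ckpts"
config.log_csv = "./eval_results.csv"
config.num_gpu = 1

run_pretraining(config)                             # hypothetical; see Demo.ipynb
```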

PRETRAINING_CONFIG PARAMS

Parameter | Default Value | Description
model_name | DEMOBERT | Model name
is_base | True | Boolean to select between BERT-Base and BERT-Large
max_seq_length | 128 | Max sequence length (MSL); must be consistent with the tfrecord file (generate two separate files if you want to pretrain BERT with different MSLs, e.g. 128 and 512)
max_predictions_per_seq | 20 | Number of tokens masked for MLM; must be consistent with the tfrecord file
num_train_steps | 1000 | Number of steps to train the model for; training terminates if the end of the tfrecord file is reached (meaningful pretraining would require more training steps)
num_warmup_steps | 10 | Number of warmup steps; BERT uses 1% of training steps as warmup steps
learning_rate | 1e-05 | Model learning rate
train_batch_size | 32 | Training batch size (split across GPUs)
save_intermediate_checkpoints | True | Save a checkpoint every 'x' training steps, as set by save_intermediate_checkpoint_steps. A checkpoint is always saved at the end of training
save_intermediate_checkpoint_steps | 25000 | Saves a checkpoint after every 'x' training steps (not including warmup steps)
eval_batch_size | 32 | Evaluation batch size (split across GPUs)
max_eval_steps | 1000 | Number of steps to evaluate on when there is no separate eval file. If a separate eval file or a split_ratio is provided, the entire eval dataset is used for evaluation
eval_point | 1000 | Performs evaluation every 'x' training steps
split_ratio | None | Percentage of the training dataset to use for evaluation if you want to split the training tfrecord into train and eval datasets. If no split ratio is provided, the training file is used for evaluation (the number of eval steps is controlled by max_eval_steps)
init_checkpoint | None | If you are resuming training, provide the path to the previous checkpoint. If you are initializing training from a non-default checkpoint (BERT-Base, BERT-Large), provide the model checkpoint name/path
input_file | ./input/demo_MSL128.tfrecord | Input tfrecord file created using create_pretraining_data.py from https://github.com/google-research/bert
eval_file | None | If you want to use a separate eval dataset, provide an input tfrecord file created using create_pretraining_data.py from https://github.com/google-research/bert
log_csv | ./eval_results.csv | File which stores the evaluation results **
output_dir | ./ckpts | Directory to store the checkpoints
num_gpu | 3 | Number of GPUs to use for training

** The output log_csv file records the hyperparameters and evaluation results.

The demo.tfrecord file was created from the wikicorpus dataset.
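Once training has run, the evaluation log can be inspected with pandas. This is a sketch only; the exact column names written to the CSV are not specified here, so the snippet simply prints whatever the file contains:

```python
# Inspect the evaluation log written to log_csv (./eval_results.csv by default).
# Column names are whatever the module writes (hyperparameters plus MLM/NSP
# evaluation results); this sketch makes no assumptions about them.
import pandas as pd

results = pd.read_csv("./eval_results.csv")
print(results.columns.tolist())   # see which hyperparameters/metrics were logged
print(results.tail())             # most recent evaluation rows
```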


Contributing

The bert-pretraining project team welcomes contributions from the community. Before you start working with this project, please read and sign our Contributor License Agreement (https://cla.vmware.com/cla/1/preview). If you wish to contribute code and have not signed our Contributor License Agreement (CLA), our bot will prompt you to do so when you open a Pull Request. For any questions about the CLA process, please refer to our CONTRIBUTING.md.


License

Apache-2.0