Bert-Pretraining

The project is a Python module that facilitates BERT pretraining. The existing open-source solutions for pretraining this model are convoluted; this project simplifies the procedure. The project's goal is to open the code to the wider Machine Learning community so that ML practitioners can train their own BERT models on their own data. The code was created to train the latest iteration of VMware's BERT model (vBERT) and to help Machine Learning and Natural Language Processing researchers within VMware.

The demo notebook is located in the demo folder.


Setup

Env Setup

Set up a Python 3.7 or 3.8 virtual environment and install the requirements using
pip install . from within the root folder

or

pip install git+https://github.com/vmware-labs/bert-pretraining
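After installation, you can optionally check that your GPUs are visible to TensorFlow, which the underlying pretraining code and tfrecord inputs rely on. This is a minimal sketch and assumes TensorFlow is installed as part of the requirements:

```python
# Optional sanity check: list the GPUs TensorFlow can see before starting a
# pretraining run (the num_gpu config parameter should not exceed this count).
import tensorflow as tf

# On TensorFlow 1.x, use tf.config.experimental.list_physical_devices instead.
gpus = tf.config.list_physical_devices("GPU")
print(f"Visible GPUs: {len(gpus)}")
for gpu in gpus:
    print(gpu)
```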

Pretraining data

Create the pretraining data using create_pretraining_data.py from https://github.com/google-research/bert.

You can create a separate eval file if you want to evaluate your model's MLM and NSP accuracies on a separate eval set during training.

You can also split a single file into training and eval sets by using the split_ratio parameter in the config object.
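As an illustration, the data-generation step can be scripted from Python by shelling out to create_pretraining_data.py. This is a sketch only; the file paths are placeholders, and the exact flag set should be checked against the google-research/bert repository:

```python
# Illustrative wrapper around google-research/bert's create_pretraining_data.py.
# All paths below are placeholders; adjust them to your checkout, vocab, and corpus.
import subprocess

subprocess.run(
    [
        "python", "bert/create_pretraining_data.py",      # from google-research/bert
        "--input_file=./corpus/my_corpus.txt",            # one sentence per line, blank line between documents
        "--output_file=./input/demo_MSL128.tfrecord",     # tfrecord consumed by this module
        "--vocab_file=./bert_model/vocab.txt",            # vocab of the BERT checkpoint you start from
        "--do_lower_case=True",
        "--max_seq_length=128",                           # must match max_seq_length in Pretraining_Config
        "--max_predictions_per_seq=20",                   # must match max_predictions_per_seq
        "--masked_lm_prob=0.15",
        "--dupe_factor=5",
    ],
    check=True,
)
```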


Config

The pretraining parameters are handled through the Pretraining_Config class. Please follow Demo.ipynb to run a sample BERT pretraining.
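The snippet below is a minimal, illustrative sketch of how a configuration might be filled in, using the parameter names documented in the table below. The import path and the run_pretraining call are placeholders rather than the package's confirmed API; Demo.ipynb shows the actual usage:

```python
# Illustrative only: the module/function names are assumptions; the parameter
# names mirror the PRETRAINING_CONFIG PARAMS table in this README.
# Refer to Demo.ipynb for the real entry point.
from bert_pretraining import Pretraining_Config, run_pretraining  # hypothetical import path

config = Pretraining_Config()
config.model_name = "DEMOBERT"
config.is_base = True                               # BERT-Base rather than BERT-Large
config.max_seq_length = 128                         # must match the tfrecord file
config.max_predictions_per_seq = 20                 # must match the tfrecord file
config.num_train_steps = 1000
config.num_warmup_steps = 10                        # roughly 1% of training steps
config.learning_rate = 1e-05
config.train_batch_size = 32
config.input_file = "./input/demo_MSL128.tfrecord"
config.eval_file = None                             # or a separate eval tfrecord
config.split_ratio = None                           # e.g. 0.1 to carve an eval split from input_file
config.output_dir = "./ckpts"
config.log_csv = "./eval_results.csv"
config.num_gpu = 1

run_pretraining(config)                             # hypothetical; see Demo.ipynb
```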

PRETRAINING_CONFIG PARAMS

Parameter | Default Value | Description
model_name | DEMOBERT | Model name
is_base | True | Boolean to select between BERT-Base and BERT-Large
max_seq_length | 128 | Max sequence length (MSL); must be consistent with the tfrecord file (generate two separate files if you want to pretrain BERT with different MSLs, e.g. 128 and 512)
max_predictions_per_seq | 20 | Number of tokens masked for MLM; must be consistent with the tfrecord file
num_train_steps | 1000 | Number of steps to train the model for; training terminates if the end of the tfrecord file is reached (meaningful pretraining would require more training steps)
num_warmup_steps | 10 | Number of warmup steps; BERT uses 1% of training steps as warmup steps
learning_rate | 1e-05 | Model learning rate
train_batch_size | 32 | Training batch size (split across GPUs)
save_intermediate_checkpoints | True | Save a checkpoint every 'x' training steps, as set by save_intermediate_checkpoint_steps. A checkpoint is always saved at the end of training
save_intermediate_checkpoint_steps | 25000 | Saves a checkpoint after every 'x' training steps (not including warmup steps)
eval_batch_size | 32 | Evaluation batch size (split across GPUs)
max_eval_steps | 1000 | Number of steps to evaluate on when there is no separate eval file. If a separate eval file or a split_ratio is provided, the entire eval dataset is used for evaluation
eval_point | 1000 | Performs evaluation every 'x' training steps
split_ratio | None | Percentage of the training dataset to use for evaluation if you want to split the training tfrecord into train and eval datasets. If no split ratio is provided, the training file is used for evaluation (the number of eval steps is controlled by max_eval_steps)
init_checkpoint | None | If you are resuming training, provide the path to the previous checkpoint. If you are initializing training from a non-default checkpoint (BERT-Base, BERT-Large), provide the model checkpoint name/path
input_file | ./input/demo_MSL128.tfrecord | Input tfrecord file created using create_pretraining_data.py from https://github.com/google-research/bert
eval_file | None | If you want to use a separate eval dataset, provide an input tfrecord file created using create_pretraining_data.py from https://github.com/google-research/bert
log_csv | ./eval_results.csv | File which stores the evaluation results **
output_dir | ./ckpts | Directory to store the checkpoints
num_gpu | 3 | Number of GPUs to use for training

** The output log_csv file records the hyperparameters and evaluation results.

The demo.tfrecord file was created from the wikicorpus dataset.
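Once training has run, the evaluation log can be inspected with pandas. This is a sketch only; the exact column names written to the CSV are not specified here, so the snippet simply prints whatever the file contains:

```python
# Inspect the evaluation log written to log_csv (./eval_results.csv by default).
# Column names are whatever the module writes (hyperparameters plus MLM/NSP
# evaluation results); this sketch makes no assumptions about them.
import pandas as pd

results = pd.read_csv("./eval_results.csv")
print(results.columns.tolist())   # see which hyperparameters/metrics were logged
print(results.tail())             # most recent evaluation rows
```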


Contributing

The bert-pretraining project team welcomes contributions from the community. Before you start working with this project, please read and sign our Contributor License Agreement (https://cla.vmware.com/cla/1/preview). If you wish to contribute code and have not signed our Contributor License Agreement (CLA), our bot will prompt you to do so when you open a Pull Request. For any questions about the CLA process, please refer to our CONTRIBUTING.md.


License

Apache-2.0