This repo is based on 🤗 Transfomers implementation of the T5 model and BERT. T5 data processing pipeline is used from the original T5 repository for pre-training (span corruption, prefix-lm) and fine-tuning. BERT data processing pipeline is used from Megatron-LM.
Multi-gpu and multi-node training with Horovod is supported. APEX is used for FP16 and mixed-precision training. Sparse Attention from DeepSpeed is used.
BERT model supports such additional features as pre-attention layer norm, sparse attention, relative position and rotary embeddings.
T5 and BERT pre-training is implemented in run_(model_type)_pretraining.py
scripts.
Training tools, such as Trainer, are in lm_experiments_tools
package.
Full requirements for all experiments are specified in requirements.txt. Install requirements after cloning the repo:
grep -v "^#" requirements.txt | xargs -n 1 -L 1 pip install
Currenty, T5 text-to-text installation might install tf2.8.0+, downgrade TF related packages with:
pip install tensorflow==2.6.0 tensorflow-estimator==2.6.0 tensorflow-text==2.6.0 tensorflow-io-gcs-filesystem==0.21.0 keras==2.6.0
todo: reorder reqs in requirements.txt.
lm_experiments_tools
include Trainer with multi-gpu/node with Horovod and APEX FP16 for models compatible with
HF interface. Most of the scripts in the repo use Trainer from lm_experiments_tools
.
note: install torch and horovod according to your setup before
lm_experiments_tools
installation.
pip install -e .
This command will install lm_experiments_tools
with only required packages for Trainer and tools.
lm_experiments_tools
Trainer supports gradient accumulation, logging to tensorboard, saving the best models
based on metrics, custom metrics and data transformations support.
Depending on your setup just pip install horovod==0.24.2
might work.
Building Horovod with NCCL for PyTorch:
HOROVOD_NCCL_HOME=... HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod[pytorch]==0.24.2 --no-binary=horovod
check installation with
horovodrun --check-build
For further details check Horovod documentation: https://horovod.readthedocs.io/en/stable/install_include.html
Install APEX https://github.com/NVIDIA/apex#quick-start
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
apex.amp is moved to torch.cuda.amp NVIDIA/apex#818, but:
speed: APEX O1
< torch.cuda.amp
< APEX O2
resources (unordered):
- https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html
- https://pytorch.org/docs/stable/notes/amp_examples.html
- https://spell.ml/blog/mixed-precision-training-with-pytorch-Xuk7YBEAACAASJam
- https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
- horovod/horovod#1089
- NVIDIA/apex#818
DeepSpeed Sparse attention supports only GPUs with compute compatibility >= 7 (V100, T4, A100), CUDA 10.1, 10.2, 11.0, or 11.1 and runs only in FP16 mode (as of DeepSpeed 0.6.0).
pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache
and check installation with
ds_report
Triton 1.1.1 brings x2 speed-up to sparse operations on A100, but DeepSpeed (0.6.5) currenly supports only triton 1.0.0. DeepSpeed fork with triton 1.1.1 support could be used in the cases where such speed-up is needed:
pip install triton==1.1.1
git clone https://github.com/yurakuratov/DeepSpeed.git
cd DeepSpeed
DS_BUILD_SPARSE_ATTN=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache
and run sparse ops tests with
cd tests/unit
pytest -v test_sparse_attention.py
export CUDA_VISIBLE_DEVICES=4,5; horovodrun --gloo -np 2 python run_t5_pretraining.py \
--batch_size 32 \
--gradient_accumulation_steps 2 \
--save_interval 100000 \
--log_interval 500 \
--iters 1100000 \
--data_path ~/data/ThePile/Wikipedia/preprocessed_shards \
--model_path ./runs/small_wiki_bs_128 \
--input_seq_len 512 \
--target_seq_len 192 \
--lr 5e-05 \
--weight_decay 1e-05 \
--model_cfg ./t5configs/t5-small.json \
--model_cls modeling_t5:T5ForConditionalGeneration
and continue interrupted training
export CUDA_VISIBLE_DEVICES=0,1,2,3; horovodrun --gloo -np 4 python run_t5_pretraining.py \
--batch_size 8 \
--gradient_accumulation_steps 4 \
--save_interval 75000 \
--log_interval 500 \
--iters 1000000 --data_path ~/data/ThePile/Wikipedia/preprocessed_shards \
--model_path ./runs/base_wiki_enc_only_cdq_fixed_pos_wo_tanh \
--input_seq_len 512 \
--target_seq_len 192 \
--lr 5e-05 \
--weight_decay 1e-05 \
--model_cls modeling_t5:T5ForConditionalGeneration \
--model_cfg t5configs/t5-base-only-cdQ.json \
--init_checkpoint ./runs/base_wiki_enc_only_cdq_fixed_pos_wo_tanh/model_150000.pth
python -m deeppavlov train config_name
Gradient accumulation for dp:T5Text2TextModel
, e.g.:
batch_size
: 32sub_batch_size
: 16
means that full batch of size batch_size
will be splited on two sub-batches of size sub_batch_size
to accumulate their gradients.
Base configuration files are at ./dp_configs/glue
Fine-tuning and evaluation could be done with command:
export CUDA_VISIBLE_DEVICES=6; python evaluate_model.py single \
--pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth \
--task-config ./dp_configs/glue \
--suffix bs_32/run_0 \
--train-batch-size 32
pretrained-checkpoint
is a path to pretrained checkpoint that would be trained and evaluated, task-config
is a
folder with DP configs (or single DP config), suffix
would be appended to a model path. Check evaluate_model.py
for
more details.
config: ./dp_configs/glue/glue_mixture.json
Use save_every_n_batches
parameter to save the model, set metrics: []
and evaluation_targets: []
in DP configs.
Train model on datasets mixture, check all available options in evaluate_model.py:train_mixture()
:
export CUDA_VISIBLE_DEVICES=1; python evaluate_model.py train-mixture \
--pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth \
--task-config ./dp_configs/glue/glue_mixture.json \
--suffix bs_128 \
--train-batch-size 128
Evaluation for all checkpoints in checkpoint
folder, saves best checkpoints and evaluation results:
export CUDA_VISIBLE_DEVICES=0; python evaluate_model.py mixture \
--checkpoint ./runs/small_wiki_bs_128/glue/mixture/bs_128/ \
--pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth \
--task-config ./dp_configs/glue \
--save-best
To get the best scores for all fine-tuned models and tasks run:
python evaluate_model.py collect-metrics \
--pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth --clean > report.txt
use --clean
option to delete all models checkpoints except the best ones for each task.
TBD
QQP is currently not available via tfds: tensorflow/datasets#3031
to hot-fix this go to the source code of installed tfds tensorflow_datasets/text/glue.py:215
and replace QQP data url with https://dl.fbaipublicfiles.com/glue/data/QQP.zip
WMT configs could be found in ./dp_configs/wmt
Training with Horovod+DeepPavlov:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7; horovodrun --gloo -np 8 python -m deeppavlov train ./dp_configs/ende_hvd.json
Multi-gpu training and evaluating with evaluate_model.py
(recommended):
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7; python evaluate_model.py single \
--pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth \
--task-config ./dp_configs/wmt/ende.json \
--suffix bs_128_hvd/run_0 \
--train-batch-size 16 \
--lr 5e-05
Data preprocessing readme
add --fp16
and --apex_opt_lvl O2
or --apex_opt_lvl O1
(default) as arguments to run_t5_pretraining.py
Adafactor was used to train such models as T5, BigBird, PaLM and others. Adafactor lowers required memory by keeping moving average of per-parameter second moments factorized.
Adafactor parameters:
scale_parameter
- lr is scaled by root mean square of parameter: lr * RMS(p)relative_step
- lr = 1/sqrt(step)warmup_init
- linear warm up from 1e-06 to 0.01 at 10k steps, works only in combination withrelative_step
Adafactor can be used with constant lr / lr schedulers. In this case, relative_step
and warmup_init
should be set to False. scale_parameter
is does not depend on learning rate schedules and can be used with external learning rates.
example for pretraining scripts:
--optimizer Adafactor --lr 1e-03 --scale_parameter \
--lr_scheduler constant_with_warmup --num_warmup_steps 10000
e.g. for DP config
"optimizer": "Adafactor",
"optimizer_parameters": {
"lr": 1e-03,
"weight_decay": 0.0,
"scale_parameter": true,
"relative_step": false,
"warmup_init": false
}
BERT model training supports sparse attentions from DeepSpeed.
DeepSpeed Sparse attention docpage -- https://www.deepspeed.ai/tutorials/sparse-attention.
SparseAttention parameters are passed to the model with HF model configuration file:
"sparse_config_cls": "deepspeed.ops.sparse_attention:BigBirdSparsityConfig",
"sparse_attention": {
"num_heads": 12,
"block": 16,
"different_layout_per_head": true,
"num_sliding_window_blocks": 1,
"num_global_blocks": 1,
"num_random_blocks": 1
}
You can also check bert_base_uncased-4L_sparse.json
config example in bert_configs
folder.