t5-experiments

This repo is based on 🤗 Transfomers implementation of the T5 model and BERT. T5 data processing pipeline is used from the original T5 repository for pre-training (span corruption, prefix-lm) and fine-tuning. BERT data processing pipeline is used from Megatron-LM.

Multi-gpu and multi-node training with Horovod is supported. APEX is used for FP16 and mixed-precision training. Sparse Attention from DeepSpeed is used.

BERT model supports such additional features as pre-attention layer norm, sparse attention, relative position and rotary embeddings.

T5 and BERT pre-training is implemented in run_(model_type)_pretraining.py scripts.

Training tools, such as Trainer, are in lm_experiments_tools package.

Install all requirements

Full requirements for all experiments are specified in requirements.txt. Install requirements after cloning the repo:

grep -v "^#" requirements.txt | xargs -n 1 -L 1 pip install

Currenty, T5 text-to-text installation might install tf2.8.0+, downgrade TF related packages with:

pip install tensorflow==2.6.0 tensorflow-estimator==2.6.0 tensorflow-text==2.6.0 tensorflow-io-gcs-filesystem==0.21.0 keras==2.6.0

todo: reorder reqs in requirements.txt.

Install lm_experiments_tools

lm_experiments_tools include Trainer with multi-gpu/node with Horovod and APEX FP16 for models compatible with HF interface. Most of the scripts in the repo use Trainer from lm_experiments_tools.

note: install torch and horovod according to your setup before lm_experiments_tools installation.

pip install -e .

This command will install lm_experiments_tools with only required packages for Trainer and tools.

lm_experiments_tools Trainer supports gradient accumulation, logging to tensorboard, saving the best models based on metrics, custom metrics and data transformations support.

Install Horovod

Depending on your setup just pip install horovod==0.24.2 might work.

Building Horovod with NCCL for PyTorch:

HOROVOD_NCCL_HOME=... HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod[pytorch]==0.24.2 --no-binary=horovod

check installation with

horovodrun --check-build

For further details check Horovod documentation: https://horovod.readthedocs.io/en/stable/install_include.html

Install APEX

Install APEX https://github.com/NVIDIA/apex#quick-start

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

apex.amp is moved to torch.cuda.amp NVIDIA/apex#818, but:

speed: APEX O1 < torch.cuda.amp < APEX O2

resources (unordered):

Install DeepSpeed

DeepSpeed Sparse attention supports only GPUs with compute compatibility >= 7 (V100, T4, A100), CUDA 10.1, 10.2, 11.0, or 11.1 and runs only in FP16 mode (as of DeepSpeed 0.6.0).

pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache

and check installation with

ds_report

Triron 1.1.1

Triton 1.1.1 brings x2 speed-up to sparse operations on A100, but DeepSpeed (0.6.5) currenly supports only triton 1.0.0. DeepSpeed fork with triton 1.1.1 support could be used in the cases where such speed-up is needed:

pip install triton==1.1.1
git clone https://github.com/yurakuratov/DeepSpeed.git
cd DeepSpeed
DS_BUILD_SPARSE_ATTN=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache

and run sparse ops tests with

cd tests/unit
pytest -v test_sparse_attention.py

T5 Pre-training

T5-small baseline

export CUDA_VISIBLE_DEVICES=4,5; horovodrun --gloo -np 2 python run_t5_pretraining.py \
        --batch_size 32 \
        --gradient_accumulation_steps 2 \
        --save_interval 100000 \
        --log_interval 500 \
        --iters 1100000 \
        --data_path ~/data/ThePile/Wikipedia/preprocessed_shards \
        --model_path ./runs/small_wiki_bs_128 \
        --input_seq_len 512 \
        --target_seq_len 192 \
        --lr 5e-05 \
        --weight_decay 1e-05 \
        --model_cfg ./t5configs/t5-small.json \
        --model_cls modeling_t5:T5ForConditionalGeneration

T5-base with custom layers:

and continue interrupted training

export CUDA_VISIBLE_DEVICES=0,1,2,3; horovodrun --gloo -np 4 python run_t5_pretraining.py \
        --batch_size 8 \
        --gradient_accumulation_steps 4 \
        --save_interval 75000 \
        --log_interval 500 \
        --iters 1000000 --data_path ~/data/ThePile/Wikipedia/preprocessed_shards \
        --model_path ./runs/base_wiki_enc_only_cdq_fixed_pos_wo_tanh \
        --input_seq_len 512 \
        --target_seq_len 192 \
        --lr 5e-05 \
        --weight_decay 1e-05 \
        --model_cls modeling_t5:T5ForConditionalGeneration \
        --model_cfg t5configs/t5-base-only-cdQ.json \
        --init_checkpoint ./runs/base_wiki_enc_only_cdq_fixed_pos_wo_tanh/model_150000.pth

T5 Fine-tuning with DeepPavlov

python -m deeppavlov train config_name

Gradient accumulation for dp:T5Text2TextModel, e.g.:

  • batch_size: 32
  • sub_batch_size: 16

means that full batch of size batch_size will be splited on two sub-batches of size sub_batch_size to accumulate their gradients.

Fine-tuning on GLUE

Base configuration files are at ./dp_configs/glue

Fine-tuning and evaluation could be done with command:

export CUDA_VISIBLE_DEVICES=6; python evaluate_model.py single \
        --pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth \
        --task-config ./dp_configs/glue \
        --suffix bs_32/run_0 \
        --train-batch-size 32

pretrained-checkpoint is a path to pretrained checkpoint that would be trained and evaluated, task-config is a folder with DP configs (or single DP config), suffix would be appended to a model path. Check evaluate_model.py for more details.

GLUE mixture from T5

config: ./dp_configs/glue/glue_mixture.json

Use save_every_n_batches parameter to save the model, set metrics: [] and evaluation_targets: [] in DP configs.

Train model on datasets mixture, check all available options in evaluate_model.py:train_mixture():

export CUDA_VISIBLE_DEVICES=1; python evaluate_model.py train-mixture \
        --pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth \
        --task-config ./dp_configs/glue/glue_mixture.json  \
        --suffix bs_128 \
        --train-batch-size 128

Evaluation for all checkpoints in checkpoint folder, saves best checkpoints and evaluation results:

export CUDA_VISIBLE_DEVICES=0; python evaluate_model.py mixture \
        --checkpoint ./runs/small_wiki_bs_128/glue/mixture/bs_128/ \
        --pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth \
        --task-config ./dp_configs/glue \
        --save-best

Collecting results

To get the best scores for all fine-tuned models and tasks run:

python evaluate_model.py collect-metrics \
        --pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth --clean > report.txt

use --clean option to delete all models checkpoints except the best ones for each task.

Prepare submission for GLUE Leaderboard:

TBD

QQP

QQP is currently not available via tfds: tensorflow/datasets#3031

to hot-fix this go to the source code of installed tfds tensorflow_datasets/text/glue.py:215 and replace QQP data url with https://dl.fbaipublicfiles.com/glue/data/QQP.zip

Fine-tuning on WMT

WMT configs could be found in ./dp_configs/wmt

Training with Horovod+DeepPavlov:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7; horovodrun --gloo -np 8 python -m deeppavlov train ./dp_configs/ende_hvd.json

Multi-gpu training and evaluating with evaluate_model.py (recommended):

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7; python evaluate_model.py single \
        --pretrained-checkpoint ./runs/small_wiki_bs_128/model_1100000.pth \
        --task-config ./dp_configs/wmt/ende.json \
        --suffix bs_128_hvd/run_0 \
        --train-batch-size 16 \
        --lr 5e-05

BERT pretraining

Data preprocessing readme

FP16 for pretraining

add --fp16 and --apex_opt_lvl O2 or --apex_opt_lvl O1 (default) as arguments to run_t5_pretraining.py

Adafactor optimizer

Adafactor was used to train such models as T5, BigBird, PaLM and others. Adafactor lowers required memory by keeping moving average of per-parameter second moments factorized.

Adafactor parameters:

  • scale_parameter - lr is scaled by root mean square of parameter: lr * RMS(p)
  • relative_step - lr = 1/sqrt(step)
  • warmup_init - linear warm up from 1e-06 to 0.01 at 10k steps, works only in combination with relative_step

Adafactor can be used with constant lr / lr schedulers. In this case, relative_step and warmup_init should be set to False. scale_parameter is does not depend on learning rate schedules and can be used with external learning rates.

example for pretraining scripts:

--optimizer Adafactor --lr 1e-03 --scale_parameter \
--lr_scheduler constant_with_warmup --num_warmup_steps 10000

e.g. for DP config

"optimizer": "Adafactor",
"optimizer_parameters": {
        "lr": 1e-03,
        "weight_decay": 0.0,
        "scale_parameter": true,
        "relative_step": false,
        "warmup_init": false
}

Sparse Attention

BERT model training supports sparse attentions from DeepSpeed.

DeepSpeed Sparse attention docpage -- https://www.deepspeed.ai/tutorials/sparse-attention.

Configure Sparse Attention

SparseAttention parameters are passed to the model with HF model configuration file:

"sparse_config_cls": "deepspeed.ops.sparse_attention:BigBirdSparsityConfig",
"sparse_attention": {
  "num_heads": 12,
  "block": 16,
  "different_layout_per_head": true,
  "num_sliding_window_blocks": 1,
  "num_global_blocks": 1,
  "num_random_blocks": 1
}

You can also check bert_base_uncased-4L_sparse.json config example in bert_configs folder.