Finetune overflow, can you offer some suggestions?
frankang opened this issue · 5 comments
Hello, I'm fine-tuning on roughly 10,000 QA samples of my own, but the results are poor (the lowest loss reached is lm loss 3.978439E+00 | loss scale 0.2 |), and when I test with the resulting checkpoint it performs worse than the untrained model. Training never really stabilizes from the start, and after a while the loss becomes NaN. Could you offer some suggestions?
With my 10,000 samples I set the training target to 3,000 updates (about 10 epochs). The training loss does keep decreasing, but "fp16 dynamic loss scale overflow!" keeps appearing, and around update 1,300 training collapses and the loss becomes NaN.
{
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 4,
"steps_per_print": 20,
"gradient_clipping": 1,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 50000000,
"allgather_bucket_size": 500000000,
"cpu_offload": true
},
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": true,
"loss_scale": 0,
"initial_scale_power": 20, #这个是新添加的,因为初始scale的值很大,用不了那么大的
"loss_scale_window": 64, #这个初始值貌似是1000,但是撑不了那么久就会再下降一格,所以暂时先改成64 了
"hysteresis": 3, #从2改3了
"min_loss_scale": 0.1#默认值1000,调0.1了,但还是崩
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-5,
"betas": [
0.9,
0.95
],
"eps": 1e-6, #原始值1e-8,会崩,看论文里微调bert用1e-6所以试试,还是崩
"weight_decay": 1e-2
}
},
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
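To make sure I understand these knobs, here is a simplified model of how I expect the fp16 dynamic loss scaling parameters above to interact (an illustration only, not DeepSpeed's actual implementation): the scale starts at 2**initial_scale_power, shrinks after a few overflows (hysteresis), grows back after loss_scale_window clean steps, and is floored at min_loss_scale.

def simulate_loss_scale(overflows, initial_scale_power=20, loss_scale_window=64,
                        hysteresis=3, min_loss_scale=0.1, scale_factor=2.0):
    """Toy model of dynamic loss scaling (not DeepSpeed's exact code)."""
    scale = 2.0 ** initial_scale_power
    overflow_budget = hysteresis   # overflows tolerated before shrinking the scale
    clean_steps = 0                # consecutive steps without overflow
    trace = []
    for overflow in overflows:
        if overflow:
            clean_steps = 0
            overflow_budget -= 1
            if overflow_budget <= 0:
                scale = max(scale / scale_factor, min_loss_scale)
                overflow_budget = hysteresis
        else:
            clean_steps += 1
            if clean_steps % loss_scale_window == 0:
                scale *= scale_factor   # grow back after a clean window
        trace.append(scale)
    return trace

# Frequent overflows quickly drive the scale from 2**20 down to min_loss_scale,
# which matches the "loss scale 0.1" values in the logs below.
print(simulate_loss_scale([True] * 100)[-1])   # -> 0.1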
Training script:
#! /bin/bash
# Change for multinode config
NUM_WORKERS=1
NUM_GPUS_PER_WORKER=1
MP_SIZE=1
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2"
HOST_FILE_PATH="/root/code/config/hostfile"
config_json="$script_dir/ds_config_2.9B_finetune.json"
gpt_options=" \
--finetune \
--experiment-name txl-2.9b \
--model-parallel-size ${MP_SIZE} \
--num-layers 32 \
--hidden-size 2560 \
--num-attention-heads 32 \
--seq-length 512 \
--max-position-embeddings 512 \
--mem-length 256 \
--load ${1} \
--no-load-optim \
--save ./checkpoints \
--save-interval 300 \
--train-iters 3000 \ # with my 10,000 samples and the current batch size, one epoch is roughly 300 updates, so this is 10 epochs
--resume-dataloader \
--train-data ${2} \
--xl-dataset \
--lazy-loader \
--tokenizer-type ChineseSPTokenizer \
--split 949,50,1 \
--distributed-backend nccl \
--lr-decay-style linear \
--lr-decay-ratio 0.5 \
--lr-decay-iters 3000 \
--no-load-lr-scheduler \ # without this line, printing lr_scheduler.num_iters shows the step count already at 160000 at the start, so the lr schedule cannot work properly; the --finetune option does not seem to take effect
--warmup 0.2 \ # 0.1 crashes even faster, so I changed it to 0.2
--checkpoint-activations \
--deepspeed-activation-checkpointing \
--transformer-xl \
--fp16 \
"
gpt_options="${gpt_options}
--deepspeed \
--deepspeed_config ${config_json} \
"
#run_cmd="${OPTIONS_NCCL} deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --hostfile ${HOST_FILE_PATH} --include localhost:3 pretrain_gpt2.py ${gpt_options}"
run_cmd="${OPTIONS_NCCL} deepspeed --include localhost:0,3,4,5 pretrain_gpt2.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x
An example from my training data:
{"prompt": "问题:你好,今天太晚就聊到这里,保重身体吧。 回答:", "text": "晚安!"}
Should the total batch size be larger, e.g. increased to 64? You could use more GPUs or increase gradient_accumulation_steps.
@duzx16 Hi, I tried a larger accumulation step (from 4 to 8) and more GPUs (from 4 to 6), but it crashed even faster. The two logs below show the results with effective batch sizes of 32 and 96 respectively. (These two runs did load the optimizer states from the checkpoint, though training was also unstable earlier when I did not load them.)
Log 1
Effective bsz = 4 * 2 * 4 = 32
"fp16": {
"enabled": true,
"loss_scale": 0,
"initial_scale_power": 20,
"loss_scale_window": 64,
"hysteresis": 3,
"min_loss_scale": 0.1
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-5,
"betas": [
0.9,
0.95
],
"eps": 1e-6,
"weight_decay": 1e-2
}
},
Pretrain GPT2 model
arguments:
transformer_xl ............... True
pretrained_bert .............. False
attention_dropout ............ 0.1
num_attention_heads .......... 32
hidden_size .................. 2560
intermediate_size ............ None
num_layers ................... 32
layernorm_epsilon ............ 1e-05
hidden_dropout ............... 0.1
max_position_embeddings ...... 512
vocab_size ................... 50048
deep_init .................... False
make_vocab_size_divisible_by . 128
cpu_optimizer ................ False
cpu_torch_adam ............... False
fp16 ......................... True
fp32_embedding ............... False
fp32_layernorm ............... False
fp32_tokentypes .............. False
fp32_allreduce ............... False
hysteresis ................... 2
loss_scale ................... None
loss_scale_window ............ 1000
min_scale .................... 1
experiment_name .............. txl-2.9b11-08-21-16
batch_size ................... 2
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing True
clip_grad .................... 1.0
train_iters .................. 3000
log_interval ................. 100
exit_interval ................ None
summary_dir ..................
seed ......................... 1234
reset_position_ids ........... False
reset_attention_mask ......... False
lr_decay_iters ............... 3000
lr_decay_style ............... linear
lr_decay_ratio ............... 0.5
lr ........................... 1e-05
warmup ....................... 0.2
save ......................... ./checkpoints/txl-2.9b11-08-21-16
save_interval ................ 300
no_save_optim ................ False
no_save_rng .................. False
load ......................... pretrained_models
no_load_optim ................ False
no_load_lr_scheduler ......... True
no_load_rng .................. False
finetune ..................... True
resume_dataloader ............ True
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. None
eval_iters ................... 100
eval_interval ................ 1000
eval_seq_length .............. None
eval_max_preds_per_seq ....... None
overlapping_eval ............. 32
cloze_eval ................... False
eval_hf ...................... False
load_openai .................. False
temperature .................. 1.0
top_p ........................ 0.0
top_k ........................ 0
out_seq_length ............... 256
hierarchical ................. False
model_parallel_size .......... 1
shuffle ...................... False
train_data ................... ['/data/frank/data/transformer-xl/前文问答问答问题回答.jsonl']
xl_dataset ................... True
use_npy_data_loader .......... False
train_data_path ..............
val_data_path ................
test_data_path ............... None
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 949,50,1
test_data .................... None
lazy_loader .................. True
loose_json ................... False
presplit_sentences ........... False
num_workers .................. 2
tokenizer_model_type ......... bert-large-uncased
tokenizer_path ............... tokenizer.model
tokenizer_type ............... ChineseSPTokenizer
not_pre_tokenize ............. False
cache_dir .................... None
use_tfrecords ................ False
seq_length ................... 512
mem_length ................... 256
max_preds_per_seq ............ None
sample_one_document .......... False
deepspeed .................... True
deepspeed_config ............. /data/frank/projects/Chinese-Transformer-XL/scripts/ds_config_2.9B_finetune.json
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 4
dynamic_loss_scale ........... True
gradient_accumulation_steps .. 4
persist_state ................ 0
lazy ......................... False
transpose .................... False
data_set_type ................ GPT2
samples_per_shard ............ 100
do_train ..................... 1
do_valid ..................... 1
do_test ...................... 1
eod_token .................... 50000
iteration .................... 0
iteration .................... 0
iteration 100/ 3000 | elapsed time per iteration (ms): 11014.7 | learning rate 1.483E-06 | lm loss 8.847247E+00 | loss scale 2048.0 |
after 100 iterations memory (MB) | allocated: 6181.9599609375 | max allocated: 8412.134765625 | cached: 19180.0 | max cached: 19180.0
iteration 200/ 3000 | elapsed time per iteration (ms): 11074.8 | learning rate 3.117E-06 | lm loss 8.033987E+00 | loss scale 512.0 |
iteration 300/ 3000 | elapsed time per iteration (ms): 11001.6 | learning rate 4.717E-06 | lm loss 7.306797E+00 | loss scale 256.0 |
iteration 400/ 3000 | elapsed time per iteration (ms): 12879.0 | learning rate 6.300E-06 | lm loss 6.758929E+00 | loss scale 64.0 |
iteration 500/ 3000 | elapsed time per iteration (ms): 11092.2 | learning rate 7.917E-06 | lm loss 6.286667E+00 | loss scale 8.0 |
iteration 600/ 3000 | elapsed time per iteration (ms): 11052.7 | learning rate 9.517E-06 | lm loss 5.763325E+00 | loss scale 4.0 |
iteration 700/ 3000 | elapsed time per iteration (ms): 12932.3 | learning rate 9.773E-06 | lm loss NAN | loss scale 4.0 |
Warning: NaN or Inf found in input tensor.
iteration 800/ 3000 | elapsed time per iteration (ms): 11020.1 | learning rate 9.457E-06 | lm loss 5.193296E+00 | loss scale 0.1 |
iteration 900/ 3000 | elapsed time per iteration (ms): 11070.0 | learning rate 9.133E-06 | lm loss 4.792100E+00 | loss scale 0.1 |
iteration 1000/ 3000 | elapsed time per iteration (ms): 12955.2 | learning rate 8.807E-06 | lm loss 4.454510E+00 | loss scale 0.1 |
validation loss at iteration 1000 | LM loss: 4.503653E+00 | LM PPL: 9.034661E+01
iteration 1100/ 3000 | elapsed time per iteration (ms): 11197.8 | learning rate 8.493E-06 | lm loss 4.326191E+00 | loss scale 0.1 |
iteration 1200/ 3000 | elapsed time per iteration (ms): 11088.1 | learning rate 8.167E-06 | lm loss 3.978439E+00 | loss scale 0.2 |
iteration 1300/ 3000 | elapsed time per iteration (ms): 13006.3 | learning rate 7.847E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
iteration 1400/ 3000 | elapsed time per iteration (ms): 11047.3 | learning rate 7.530E-06 | lm loss 3.469355E+00 | loss scale 0.1 |
iteration 1500/ 3000 | elapsed time per iteration (ms): 11117.4 | learning rate 7.203E-06 | lm loss 3.124178E+00 | loss scale 0.2 |
iteration 1600/ 3000 | elapsed time per iteration (ms): 13136.5 | learning rate 6.887E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
iteration 1700/ 3000 | elapsed time per iteration (ms): 11076.4 | learning rate 6.567E-06 | lm loss 2.977889E+00 | loss scale 0.1 |
iteration 1800/ 3000 | elapsed time per iteration (ms): 11101.4 | learning rate 6.243E-06 | lm loss 2.579736E+00 | loss scale 0.1 |
iteration 1900/ 3000 | elapsed time per iteration (ms): 13170.0 | learning rate 5.923E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
iteration 2000/ 3000 | elapsed time per iteration (ms): 11097.2 | learning rate 5.600E-06 | lm loss NAN | loss scale 0.2 |
Warning: NaN or Inf found in input tensor.
validation loss at iteration 2000 | LM loss: 2.411779E+00 | LM PPL: 1.115379E+01
iteration 2100/ 3000 | elapsed time per iteration (ms): 11225.6 | learning rate 5.283E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
iteration 2200/ 3000 | elapsed time per iteration (ms): 13293.0 | learning rate 4.963E-06 | lm loss 2.172083E+00 | loss scale 0.1 |
iteration 2300/ 3000 | elapsed time per iteration (ms): 11127.7 | learning rate 4.633E-06 | lm loss 2.118679E+00 | loss scale 0.1 |
iteration 2400/ 3000 | elapsed time per iteration (ms): 11013.7 | learning rate 4.320E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
iteration 2500/ 3000 | elapsed time per iteration (ms): 13359.6 | learning rate 3.993E-06 | lm loss 1.715521E+00 | loss scale 0.2 |
iteration 2600/ 3000 | elapsed time per iteration (ms): 11067.3 | learning rate 3.673E-06 | lm loss 1.768682E+00 | loss scale 0.1 |
iteration 2700/ 3000 | elapsed time per iteration (ms): 11043.2 | learning rate 3.357E-06 | lm loss 1.695708E+00 | loss scale 0.1 |
iteration 2800/ 3000 | elapsed time per iteration (ms): 13426.8 | learning rate 3.030E-06 | lm loss 1.485154E+00 | loss scale 0.2 |
iteration 2900/ 3000 | elapsed time per iteration (ms): 11044.7 | learning rate 2.713E-06 | lm loss 1.359995E+00 | loss scale 0.1 |
iteration 3000/ 3000 | elapsed time per iteration (ms): 11047.0 | learning rate 2.397E-06 | lm loss 1.357495E+00 | loss scale 0.1 |
validation loss at iteration 3000 | LM loss: 1.516164E+00 | LM PPL: 4.554720E+00
Log 2
Effective bsz = 6 * 2 * 8 = 96
"fp16": {
"enabled": true,
"loss_scale": 0,
"initial_scale_power": 20,
"loss_scale_window": 64,
"hysteresis": 3,
"min_loss_scale": 0.1
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-5,
"betas": [
0.9,
0.95
],
"eps": 1e-8,
"weight_decay": 1e-2
}
},
Pretrain GPT2 model
arguments:
transformer_xl ............... True
pretrained_bert .............. False
attention_dropout ............ 0.1
num_attention_heads .......... 32
hidden_size .................. 2560
intermediate_size ............ None
num_layers ................... 32
layernorm_epsilon ............ 1e-05
hidden_dropout ............... 0.1
max_position_embeddings ...... 512
vocab_size ................... 50048
deep_init .................... False
make_vocab_size_divisible_by . 128
cpu_optimizer ................ False
cpu_torch_adam ............... False
fp16 ......................... True
fp32_embedding ............... False
fp32_layernorm ............... False
fp32_tokentypes .............. False
fp32_allreduce ............... False
hysteresis ................... 2
loss_scale ................... None
loss_scale_window ............ 1000
min_scale .................... 1
experiment_name .............. txl-2.9b11-17-15-05
batch_size ................... 2
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing True
clip_grad .................... 1.0
train_iters .................. 3000
log_interval ................. 100
exit_interval ................ None
summary_dir ..................
seed ......................... 1234
reset_position_ids ........... False
reset_attention_mask ......... False
lr_decay_iters ............... 3000
lr_decay_style ............... linear
lr_decay_ratio ............... 0.5
lr ........................... 1e-05
warmup ....................... 0.2
save ......................... ./checkpoints/txl-2.9b11-17-15-05
save_interval ................ 300
no_save_optim ................ False
no_save_rng .................. False
load ......................... pretrained_models
no_load_optim ................ False
no_load_lr_scheduler ......... True
no_load_rng .................. False
finetune ..................... True
resume_dataloader ............ True
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. None
eval_iters ................... 100
eval_interval ................ 1000
eval_seq_length .............. None
eval_max_preds_per_seq ....... None
overlapping_eval ............. 32
cloze_eval ................... False
eval_hf ...................... False
load_openai .................. False
temperature .................. 1.0
top_p ........................ 0.0
top_k ........................ 0
out_seq_length ............... 256
hierarchical ................. False
model_parallel_size .......... 1
shuffle ...................... False
train_data ................... ['/data/frank/data/transformer-xl/前文问答问答问题回答.jsonl']
xl_dataset ................... True
use_npy_data_loader .......... False
train_data_path ..............
val_data_path ................
test_data_path ............... None
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 949,50,1
test_data .................... None
lazy_loader .................. True
loose_json ................... False
presplit_sentences ........... False
num_workers .................. 2
tokenizer_model_type ......... bert-large-uncased
tokenizer_path ............... tokenizer.model
tokenizer_type ............... ChineseSPTokenizer
not_pre_tokenize ............. False
cache_dir .................... None
use_tfrecords ................ False
seq_length ................... 512
mem_length ................... 256
max_preds_per_seq ............ None
sample_one_document .......... False
deepspeed .................... True
deepspeed_config ............. /data/frank/projects/Chinese-Transformer-XL/scripts/ds_config_2.9B_finetune.json
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 6
dynamic_loss_scale ........... True
gradient_accumulation_steps .. 8
persist_state ................ 0
lazy ......................... False
transpose .................... False
data_set_type ................ GPT2
samples_per_shard ............ 100
do_train ..................... 1
do_valid ..................... 1
do_test ...................... 1
eod_token .................... 50000
iteration .................... 0
iteration .................... 0
iteration 100/ 3000 | elapsed time per iteration (ms): 19744.9 | learning rate 1.350E-06 | lm loss 8.783203E+00 | loss scale 8.0 |
after 100 iterations memory (MB) | allocated: 6181.9599609375 | max allocated: 8411.11572265625 | cached: 19192.0 | max cached: 19192.0
iteration 200/ 3000 | elapsed time per iteration (ms): 19755.2 | learning rate 2.783E-06 | lm loss 7.892145E+00 | loss scale 0.1 |
I stopped this run early because it looked like it was about to hit NaN soon...
A tiny fraction of my training data (about 1 in 500) has an empty answer ("text": ""). After removing those samples and slightly increasing the warmup (or alternatively increasing the batch size / seq-length, lowering the learning rate, etc.), training behaves normally.
I'm not sure whether this is an issue with the model itself or a bug in the training code, but it is quite sensitive to abnormal data.
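For reference, the cleanup amounts to roughly the following (a minimal sketch; the input/output file names are placeholders):

import json

# Drop samples whose answer ("text") is empty or whitespace-only.
with open("train_raw.jsonl", encoding="utf-8") as fin, \
     open("train_clean.jsonl", "w", encoding="utf-8") as fout:
    kept = dropped = 0
    for line in fin:
        line = line.strip()
        if not line:
            continue
        sample = json.loads(line)
        if not sample.get("text", "").strip():
            dropped += 1
            continue
        fout.write(json.dumps(sample, ensure_ascii=False) + "\n")
        kept += 1

print(f"kept {kept} samples, dropped {dropped} empty-answer samples")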
Hello, can you provide your device info? I got an OOM error with 4 * K80.
@yinxiangshi Titan RTX with 24 GB memory. (I have to set "cpu_offload": true to finetune the model.)