THUDM/Chinese-Transformer-XL

finetune overflow, could you offer some suggestions?

frankang opened this issue · 5 comments

Hi, I am finetuning on roughly 10,000 QA samples of my own, but the results are not good (the lowest the loss reaches is lm loss 3.978439E+00 | loss scale 0.2 |), and testing with the resulting checkpoint is actually worse than not finetuning at all. Training never really stabilizes from the start, and after a while the loss turns into NaN. Could you offer some suggestions?
With 10,000 samples I set the training target to 3000 updates (about 10 epochs). The loss does keep decreasing, but "fp16 dynamic loss scale overflow!" keeps appearing, and around update 1300 training collapses and the loss becomes NaN.

My DeepSpeed config:

{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 4,
  "steps_per_print": 20,
  "gradient_clipping": 1,
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": false,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 50000000,
    "allgather_bucket_size": 500000000,
    "cpu_offload": true
  },
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 20, #这个是新添加的,因为初始scale的值很大,用不了那么大的
    "loss_scale_window": 64, #这个初始值貌似是1000,但是撑不了那么久就会再下降一格,所以暂时先改成64 了
    "hysteresis": 3, #从2改3了
    "min_loss_scale": 0.1#默认值1000,调0.1了,但还是崩
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-5,
      "betas": [
        0.9,
        0.95
      ],
      "eps": 1e-6, #原始值1e-8,会崩,看论文里微调bert用1e-6所以试试,还是崩
      "weight_decay": 1e-2
    }
  },
  "activation_checkpointing": {
    "partition_activations": false,
    "contiguous_memory_optimization": false
  },
  "wall_clock_breakdown": false
}
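
For context on how these fp16 knobs interact, here is a minimal standalone sketch of dynamic loss scaling roughly as DeepSpeed does it (start at 2**initial_scale_power, halve on overflow once the hysteresis budget is used up, double after loss_scale_window consecutive clean steps, never drop below min_loss_scale). The class and its details are illustrative only, not the library's actual code:

# Simplified illustration of DeepSpeed-style dynamic loss scaling; not the real implementation.
class ToyLossScaler:
    def __init__(self, initial_scale_power=20, loss_scale_window=64,
                 hysteresis=3, min_loss_scale=0.1, scale_factor=2.0):
        self.scale = 2.0 ** initial_scale_power   # 2**20 = 1,048,576
        self.window = loss_scale_window
        self.hysteresis = hysteresis
        self.cur_hysteresis = hysteresis
        self.min_scale = min_loss_scale
        self.factor = scale_factor
        self.good_steps = 0                        # consecutive overflow-free steps

    def update(self, overflow):
        if overflow:
            # tolerate `hysteresis` overflows in a row before actually shrinking the scale
            if self.cur_hysteresis <= 1:
                self.scale = max(self.scale / self.factor, self.min_scale)
            else:
                self.cur_hysteresis -= 1
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.window == 0:
                self.cur_hysteresis = self.hysteresis
                self.scale *= self.factor

# With overflows every few steps, the scale keeps halving toward min_loss_scale (0.1 here),
# which is the "loss scale 0.1" plateau visible in the logs below.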

Training script:

#! /bin/bash

# Change for multinode config

NUM_WORKERS=1
NUM_GPUS_PER_WORKER=1
MP_SIZE=1

script_path=$(realpath $0)
script_dir=$(dirname $script_path)

OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2"
HOST_FILE_PATH="/root/code/config/hostfile"


config_json="$script_dir/ds_config_2.9B_finetune.json"
gpt_options=" \
       --finetune \
       --experiment-name txl-2.9b \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 32 \
       --hidden-size 2560 \
       --num-attention-heads 32 \
       --seq-length 512 \
       --max-position-embeddings 512 \
       --mem-length 256 \
       --load ${1} \
       --no-load-optim \
       --save ./checkpoints \
       --save-interval 300 \
       --train-iters 3000 \ # with the current batch-size settings my 10,000 samples take about 300 updates per epoch, so this is roughly 10 epochs
       --resume-dataloader \
       --train-data ${2} \
       --xl-dataset \
       --lazy-loader \
       --tokenizer-type ChineseSPTokenizer \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr-decay-style linear \
       --lr-decay-ratio 0.5 \
       --lr-decay-iters 3000 \
       --no-load-lr-scheduler \ # without this line, printing lr_scheduler.num_iters shows the step already at 160000 right from the start, so the lr schedule cannot work properly; the --finetune option does not seem to handle this correctly
       --warmup 0.2 \ # 0.1 crashes even faster, so I changed it to 0.2
       --checkpoint-activations \
       --deepspeed-activation-checkpointing \
       --transformer-xl \
       --fp16 \
"
gpt_options="${gpt_options}
               --deepspeed \
               --deepspeed_config ${config_json} \
"


#run_cmd="${OPTIONS_NCCL} deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --hostfile ${HOST_FILE_PATH} --include localhost:3 pretrain_gpt2.py ${gpt_options}"
run_cmd="${OPTIONS_NCCL} deepspeed  --include localhost:0,3,4,5 pretrain_gpt2.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}

set +x
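
For reference, the batch-size bookkeeping used throughout this issue can be written out explicitly. The numbers come from the config and script above; the helper itself is just an illustrative sketch, not part of the training code:

def training_plan(num_samples, num_gpus, micro_batch_per_gpu, grad_accum_steps, epochs):
    """Back-of-the-envelope schedule arithmetic for this issue (illustrative only)."""
    effective_batch = num_gpus * micro_batch_per_gpu * grad_accum_steps
    updates_per_epoch = num_samples // effective_batch
    total_updates = updates_per_epoch * epochs
    return effective_batch, updates_per_epoch, total_updates

# 4 GPUs, micro batch 2, grad accumulation 4 -> effective batch 32,
# ~312 updates per epoch, ~3120 updates for 10 epochs (train-iters 3000 above).
print(training_plan(10000, 4, 2, 4, 10))
# 6 GPUs, micro batch 2, grad accumulation 8 -> effective batch 96 (log 2 below).
print(training_plan(10000, 6, 2, 8, 10))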

An example from my training data:
{"prompt": "问题:你好,今天太晚就聊到这里,保重身体吧。 回答:", "text": "晚安!"}

Should the total batch size be larger? For example, increase it to 64, either by adding GPUs or by increasing gradient_accumulation_steps.

@duzx16 Hi, I tried a larger accumulation step (from 4 to 8) and more GPUs (from 4 to 6), and it collapsed even faster. The two logs below are for effective batch sizes of 32 and 96. (These two runs did load the optimizer states from the checkpoint, although training was also unstable before without loading them.)

Log 1

Effective batch size = 4 * 2 * 4 = 32
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 20,
    "loss_scale_window": 64,
    "hysteresis": 3,
    "min_loss_scale": 0.1
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-5,
      "betas": [
        0.9,
        0.95
      ],
      "eps": 1e-6,
      "weight_decay": 1e-2
    }
  },

Pretrain GPT2 model
arguments:
  transformer_xl ............... True
  pretrained_bert .............. False
  attention_dropout ............ 0.1
  num_attention_heads .......... 32
  hidden_size .................. 2560
  intermediate_size ............ None
  num_layers ................... 32
  layernorm_epsilon ............ 1e-05
  hidden_dropout ............... 0.1
  max_position_embeddings ...... 512
  vocab_size ................... 50048
  deep_init .................... False
  make_vocab_size_divisible_by . 128
  cpu_optimizer ................ False
  cpu_torch_adam ............... False
  fp16 ......................... True
  fp32_embedding ............... False
  fp32_layernorm ............... False
  fp32_tokentypes .............. False
  fp32_allreduce ............... False
  hysteresis ................... 2
  loss_scale ................... None
  loss_scale_window ............ 1000
  min_scale .................... 1
  experiment_name .............. txl-2.9b11-08-21-16
  batch_size ................... 2
  weight_decay ................. 0.01
  checkpoint_activations ....... True
  checkpoint_num_layers ........ 1
  deepspeed_activation_checkpointing  True
  clip_grad .................... 1.0
  train_iters .................. 3000
  log_interval ................. 100
  exit_interval ................ None
  summary_dir ..................
  seed ......................... 1234
  reset_position_ids ........... False
  reset_attention_mask ......... False
  lr_decay_iters ............... 3000
  lr_decay_style ............... linear
  lr_decay_ratio ............... 0.5
  lr ........................... 1e-05
  warmup ....................... 0.2
  save ......................... ./checkpoints/txl-2.9b11-08-21-16
  save_interval ................ 300
  no_save_optim ................ False
  no_save_rng .................. False
  load ......................... pretrained_models
  no_load_optim ................ False
  no_load_lr_scheduler ......... True
  no_load_rng .................. False
  finetune ..................... True
  resume_dataloader ............ True
  distributed_backend .......... nccl
  local_rank ................... 0
  eval_batch_size .............. None
  eval_iters ................... 100
  eval_interval ................ 1000
  eval_seq_length .............. None
  eval_max_preds_per_seq ....... None
  overlapping_eval ............. 32
  cloze_eval ................... False
  eval_hf ...................... False
  load_openai .................. False
  temperature .................. 1.0
  top_p ........................ 0.0
  top_k ........................ 0
  out_seq_length ............... 256
  hierarchical ................. False
  model_parallel_size .......... 1
  shuffle ...................... False
  train_data ................... ['/data/frank/data/transformer-xl/前文问答问答问题回答.jsonl']
  xl_dataset ................... True
  use_npy_data_loader .......... False
  train_data_path ..............
  val_data_path ................
  test_data_path ............... None
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. True
  loose_json ................... False
  presplit_sentences ........... False
  num_workers .................. 2
  tokenizer_model_type ......... bert-large-uncased
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... ChineseSPTokenizer
  not_pre_tokenize ............. False
  cache_dir .................... None
  use_tfrecords ................ False
  seq_length ................... 512
  mem_length ................... 256
  max_preds_per_seq ............ None
  sample_one_document .......... False
  deepspeed .................... True
  deepspeed_config ............. /data/frank/projects/Chinese-Transformer-XL/scripts/ds_config_2.9B_finetune.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ False
  cuda ......................... True
  rank ......................... 0
  world_size ................... 4
  dynamic_loss_scale ........... True
  gradient_accumulation_steps .. 4
  persist_state ................ 0
  lazy ......................... False
  transpose .................... False
  data_set_type ................ GPT2
  samples_per_shard ............ 100
  do_train ..................... 1
  do_valid ..................... 1
  do_test ...................... 1
  eod_token .................... 50000
  iteration .................... 0

 iteration      100/    3000 | elapsed time per iteration (ms): 11014.7 | learning rate 1.483E-06 | lm loss 8.847247E+00 | loss scale 2048.0 |
after 100 iterations memory (MB) | allocated: 6181.9599609375 | max allocated: 8412.134765625 | cached: 19180.0 | max cached: 19180.0
 iteration      200/    3000 | elapsed time per iteration (ms): 11074.8 | learning rate 3.117E-06 | lm loss 8.033987E+00 | loss scale 512.0 |
 iteration      300/    3000 | elapsed time per iteration (ms): 11001.6 | learning rate 4.717E-06 | lm loss 7.306797E+00 | loss scale 256.0 |
 iteration      400/    3000 | elapsed time per iteration (ms): 12879.0 | learning rate 6.300E-06 | lm loss 6.758929E+00 | loss scale 64.0 |
 iteration      500/    3000 | elapsed time per iteration (ms): 11092.2 | learning rate 7.917E-06 | lm loss 6.286667E+00 | loss scale 8.0 |
 iteration      600/    3000 | elapsed time per iteration (ms): 11052.7 | learning rate 9.517E-06 | lm loss 5.763325E+00 | loss scale 4.0 |
 iteration      700/    3000 | elapsed time per iteration (ms): 12932.3 | learning rate 9.773E-06 | lm loss NAN | loss scale 4.0 |
Warning: NaN or Inf found in input tensor.
 iteration      800/    3000 | elapsed time per iteration (ms): 11020.1 | learning rate 9.457E-06 | lm loss 5.193296E+00 | loss scale 0.1 |
 iteration      900/    3000 | elapsed time per iteration (ms): 11070.0 | learning rate 9.133E-06 | lm loss 4.792100E+00 | loss scale 0.1 |
 iteration     1000/    3000 | elapsed time per iteration (ms): 12955.2 | learning rate 8.807E-06 | lm loss 4.454510E+00 | loss scale 0.1 |
 validation loss at iteration 1000 | LM loss: 4.503653E+00 | LM PPL: 9.034661E+01
 iteration     1100/    3000 | elapsed time per iteration (ms): 11197.8 | learning rate 8.493E-06 | lm loss 4.326191E+00 | loss scale 0.1 |
 iteration     1200/    3000 | elapsed time per iteration (ms): 11088.1 | learning rate 8.167E-06 | lm loss 3.978439E+00 | loss scale 0.2 |
 iteration     1300/    3000 | elapsed time per iteration (ms): 13006.3 | learning rate 7.847E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
 iteration     1400/    3000 | elapsed time per iteration (ms): 11047.3 | learning rate 7.530E-06 | lm loss 3.469355E+00 | loss scale 0.1 |
 iteration     1500/    3000 | elapsed time per iteration (ms): 11117.4 | learning rate 7.203E-06 | lm loss 3.124178E+00 | loss scale 0.2 |
 iteration     1600/    3000 | elapsed time per iteration (ms): 13136.5 | learning rate 6.887E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
 iteration     1700/    3000 | elapsed time per iteration (ms): 11076.4 | learning rate 6.567E-06 | lm loss 2.977889E+00 | loss scale 0.1 |
 iteration     1800/    3000 | elapsed time per iteration (ms): 11101.4 | learning rate 6.243E-06 | lm loss 2.579736E+00 | loss scale 0.1 |
 iteration     1900/    3000 | elapsed time per iteration (ms): 13170.0 | learning rate 5.923E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
 iteration     2000/    3000 | elapsed time per iteration (ms): 11097.2 | learning rate 5.600E-06 | lm loss NAN | loss scale 0.2 |
Warning: NaN or Inf found in input tensor.
 validation loss at iteration 2000 | LM loss: 2.411779E+00 | LM PPL: 1.115379E+01
 iteration     2100/    3000 | elapsed time per iteration (ms): 11225.6 | learning rate 5.283E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
 iteration     2200/    3000 | elapsed time per iteration (ms): 13293.0 | learning rate 4.963E-06 | lm loss 2.172083E+00 | loss scale 0.1 |
 iteration     2300/    3000 | elapsed time per iteration (ms): 11127.7 | learning rate 4.633E-06 | lm loss 2.118679E+00 | loss scale 0.1 |
 iteration     2400/    3000 | elapsed time per iteration (ms): 11013.7 | learning rate 4.320E-06 | lm loss NAN | loss scale 0.1 |
Warning: NaN or Inf found in input tensor.
 iteration     2500/    3000 | elapsed time per iteration (ms): 13359.6 | learning rate 3.993E-06 | lm loss 1.715521E+00 | loss scale 0.2 |
 iteration     2600/    3000 | elapsed time per iteration (ms): 11067.3 | learning rate 3.673E-06 | lm loss 1.768682E+00 | loss scale 0.1 |
 iteration     2700/    3000 | elapsed time per iteration (ms): 11043.2 | learning rate 3.357E-06 | lm loss 1.695708E+00 | loss scale 0.1 |
 iteration     2800/    3000 | elapsed time per iteration (ms): 13426.8 | learning rate 3.030E-06 | lm loss 1.485154E+00 | loss scale 0.2 |
 iteration     2900/    3000 | elapsed time per iteration (ms): 11044.7 | learning rate 2.713E-06 | lm loss 1.359995E+00 | loss scale 0.1 |
 iteration     3000/    3000 | elapsed time per iteration (ms): 11047.0 | learning rate 2.397E-06 | lm loss 1.357495E+00 | loss scale 0.1 |
 validation loss at iteration 3000 | LM loss: 1.516164E+00 | LM PPL: 4.554720E+00

Log 2

Effective batch size = 6 * 2 * 8 = 96
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 20,
    "loss_scale_window": 64,
    "hysteresis": 3,
    "min_loss_scale": 0.1
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-5,
      "betas": [
        0.9,
        0.95
      ],
      "eps": 1e-8,
      "weight_decay": 1e-2
    }
  },
Pretrain GPT2 model
arguments:
  transformer_xl ............... True
  pretrained_bert .............. False
  attention_dropout ............ 0.1
  num_attention_heads .......... 32
  hidden_size .................. 2560
  intermediate_size ............ None
  num_layers ................... 32
  layernorm_epsilon ............ 1e-05
  hidden_dropout ............... 0.1
  max_position_embeddings ...... 512
  vocab_size ................... 50048
  deep_init .................... False
  make_vocab_size_divisible_by . 128
  cpu_optimizer ................ False
  cpu_torch_adam ............... False
  fp16 ......................... True
  fp32_embedding ............... False
  fp32_layernorm ............... False
  fp32_tokentypes .............. False
  fp32_allreduce ............... False
  hysteresis ................... 2
  loss_scale ................... None
  loss_scale_window ............ 1000
  min_scale .................... 1
  experiment_name .............. txl-2.9b11-17-15-05
  batch_size ................... 2
  weight_decay ................. 0.01
  checkpoint_activations ....... True
  checkpoint_num_layers ........ 1
  deepspeed_activation_checkpointing  True
  clip_grad .................... 1.0
  train_iters .................. 3000
  log_interval ................. 100
  exit_interval ................ None
  summary_dir ..................
  seed ......................... 1234
  reset_position_ids ........... False
  reset_attention_mask ......... False
  lr_decay_iters ............... 3000
  lr_decay_style ............... linear
  lr_decay_ratio ............... 0.5
  lr ........................... 1e-05
  warmup ....................... 0.2
  save ......................... ./checkpoints/txl-2.9b11-17-15-05
  save_interval ................ 300
  no_save_optim ................ False
  no_save_rng .................. False
  load ......................... pretrained_models
  no_load_optim ................ False
  no_load_lr_scheduler ......... True
  no_load_rng .................. False
  finetune ..................... True
  resume_dataloader ............ True
  distributed_backend .......... nccl
  local_rank ................... 0
  eval_batch_size .............. None
  eval_iters ................... 100
  eval_interval ................ 1000
  eval_seq_length .............. None
  eval_max_preds_per_seq ....... None
  overlapping_eval ............. 32
  cloze_eval ................... False
  eval_hf ...................... False
  load_openai .................. False
  temperature .................. 1.0
  top_p ........................ 0.0
  top_k ........................ 0
  out_seq_length ............... 256
  hierarchical ................. False
  model_parallel_size .......... 1
  shuffle ...................... False
  train_data ................... ['/data/frank/data/transformer-xl/前文问答问答问题回答.jsonl']
  xl_dataset ................... True
  use_npy_data_loader .......... False
  train_data_path ..............
  val_data_path ................
  test_data_path ............... None
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. True
  loose_json ................... False
  presplit_sentences ........... False
  num_workers .................. 2
  tokenizer_model_type ......... bert-large-uncased
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... ChineseSPTokenizer
  not_pre_tokenize ............. False
  cache_dir .................... None
  use_tfrecords ................ False
  seq_length ................... 512
  mem_length ................... 256
  max_preds_per_seq ............ None
  sample_one_document .......... False
  deepspeed .................... True
  deepspeed_config ............. /data/frank/projects/Chinese-Transformer-XL/scripts/ds_config_2.9B_finetune.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ False
  cuda ......................... True
  rank ......................... 0
  world_size ................... 6
  dynamic_loss_scale ........... True
  gradient_accumulation_steps .. 8
  persist_state ................ 0
  lazy ......................... False
  transpose .................... False
  data_set_type ................ GPT2
  samples_per_shard ............ 100
  do_train ..................... 1
  do_valid ..................... 1
  do_test ...................... 1
  eod_token .................... 50000
  iteration .................... 0

 iteration      100/    3000 | elapsed time per iteration (ms): 19744.9 | learning rate 1.350E-06 | lm loss 8.783203E+00 | loss scale 8.0 |
after 100 iterations memory (MB) | allocated: 6181.9599609375 | max allocated: 8411.11572265625 | cached: 19192.0 | max cached: 19192.0
 iteration      200/    3000 | elapsed time per iteration (ms): 19755.2 | learning rate 2.783E-06 | lm loss 7.892145E+00 | loss scale 0.1 |
I did not finish this run because it looked like it was about to go NaN anyway...

A very small fraction of my training data (roughly 1 in 500 samples) had an empty answer ("text": ""). After removing those samples and slightly increasing warmup (or, alternatively, increasing the batch size or seq-length, lowering the learning rate, etc.), training became stable.
I am not sure whether this is an issue with the model itself or a bug in the training code, but it seems quite sensitive to abnormal data.
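
In case it helps others, a minimal sketch of that cleanup step, assuming the prompt/text JSONL format shown earlier (the file names are placeholders):

import json

# Drop samples whose answer ("text") is empty before finetuning.
# File names are placeholders; the prompt/text keys match the data example above.
with open("train_raw.jsonl", encoding="utf-8") as fin, \
     open("train_clean.jsonl", "w", encoding="utf-8") as fout:
    kept = dropped = 0
    for line in fin:
        line = line.strip()
        if not line:
            continue
        sample = json.loads(line)
        if not sample.get("text", "").strip():
            dropped += 1  # empty answers destabilized training in my runs
            continue
        fout.write(json.dumps(sample, ensure_ascii=False) + "\n")
        kept += 1

print(f"kept {kept} samples, dropped {dropped} with empty answers")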

Hello, can you provide your device info? I got an OOM error with 4 * K80.

@yinxiangshi Titan RTX with 24 GB of memory. (I have to use "cpu_offload": true to finetune the model.)