OptimalScale/LMFlow

[BUG] After finetuning, the model generates useless tokens. [After full fine-tuning, the model outputs consecutive <sep>]

cauwulixuan opened this issue · 1 comment

Describe the bug
I am using LMFlow to manage our private LLM, which is a llama-type model. I finished fine-tuning by running scripts/run_finetune.sh, then ran some inference with scripts/run_app.sh, but got a lot of <sep> in the output. Note that <sep> is one of the special tokens in our tokenizer.

I would like to know how LMFlow pads the datasets. After doing some research, I found that LMFlow simply concatenates all of the data and splits it into chunks of size block_size.

What did I miss? Or how can I dig into what is going on here? Thanks.


We use LMFlow to manage our private large model. After a first fine-tuning attempt, inference produces a long run of consecutive <sep>. Our analysis is that the most likely cause is that the loss computation during fine-tuning did not mask these special tokens.

However, when reading the code I could not find where the dataset is organized; it seems the whole dataset is simply loaded, converted to token_ids, and then split into chunks of block_size. I did not see any padding step.

So my question is: is a padding step needed? Or what else might be causing this problem?

Looking forward to your replies.
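For context, the chunking I am referring to looks roughly like the standard Hugging Face causal-LM preprocessing below. This is a sketch based on my reading, not LMFlow's exact code: all examples are tokenized, concatenated, and cut into fixed blocks of block_size tokens, with labels copied from input_ids and no padding or masking of special tokens such as <sep>.

```python
def group_texts(examples, block_size=256):
    """Concatenate tokenized examples and split them into fixed-size blocks."""
    # Concatenate every field (input_ids, attention_mask, ...) across examples.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so every chunk has exactly block_size tokens (no padding).
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Labels are a straight copy of the inputs; the one-token shift is expected
    # to happen inside the model's forward when the loss is computed.
    result["labels"] = result["input_ids"].copy()
    return result
```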

To Reproduce

  1. finetune script
cmd = /opt/conda/envs/lmflow/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path /mnt/models/private/ --dataset_path /home/github/LMFlow/data/math_data/text_only --output_dir /home/github/LMFlow/output_models/finetune_math_text_only --overwrite_output_dir --num_train_epochs 4 --learning_rate 2e-5 --block_size 256 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
  2. loss-related log
Parameter Offload: Total persistent parameters: 223232 in 121 params
{'loss': 2.4214, 'learning_rate': 1.9819168173598556e-05, 'epoch': 0.04}
{'loss': 0.0202, 'learning_rate': 1.9638336347197107e-05, 'epoch': 0.07}
{'loss': 0.0017, 'learning_rate': 1.9457504520795662e-05, 'epoch': 0.11}
{'loss': 0.001, 'learning_rate': 1.9276672694394213e-05, 'epoch': 0.14}
{'loss': 0.0013, 'learning_rate': 1.9095840867992768e-05, 'epoch': 0.18}
{'loss': 0.0003, 'learning_rate': 1.8915009041591322e-05, 'epoch': 0.22}
{'loss': 0.0003, 'learning_rate': 1.8734177215189874e-05, 'epoch': 0.25}
{'loss': 0.0005, 'learning_rate': 1.8553345388788428e-05, 'epoch': 0.29}
{'loss': 0.0001, 'learning_rate': 1.8372513562386983e-05, 'epoch': 0.33}
{'loss': 0.0001, 'learning_rate': 1.8191681735985537e-05, 'epoch': 0.36}
{'loss': 0.0003, 'learning_rate': 1.801084990958409e-05, 'epoch': 0.4}
{'loss': 0.0004, 'learning_rate': 1.783001808318264e-05, 'epoch': 0.43}
{'loss': 0.0009, 'learning_rate': 1.7649186256781194e-05, 'epoch': 0.47}
{'loss': 0.0005, 'learning_rate': 1.746835443037975e-05, 'epoch': 0.51}
{'loss': 0.0002, 'learning_rate': 1.72875226039783e-05, 'epoch': 0.54}
{'loss': 0.0003, 'learning_rate': 1.7106690777576855e-05, 'epoch': 0.58}
{'loss': 0.0002, 'learning_rate': 1.692585895117541e-05, 'epoch': 0.61}
{'loss': 0.0002, 'learning_rate': 1.6745027124773964e-05, 'epoch': 0.65}
{'loss': 0.0, 'learning_rate': 1.6564195298372515e-05, 'epoch': 0.69}
{'loss': 0.0004, 'learning_rate': 1.6383363471971066e-05, 'epoch': 0.72}
{'loss': 0.0, 'learning_rate': 1.620253164556962e-05, 'epoch': 0.76}
{'loss': 0.0001, 'learning_rate': 1.6021699819168176e-05, 'epoch': 0.8}
{'loss': 0.0, 'learning_rate': 1.584086799276673e-05, 'epoch': 0.83}
{'loss': 0.0002, 'learning_rate': 1.566003616636528e-05, 'epoch': 0.87}
{'loss': 0.0, 'learning_rate': 1.5479204339963836e-05, 'epoch': 0.9}
{'loss': 0.0004, 'learning_rate': 1.5298372513562387e-05, 'epoch': 0.94}
{'loss': 0.0005, 'learning_rate': 1.5117540687160942e-05, 'epoch': 0.98}
{'loss': 0.0007, 'learning_rate': 1.4936708860759495e-05, 'epoch': 1.01}
{'loss': 0.0001, 'learning_rate': 1.4755877034358048e-05, 'epoch': 1.05}
{'loss': 0.0013, 'learning_rate': 1.4575045207956602e-05, 'epoch': 1.08}
{'loss': 0.0003, 'learning_rate': 1.4394213381555155e-05, 'epoch': 1.12}
{'loss': 0.0006, 'learning_rate': 1.421338155515371e-05, 'epoch': 1.16}
{'loss': 0.0004, 'learning_rate': 1.403254972875226e-05, 'epoch': 1.19}
{'loss': 0.0001, 'learning_rate': 1.3851717902350814e-05, 'epoch': 1.23}
{'loss': 0.0013, 'learning_rate': 1.3670886075949368e-05, 'epoch': 1.27}
{'loss': 0.0002, 'learning_rate': 1.3490054249547921e-05, 'epoch': 1.3}
{'loss': 0.0001, 'learning_rate': 1.3309222423146476e-05, 'epoch': 1.34}
{'loss': 0.0, 'learning_rate': 1.3128390596745029e-05, 'epoch': 1.37}
{'loss': 0.0, 'learning_rate': 1.2947558770343582e-05, 'epoch': 1.41}
{'loss': 0.0, 'learning_rate': 1.2766726943942136e-05, 'epoch': 1.45}
{'loss': 0.0, 'learning_rate': 1.2585895117540687e-05, 'epoch': 1.48}
{'loss': 0.0, 'learning_rate': 1.240506329113924e-05, 'epoch': 1.52}
{'loss': 0.0003, 'learning_rate': 1.2224231464737795e-05, 'epoch': 1.56}
{'loss': 0.0, 'learning_rate': 1.2043399638336348e-05, 'epoch': 1.59}
{'loss': 0.0, 'learning_rate': 1.1862567811934902e-05, 'epoch': 1.63}
{'loss': 0.0, 'learning_rate': 1.1681735985533455e-05, 'epoch': 1.66}
{'loss': 0.0002, 'learning_rate': 1.150090415913201e-05, 'epoch': 1.7}
{'loss': 0.0, 'learning_rate': 1.1320072332730561e-05, 'epoch': 1.74}
{'loss': 0.0, 'learning_rate': 1.1139240506329114e-05, 'epoch': 1.77}
{'loss': 0.0, 'learning_rate': 1.0958408679927669e-05, 'epoch': 1.81}
{'loss': 0.0, 'learning_rate': 1.0777576853526221e-05, 'epoch': 1.84}
{'loss': 0.0, 'learning_rate': 1.0596745027124774e-05, 'epoch': 1.88}
{'loss': 0.0, 'learning_rate': 1.0415913200723329e-05, 'epoch': 1.92}
{'loss': 0.0001, 'learning_rate': 1.0235081374321882e-05, 'epoch': 1.95}
{'loss': 0.0, 'learning_rate': 1.0054249547920433e-05, 'epoch': 1.99}
{'loss': 0.0, 'learning_rate': 9.87341772151899e-06, 'epoch': 2.03}
{'loss': 0.0, 'learning_rate': 9.69258589511754e-06, 'epoch': 2.06}
{'loss': 0.0, 'learning_rate': 9.511754068716095e-06, 'epoch': 2.1}
{'loss': 0.0, 'learning_rate': 9.330922242314648e-06, 'epoch': 2.13}
{'loss': 0.0, 'learning_rate': 9.150090415913203e-06, 'epoch': 2.17}
{'loss': 0.003, 'learning_rate': 8.969258589511754e-06, 'epoch': 2.21}
{'loss': 0.0, 'learning_rate': 8.788426763110308e-06, 'epoch': 2.24}
{'loss': 0.0, 'learning_rate': 8.607594936708861e-06, 'epoch': 2.28}
{'loss': 0.0, 'learning_rate': 8.426763110307414e-06, 'epoch': 2.31}
{'loss': 0.0, 'learning_rate': 8.245931283905967e-06, 'epoch': 2.35}
{'loss': 0.0, 'learning_rate': 8.065099457504522e-06, 'epoch': 2.39}
{'loss': 0.0, 'learning_rate': 7.884267631103075e-06, 'epoch': 2.42}
{'loss': 0.0, 'learning_rate': 7.703435804701628e-06, 'epoch': 2.46}
{'loss': 0.0, 'learning_rate': 7.522603978300181e-06, 'epoch': 2.5}
{'loss': 0.0002, 'learning_rate': 7.341772151898735e-06, 'epoch': 2.53}
{'loss': 0.0, 'learning_rate': 7.160940325497288e-06, 'epoch': 2.57}
{'loss': 0.0, 'learning_rate': 6.980108499095841e-06, 'epoch': 2.6}
{'loss': 0.0, 'learning_rate': 6.799276672694395e-06, 'epoch': 2.64}
{'loss': 0.0, 'learning_rate': 6.618444846292948e-06, 'epoch': 2.68}
{'loss': 0.0, 'learning_rate': 6.437613019891501e-06, 'epoch': 2.71}
{'loss': 0.0, 'learning_rate': 6.256781193490055e-06, 'epoch': 2.75}
{'loss': 0.0, 'learning_rate': 6.075949367088608e-06, 'epoch': 2.78}
{'loss': 0.0, 'learning_rate': 5.895117540687162e-06, 'epoch': 2.82}
{'loss': 0.0, 'learning_rate': 5.7142857142857145e-06, 'epoch': 2.86}
{'loss': 0.0, 'learning_rate': 5.533453887884268e-06, 'epoch': 2.89}
{'loss': 0.0, 'learning_rate': 5.352622061482822e-06, 'epoch': 2.93}
{'loss': 0.0, 'learning_rate': 5.171790235081374e-06, 'epoch': 2.97}
{'loss': 0.0, 'learning_rate': 4.990958408679928e-06, 'epoch': 3.0}
{'loss': 0.0, 'learning_rate': 4.8101265822784815e-06, 'epoch': 3.04}
{'loss': 0.0, 'learning_rate': 4.6292947558770344e-06, 'epoch': 3.07}
{'loss': 0.0, 'learning_rate': 4.448462929475588e-06, 'epoch': 3.11}
{'loss': 0.0, 'learning_rate': 4.267631103074141e-06, 'epoch': 3.15}
{'loss': 0.0, 'learning_rate': 4.086799276672695e-06, 'epoch': 3.18}
{'loss': 0.0, 'learning_rate': 3.905967450271248e-06, 'epoch': 3.22}
{'loss': 0.0, 'learning_rate': 3.7251356238698015e-06, 'epoch': 3.25}
{'loss': 0.0, 'learning_rate': 3.544303797468355e-06, 'epoch': 3.29}
{'loss': 0.0, 'learning_rate': 3.3634719710669077e-06, 'epoch': 3.33}
{'loss': 0.0, 'learning_rate': 3.1826401446654614e-06, 'epoch': 3.36}
{'loss': 0.0, 'learning_rate': 3.0018083182640143e-06, 'epoch': 3.4}
{'loss': 0.0, 'learning_rate': 2.820976491862568e-06, 'epoch': 3.44}
{'loss': 0.0, 'learning_rate': 2.6401446654611214e-06, 'epoch': 3.47}
{'loss': 0.0, 'learning_rate': 2.4593128390596747e-06, 'epoch': 3.51}
{'loss': 0.0, 'learning_rate': 2.278481012658228e-06, 'epoch': 3.54}
{'loss': 0.0, 'learning_rate': 2.0976491862567814e-06, 'epoch': 3.58}
{'loss': 0.0, 'learning_rate': 1.9168173598553347e-06, 'epoch': 3.62}
{'loss': 0.0, 'learning_rate': 1.735985533453888e-06, 'epoch': 3.65}
{'loss': 0.0, 'learning_rate': 1.5551537070524413e-06, 'epoch': 3.69}
{'loss': 0.0, 'learning_rate': 1.3743218806509947e-06, 'epoch': 3.73}
{'loss': 0.0, 'learning_rate': 1.193490054249548e-06, 'epoch': 3.76}
{'loss': 0.0, 'learning_rate': 1.0126582278481013e-06, 'epoch': 3.8}
{'loss': 0.0, 'learning_rate': 8.318264014466547e-07, 'epoch': 3.83}
{'loss': 0.0, 'learning_rate': 6.509945750452081e-07, 'epoch': 3.87}
{'loss': 0.0, 'learning_rate': 4.7016274864376133e-07, 'epoch': 3.91}
{'loss': 0.0, 'learning_rate': 2.8933092224231465e-07, 'epoch': 3.94}
{'loss': 0.0001, 'learning_rate': 1.08499095840868e-07, 'epoch': 3.98}
{'train_runtime': 6757.3322, 'train_samples_per_second': 2.619, 'train_steps_per_second': 0.327, 'train_loss': 0.022251758879417122, 'epoch': 4.0}
***** train metrics *****
  epoch                    =        4.0
  train_loss               =     0.0223
  train_runtime            = 1:52:37.33
  train_samples            =       4424
  train_samples_per_second =      2.619
  train_steps_per_second   =      0.327
  3. training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 8
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 4.0

### Training results



### Framework versions

- Transformers 4.33.1
- Pytorch 2.0.1+cu117
- Datasets 2.10.1
- Tokenizers 0.13.3
  4. math dataset info
{
    "type": "text_only",
    "instances": [
        {
            "text": "甲数除以乙数的商是1.5,如果甲数增加20,则甲数是乙的4倍。问:原来甲数?<sep>答:20/(4-1.5)*1.5<eod>"
        },
        {
            "text": "客车和货车分别从A、B两站同时相向开出,5小时后相遇。相遇后,两车仍按原速度前进,当它们相距196千米时,货车行了全程的80%,客车已行的路程与未行的路程比是3除以2。求A。问:B两站间的路程?<sep>答:196/(80%+((3)/(3+2))-1)<eod>"
        },
        {
            "text": "图书角有书30本,第一天借出了(1/5),第二天又还回5本。问:现在图书角有多少本书?<sep>答:30*(1-(1/5))+5<eod>"
        },
        {
            "text": "甲、乙两车同时从相距230千米的两地相向而行,3小时后两车还相距35千米。已知甲车每小时行48千米。问:乙车每小时行多少千米?<sep>答:(230-35)/3-48<eod>"
        },
        {
            "text": "果园里有苹果树300棵,比桔树多20%。问:桔树有多少棵?<sep>答:300/(1+20%)<eod>"
        },
        {
            "text": "某班学生参加数学兴趣小组,其中,参加的男生是全班人数的20%,参加的女生是全班人数的(2/7)多2人,不参加的人数比全班人数的(3/5)少5人。问:全班有多少人?<sep>答:(5-2)/(20%+(2/7)+(3/5)-1)<eod>"
  5. inference script
#!/bin/bash

CUDA_VISIBLE_DEVICES=4 accelerate launch --config_file configs/accelerator_singlegpu_config.yaml service/debug.py \
    --model_name_or_path "output_models/finetune_math_text_only/" \
    --torch_dtype bfloat16 \
    --end_string "<eod>" \
    --max_new_tokens 200
  6. inference result with training data
inputs:  图书角有书30本,第一天借出了(1/5),第二天又还回5本。问:现在图书角有多少本书?<sep>
inputs after tokenizer:  {'input_ids': tensor([[29871, 38810, 31432, 30417, 31900, 29941, 29900, 30346, 30214, 40913,
         33128, 32585, 29898, 29896, 29914, 29945, 29897, 30214, 35950, 32004,
         31994, 30742, 29945, 30346, 30267, 31658, 30383, 32030, 38810, 31432,
         37561, 37057, 29973, 77187]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

response:
tensor([[29871, 38810, 31432, 30417, 31900, 29941, 29900, 30346, 30214, 40913,
         33128, 32585, 29898, 29896, 29914, 29945, 29897, 30214, 35950, 32004,
         31994, 30742, 29945, 30346, 30267, 31658, 30383, 32030, 38810, 31432,
         37561, 37057, 29973, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187,
         77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187, 77187]],
       device='cuda:0')

77187 is the token_id of <sep>
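For anyone reproducing this, the mapping can be checked directly against the saved tokenizer (a small sketch; the path is the fine-tuned output directory used in the inference script above):

```python
from transformers import AutoTokenizer

# Load the tokenizer that was saved alongside the fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained("output_models/finetune_math_text_only/")

# Confirm the id of the <sep> special token and decode the repeated id from the response.
print(tokenizer.convert_tokens_to_ids("<sep>"))  # expected: 77187
print(tokenizer.decode([77187]))                 # expected: <sep>
```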

Any suggestions would be appreciated.

I figured it out: the forward method of our private model does not perform the shift operation on logits and labels, which makes the loss calculation abnormal. This issue can be closed now.
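For reference, standard Hugging Face causal-LM models (e.g. LlamaForCausalLM) shift logits and labels by one position inside forward before computing cross-entropy. A minimal sketch of the logic our private model's forward was missing:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token loss: the prediction at position t is scored against token t+1."""
    # Drop the last logit and the first label so predictions and targets line up.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # positions set to -100 are excluded from the loss
    )
```

Without the shift, position t is scored against the token that is already visible at position t, so the training loss collapses toward zero (as in the logs above) while the model never actually learns next-token prediction.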

