OptimalScale/LMFlow

How to verify the training is successful

oldunclez opened this issue · 3 comments

I ran a LoRA fine-tuning job on a very small dataset with the following command:

./scripts/run_finetune_with_lora.sh --model_name_or_path /tmp/chatglm2-6b --dataset_path data/mydata/ --output_lora_path output_models/finetuned_chatglm2-6b

The dataset contains a single instance:

{
  "type": "text_only",
  "instances": [
    {
      "text": "好学学园图书管理办法\\n\\n第一章 总则\\n\\n第一条 目的\\n\\n为加强好学学园图书角管理工作,提升员工整体业务能力以及文化素养,规范图书管理,提高图书使用效率,营造公司良好的学 习氛围,特制订本管理办法。\\n\\n第二条 适用范围\\n\\n本管理办法适用于xxxx全体员工。第二章 图书管理\\n\\n第三条 图书借阅\\n\\n1. 图书借阅时间为每周二、周四的上午9:00-11:00。\\n\\n2.  每人每次借阅数量不得超过1本,借阅期限为30天(逢休息或节假日顺延至下一工作日)。\\n\\n3. 员工借阅前查看《图书一览表》,确认目标图书名称编号,到图书管理员处填写《借阅登记表》,方可领取书籍。\\n\\n4. 员工借阅到期应及时归还,图书管理员应进行验收,在《借阅登记表》上登记归还时间,并进行签字;不可办理续借,如未读完书籍,需重新借阅。\\n\\n5. 若逾期未归还暂停其借阅资格一个月(从书本还清日开始计算)。\\n\\n6. 公司员工在解雇、辞职时,须将所借图书归还,否则不予办理相关手续。7. 员工所借图书,如遇盘点或因工作需要须收回时,借书人不得拒绝。\\n\\n第四条 损坏赔偿\\n\\n1.员工 必须爱护书籍,所借书籍不得污损、撕剪、圈点、批注、折角和遗失等。如有上述情况,应以相同版本的新书赔偿,不得以其他图书抵充。如确实无法购到新书,不论新旧图书以原价两倍赔偿。\\n\\n2. 成套书籍 遗失其中一册(如上、下册书遗失其中一册,多卷书遗失其中一卷)应按全套书原价赔偿。(余下各卷不给赔偿人)。\\n\\n3. 读者所借书籍如有遗失,应在15天内办理赔偿。如经过催促,在一个月内未办理赔偿 手续者,三倍处罚。\\n\\n4. 员工借阅书籍时,当场检查,如发现污损等问题及时声明,否则归还时发现污损,概由借阅者负责。5. 归还书籍时管理员检查是否破损,如有及时处理赔款事宜。\\n\\n第五条 图书 日常管理\\n\\n1. 图书由人力资源部负责图书日常管理工作,并负责图书的购买、整理、盘点等事宜。\\n\\n2. 图书按照分类科目进行整理编号,将书名、出版社名称、册数、金额及其他有关资料详细登记,建立《图书一览表》。\\n\\n3. 对所保管的图书资料,做到防尘、防潮、防火、防霉、防虫、防鼠、防盗;对损坏的图书资料应及时修补,保证其完整性。\\n\\n4. 新购入的书籍,应及时更新《图书一览表》,妥善入柜。\\n\\n5. 每个月图书管理人要对图书借阅情况进行核实;每个季度进行一次图书盘点,对图书遗失、破损等情况进行统一报批处理。\\n\\n6. 鼓励员工捐赠闲置书籍,捐赠书籍归入图书角。\\n\\n"
    }
  ]
}

It took about 1 minute to finish, which seems too quick.
Does this mean the training was successful?
Why doesn't it show per-step log lines like these?

{'loss': 1.6098, 'learning_rate': 0.0007254578491847177, 'epoch': 0.82}
{'loss': 1.6215, 'learning_rate': 0.0007182852771003682, 'epoch': 0.853}
.....

The output of the command is:

[2024-01-24 10:51:21,508] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-24 10:51:22,976] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-01-24 10:51:22,976] [INFO] [runner.py:555:main] cmd = /usr/local/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --trust_remote_code 1 --model_name_or_path /tmp/chatglm2-6b --dataset_path data/mydata/ --output_dir output_models/finetuned_chatglm2-6b --overwrite_output_dir --num_train_epochs 3 --learning_rate 1e-4 --block_size 512 --per_device_train_batch_size 1 --use_lora 1 --lora_r 8 --save_aggregated_lora 0 --deepspeed configs/ds_config_zero2.json --fp16 --run_name finetune_with_lora --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-01-24 10:51:25,628] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-24 10:51:26,962] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-01-24 10:51:26,962] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-01-24 10:51:26,962] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-01-24 10:51:26,962] [INFO] [launch.py:163:main] dist_world_size=1
[2024-01-24 10:51:26,962] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-01-24 10:51:30,173] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-01-24 10:51:32,518] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-01-24 10:51:32,518] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-01-24 10:51:32,518] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
01/24/2024 10:51:33 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
/usr/local/lib/python3.10/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:28<00:00,  4.07s/it]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.473518133163452 seconds
Rank: 0 partition count [1] and sizes[(1949696, False)]
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
{'train_runtime': 4.9797, 'train_samples_per_second': 0.602, 'train_steps_per_second': 0.602, 'train_loss': 2.7109375, 'epoch': 3.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.25s/it]
***** train metrics *****
  epoch                    =        3.0
  train_loss               =     2.7109
  train_runtime            = 0:00:04.97
  train_samples            =          1
  train_samples_per_second =      0.602
  train_steps_per_second   =      0.602
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb:                    train/epoch ▁
wandb:              train/global_step ▁
wandb:               train/total_flos ▁
wandb:               train/train_loss ▁
wandb:            train/train_runtime ▁
wandb: train/train_samples_per_second ▁
wandb:   train/train_steps_per_second ▁
wandb:
wandb: Run summary:
wandb:                    train/epoch 3.0
wandb:              train/global_step 3
wandb:               train/total_flos 55104262635520.0
wandb:               train/train_loss 2.71094
wandb:            train/train_runtime 4.9797
wandb: train/train_samples_per_second 0.602
wandb:   train/train_steps_per_second 0.602
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /app/LMFlow/wandb/offline-run-20240124_105210-tdonnymg
wandb: Find logs at: ./wandb/offline-run-20240124_105210-tdonnymg/logs
[2024-01-24 10:52:17,034] [INFO] [launch.py:347:main] Process 1450 exits successfully.

The LoRA files have been created:

root@590a806f53bb:/app/LMFlow# ls -al output_models/finetuned_chatglm2-6b/
total 4876
drwxr-xr-x 5 root root    4096 Jan 24 10:52 .
drwxr-xr-x 3 root root    4096 Jan 24 10:52 ..
-rw-r--r-- 1 root root    1104 Jan 24 10:52 README.md
-rw-r--r-- 1 root root     477 Jan 24 10:52 adapter_config.json
-rw-r--r-- 1 root root 3909821 Jan 24 10:52 adapter_model.bin
-rw-r--r-- 1 root root     178 Jan 24 10:52 all_results.json
drwxr-xr-x 3 root root    4096 Jan 24 10:52 checkpoint-1
drwxr-xr-x 3 root root    4096 Jan 24 10:52 checkpoint-2
drwxr-xr-x 3 root root    4096 Jan 24 10:52 checkpoint-3
-rw-r--r-- 1 root root       3 Jan 24 10:52 special_tokens_map.json
-rwxr-xr-x 1 root root   10318 Jan 24 10:52 tokenization_chatglm.py
-rw-r--r-- 1 root root 1018370 Jan 24 10:52 tokenizer.model
-rw-r--r-- 1 root root     325 Jan 24 10:52 tokenizer_config.json
-rw-r--r-- 1 root root     178 Jan 24 10:52 train_results.json
-rw-r--r-- 1 root root     636 Jan 24 10:52 trainer_state.json

Maybe the number of epochs and the amount of training data are too small. If I increase the number of epochs to 100, it shows something like:

{'loss': 1.6098, 'learning_rate': 0.0007254578491847177, 'epoch': 0.82}
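
This observation is consistent with the `--logging_steps 20` flag visible in the launch command: the run above had only 3 optimizer steps in total (1 sample, batch size 1, 3 epochs), so it ended before the first logging step. The sketch below is a simplified estimate of that arithmetic, not LMFlow's or the Trainer's actual code. The function name `logged_steps` and its parameters are hypothetical, and it assumes a loss line is emitted at every multiple of `logging_steps`.

```python
import math


def logged_steps(num_samples, per_device_batch_size, num_epochs,
                 logging_steps, num_devices=1, grad_accum_steps=1):
    """Estimate total optimizer steps and the steps at which a loss line appears.

    Simplified assumption: one loss line is emitted at every multiple of
    `logging_steps`, and nothing is logged at step 1.
    """
    # Batches seen in one epoch across all devices.
    batches_per_epoch = math.ceil(
        num_samples / (per_device_batch_size * num_devices)
    )
    # Optimizer updates per epoch (at least one, even for tiny datasets).
    steps_per_epoch = max(1, batches_per_epoch // grad_accum_steps)
    total_steps = steps_per_epoch * num_epochs
    logged = list(range(logging_steps, total_steps + 1, logging_steps))
    return total_steps, logged


# Values taken from the command and metrics in this thread.
print(logged_steps(1, 1, 3, 20))    # (3, [])
print(logged_steps(1, 1, 100, 20))  # (100, [20, 40, 60, 80, 100])
```

Under these assumptions, lowering `--logging_steps` (for example to 1) would make loss lines appear even in a 3-step run.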

Thanks for your interest in LMFlow! If the training dataset is very small, the training should finish very quickly, especially when you are using a parameter-efficient fine-tuning method like LoRA. So the results look normal to me 😄
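
For a check that goes beyond the console output, one option is to inspect the `trainer_state.json` file listed in the output directory. The sketch below assumes the file follows the usual Hugging Face Trainer layout, with a top-level `global_step` and a `log_history` list whose entries carry either a per-step `loss` or a final `train_loss`. The helper names `load_trainer_state` and `summarize_trainer_state` are hypothetical. The example dictionary is modeled on the metrics printed in this thread and is not the actual file content.

```python
import json
from pathlib import Path


def load_trainer_state(output_dir):
    """Read trainer_state.json from a training output directory."""
    path = Path(output_dir) / "trainer_state.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def summarize_trainer_state(state):
    """Collect per-step losses and the final average loss from a parsed state."""
    history = state.get("log_history", [])
    # Per-step entries carry "loss"; the end-of-training entry carries "train_loss".
    step_losses = [(e.get("step"), e["loss"]) for e in history if "loss" in e]
    final_loss = next(
        (e["train_loss"] for e in reversed(history) if "train_loss" in e),
        None,
    )
    return {
        "global_step": state.get("global_step"),
        "step_losses": step_losses,
        "final_train_loss": final_loss,
    }


# Illustrative input only, built from the metrics shown in the log above.
example_state = {
    "global_step": 3,
    "log_history": [{"epoch": 3.0, "step": 3, "train_loss": 2.7109375}],
}
print(summarize_trainer_state(example_state))
```

With a real run, `summarize_trainer_state(load_trainer_state("output_models/finetuned_chatglm2-6b"))` would give the same summary. An empty `step_losses` list only means no logging step was reached. A steadily decreasing list over a longer run is a more informative sign than the exit status alone.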