[BUG] Finetuning did not output the eval results at all.
xigua314 opened this issue · 3 comments
I went through the Finetuning (Full) process with gpt2 and set --do_eval, --eval_dataset_path xxx.json, and --eval_steps, where xxx.json is a text2text dataset, but the finetuning process did not output any eval results. The number of finetune steps exceeded eval_steps. I don't know whether this is a bug or a problem with my settings. I look forward to your answer, thank you very much!
Here is my detailed script setting:
deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --conversation_template ${conversation_template} \
    --num_train_epochs 0.1 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 1024 \
    --per_device_train_batch_size 18 \
    --deepspeed configs/ds_config_zero3.json \
    --fp16 \
    --run_name finetune \
    --validation_split_percentage 20 \
    --eval_steps 20 \
    --logging_steps 20 \
    --do_train \
    --do_eval \
    --eval_dataset_path /h/s/x/l/eval \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
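For reference, here is a minimal sketch of how these eval flags are parsed into the HF Trainer's arguments (not LMFlow code; it assumes transformers is importable and uses a throwaway --output_dir):

python - <<'EOF'
from transformers import HfArgumentParser, TrainingArguments

# Parse only the eval-related flags that the script forwards to the Trainer.
(args,) = HfArgumentParser(TrainingArguments).parse_args_into_dataclasses(
    ["--output_dir", "/tmp/out", "--do_eval", "--eval_steps", "20"]
)
# The field is named eval_strategy in newer transformers releases and
# evaluation_strategy in older ones; it defaults to "no" in both.
print(getattr(args, "eval_strategy", None) or args.evaluation_strategy)
EOF

If this prints "no", it would explain the missing eval output: the Trainer only evaluates during training when the evaluation strategy is "steps" or "epoch", and --do_eval alone does not change it.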
Here is the last part of the log from my fine-tuning process:
05/08/2024 10:23:34 - WARNING - lmflow.pipeline.finetuner - finetuner_args.do_evalTrue
05/08/2024 10:23:34 - WARNING - lmflow.pipeline.finetuner - *************************************************************
[2024-05-08 10:23:38,301] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 1.64B parameters
05/08/2024 10:23:39 - WARNING - lmflow.pipeline.finetuner - in finetuner_args.do_eval ******************
05/08/2024 10:23:40 - WARNING - lmflow.pipeline.finetuner - ********************************************************************************
05/08/2024 10:23:40 - WARNING - lmflow.pipeline.finetuner - Number of eval samples: 256
ninja: no work to do.
Time to load cpu_adam op: 2.875669002532959 seconds
Parameter Offload: Total persistent parameters: 1001600 in 386 params
{'loss': 0.2962, 'grad_norm': 3.1811087335274615, 'learning_rate': 1.5714285714285715e-05, 'epoch': 0.02}
{'loss': 0.2991, 'grad_norm': 2.5313646679089503, 'learning_rate': 1.0952380952380955e-05, 'epoch': 0.05}
{'loss': 0.3155, 'grad_norm': 2.1892666086453594, 'learning_rate': 6.1904761904761914e-06, 'epoch': 0.07}
{'loss': 0.2972, 'grad_norm': 2.2230829824820884, 'learning_rate': 1.4285714285714286e-06, 'epoch': 0.1}
{'train_runtime': 451.4452, 'train_samples_per_second': 3.316, 'train_steps_per_second': 0.186, 'train_loss': 0.30037035260881695, 'epoch': 0.1}
***** train metrics *****
epoch = 0.101
total_flos = 4229GF
train_loss = 0.3004
train_runtime = 0:07:31.44
train_samples = 14972
train_samples_per_second = 3.316
train_steps_per_second = 0.186
Thanks for your interest in LMFlow! You may try --eval_strategy steps (https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L237) and --eval_steps 1 (https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L411) to see if it works. Hope this information can be helpful 😄
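With that change, the eval-related lines of the script would look like this (just a sketch of the changed portion; all other flags stay the same):

    --do_eval \
    --eval_dataset_path /h/s/x/l/eval \
    --eval_strategy steps \
    --eval_steps 1 \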
Thank you very much for your reply. I tried adding --eval_strategy steps to the script and changed --eval_steps to 1, but it reported an error: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--eval_strategy', 'steps'].
That's a bit strange. It would be nice if you could share your transformers version so we can check it for you. The argument is passed through to the Hugging Face Trainer, so it should normally be accepted.
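You can check it like this; note that --eval_strategy appears to have been introduced in transformers v4.41 as a rename of the older --evaluation_strategy, so on earlier versions --evaluation_strategy steps is the flag to try:

python -c 'import transformers; print(transformers.__version__)'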