xlang-ai/UnifiedSKG

The different results between eval mode and test mode.

eyuansu62 opened this issue · 14 comments

Why I get the different results between eval mode and test mode?
image

Hi,

Could you share the command you ran for this experiment?

The command is as follows:

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value  --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true

Is the highest eval score the same as the test score?

The ckpt I chosen is the highest eval score during the training steps.
As you can see, it is different from the test score.

Can you run the following command on the same machine (which means that the previous checkpoints are still there) and see if the results are different?

python -m torch.distributed.launch --nproc_per_node 4 --master_port 12 train.py --seed 2 --cfg Salesforce/T5_3b_finetune_spider_with_cell_value.cfg --run_name T5_3b_finetune_spider_with_cell_value --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 0 --adafactor true --learning_rate 1e-4 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_large_finetune_spider_with_cell_value --per_device_train_batch_size 2 --per_device_eval_batch_size 8 --generation_num_beams 1 --generation_max_length 512 --input_max_length 512 --ddp_find_unused_parameters true

@eyuansu62 Hi, any new progress over there?
We double-checked our experiments log before and didn't find the case you showed, and we looked through the issues of PICARD and saw that you made similar issue in there too. It is very likely we are facing the same issue and same factor in your machine.

Hope we can figure that out together!

They are still a little different.
image

Could you double-check the evaluation and prediction json file?
It could help us with where the problem lies.

I check the evaluation and prediction json file, and find they are indeed different, no matter when do_train=False or num_train_epoch=0.

The different sqls are like follows, just a few conditions are wrong:
select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.concert_id where concert.year = 2014
select singer.name from concert join singer_in_concert on concert.concert_id = singer_in_concert.singer_id where concert.year = 2014

Okay, I will keep this issue active and see if anyone find similar problem!

I just realized that the command you provided is for T5-3b without using deepspeed. I remember that we didn't manage to run without deepspeed even on an A100. What kind of GPU are you using, if you remember?

Well, it is actually t5-large in this cfg file. I forget to change the file name.

Hey, we asked someone else for help to test it on his side and didn't get different result between eval mode and test mode(which is consistent with ours). Therefore we think it may because the machine in your side. Could you provide more info about hardware and system then?

image
image