Different results between training and eval
Sorry to bother you! But I found another interesting problem.
When I start training (with train.json), I get a mid-training result such as:
"epoch": 2304.0,
"eval_exact_match": 0.6460348162475822,
"eval_exec": 0.6460348162475822,
"eval_loss": 0.41825902462005615,
"eval_runtime": 90.718,
"eval_samples_per_second": 11.398,
"step": 2304
It can be seen that eval_exact_match is around 0.64.
But if I run evaluation mode (with eval.json), I will get:
"eval_exact_match": 0.6247582205029013,
"eval_exec": 0.6431334622823984,
"eval_loss": 0.41071268916130066,
"eval_runtime": 244.047,
"eval_samples": 1034,
"eval_samples_per_second": 4.237
The eval_exact_match is around 0.62.
And the eval.json is:
"run_name": "t5+picard-spider-eval",
"model_name_or_path": "train/checkpoint-2304",
"dataset": "spider",
"source_prefix": "",
"schema_serialization_type": "peteshaw",
"schema_serialization_randomized": false,
"schema_serialization_with_db_id": true,
"schema_serialization_with_db_content": true,
"normalize_query": true,
"target_with_db_id": true,
"output_dir": "/eval",
"cache_dir": "/transformers_cache",
"do_train": false,
"do_eval": true,
"fp16": false,
"per_device_eval_batch_size": 5,
"seed": 1,
"report_to": ["tensorboard"],
"predict_with_generate": true,
"num_beams": 4,
"num_beam_groups": 1,
"diversity_penalty": 0.0,
"max_val_samples": 1034,
"use_picard": false,
"launch_picard": false,
"picard_mode": "parse_with_guards",
"picard_schedule": "incremental",
"picard_max_tokens_to_check": 2,
"eval_accumulation_steps": 1,
"metric_config": "both",
"val_max_target_length": 512,
"val_max_time": 1200
The difference is about 2%. Have you ever seen this problem?
Yes, I've encountered this problem. For this reason I always report the numbers that are reproducible based on the saved checkpoints and never those during training.
I have been unable to pinpoint the origin of the issue, though I think it has to do with mixed-precision training and lossy conversions between floating-point formats when saving the model weights. If I knew how to reproduce this in a minimal example, I'd open an issue with hf transformers.
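A minimal sketch (not from this repo) of the kind of loss meant here: a single fp32 → fp16 → fp32 round trip of a weight tensor already perturbs the values slightly, which is enough to nudge generation and hence the metrics:

# Minimal sketch: fp32 -> fp16 -> fp32 round trip of a stand-in weight matrix.
# Illustrates the lossy conversion suspected above; not code from this repository.
import torch

w_fp32 = torch.randn(1000, 1000)        # stand-in for a model weight matrix
w_roundtrip = w_fp32.half().float()     # "save as fp16, load back as fp32"
max_abs_diff = (w_fp32 - w_roundtrip).abs().max().item()
print(f"max |delta| after fp16 round trip: {max_abs_diff:.2e}")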
@eyuansu62 Something I noticed: are you aware that your exact-match and exec accuracies are identical? That doesn't seem right; have you made modifications to that code?
Another thought: the content-matching code I borrowed from Victoria Lin et al.'s BRIDGE model does not necessarily produce the same column values between runs. This instability can explain the discrepancy partially but not fully. If you like to stare at diffs, try comparing the predictions_[step].json files between training and evaluation.
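If it helps, here is a hypothetical helper for that comparison (not part of the repo; the "prediction" and "question" field names are assumptions, so adjust them to whatever the actual predictions_[step].json schema contains):

# Hypothetical diff helper: print examples where two prediction files disagree.
import json
import sys

def load(path):
    with open(path) as f:
        return json.load(f)

train_preds = load(sys.argv[1])   # predictions_[step].json from training-time eval
eval_preds = load(sys.argv[2])    # predictions_[step].json from standalone eval

for i, (a, b) in enumerate(zip(train_preds, eval_preds)):
    if a.get("prediction") != b.get("prediction"):
        print(f"--- example {i}: {a.get('question', '')}")
        print(f"train-time: {a.get('prediction')}")
        print(f"eval-time:  {b.get('prediction')}")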
Something I noticed: are you aware that your exact-match and exec accuracies are identical? That doesn't seem right; have you made modifications to that code?
I did not modify the metric code. The identical result seems to be a coincidence at epoch 2304, because at epoch 3008 I get:
"epoch": 3008.0,
"eval_exact_match": 0.6450676982591876,
"eval_exec": 0.6421663442940039,
"eval_loss": 0.45334360003471375,
"eval_runtime": 96.9869,
"eval_samples_per_second": 10.661,
"step": 3008
content-matching code
Recently, I carefully compared the differences between training and evaluation. There are many kinds of errors, such as keyword errors (asc, desc), wrong table names, wrong column names, etc.
Because I focus on exact match, the column values seem unimportant to me.
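For what it's worth, a rough sketch of how such mismatches could be bucketed into the coarse categories above (my own sketch, not from the repo; it assumes the predicted and gold queries are plain SQL strings and uses a crude token-level check for identifiers):

# Rough sketch: bucket a prediction/gold mismatch into a coarse error category.
import re

def error_category(pred: str, gold: str) -> str:
    p, g = pred.lower(), gold.lower()
    if p == g:
        return "match"
    # ordering-keyword mismatch (asc/desc present in one query but not the other)
    if ("asc" in p) != ("asc" in g) or ("desc" in p) != ("desc" in g):
        return "keyword (asc/desc)"
    # crude token-level check for differing identifiers (tables/columns)
    p_tokens = set(re.findall(r"[a-z_]\w*", p))
    g_tokens = set(re.findall(r"[a-z_]\w*", g))
    if p_tokens != g_tokens:
        return "table/column name"
    return "other"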