Questions about the provided fine-tuning model parameters
LanXingXuan opened this issue · 1 comments
Hello author, when I tested the provided timechat_7b.pth, the measured metrics were lower than the results reported in the paper. I also fine-tuned TimeChat following the settings in the paper, and the measured performance was higher than that of the provided timechat_7b.pth. Is there something wrong with my fine-tuning or testing procedure, or is there an error in the provided fine-tuned model parameters?
Here are the results I got from testing the provided fine-tuned parameters timechat_7b.pth. (Some videos are missing, so the test set is nearly 20 videos smaller, but I expect this has little impact on the results.)
[val] gt video nums 396; pred video nums 396
gt video nums 396; pred video nums 396
evaluate data samples: 396
gt file:
paragraph video captioning
Para_CIDER 2.5
Para_METEOR 6.7
dense video captioning
CIDER 2.4
METEOR 0.9
Precision@0.3 26.8 Recall@0.3 26.7
Precision@0.5 8.9 Recall@0.5 9.9
Precision@0.7 2.1 Recall@0.7 2.9
Precision@0.9 0.4 Recall@0.9 0.6
Precision_Mean 9.5 Recall_Mean 10.0
F1_Score 8.7
SODA_c_2 0.9
n_preds 7.6
SODA_c_1 -100.0
The following are the results of the fine-tuned checkpoint_2.pth that I reproduced myself:
[val] gt video nums 396; pred video nums 396
gt video nums 396; pred video nums 396
evaluate data samples: 396
gt file:
paragraph video captioning
Para_CIDER 2.1
Para_METEOR 8.1
dense video captioning
CIDER 2.8
METEOR 1.0
Precision@0.3 31.1 Recall@0.3 43.5
Precision@0.5 11.0 Recall@0.5 17.9
Precision@0.7 3.4 Recall@0.7 6.3
Precision@0.9 0.4 Recall@0.9 0.8
Precision_Mean 11.5 Recall_Mean 17.1
F1_Score 12.4
SODA_c_2 1.2
n_preds 11.0
SODA_c_1 -100.0
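To make the gap between the two runs easier to read, here is a minimal sketch that parses both sets of numbers quoted above and prints the per-metric difference. The helper names `parse_metrics` and `compare` are hypothetical and not part of the TimeChat code; the metric strings are copied from the two evaluation outputs in this issue.

```python
import re

# Metric names as they appear in the evaluation output quoted in this issue.
# SODA_c_1 is -100.0 in both runs, so it is left out of the comparison.
METRICS = [
    "Para_CIDER", "Para_METEOR", "CIDER", "METEOR",
    "Precision@0.3", "Recall@0.3", "Precision@0.5", "Recall@0.5",
    "Precision@0.7", "Recall@0.7", "Precision@0.9", "Recall@0.9",
    "Precision_Mean", "Recall_Mean", "F1_Score", "SODA_c_2", "n_preds",
]

# Metric portions of the two logs, copied from the issue text.
PROVIDED = (
    "Para_CIDER 2.5 Para_METEOR 6.7 CIDER 2.4 METEOR 0.9 "
    "Precision@0.3 26.8 Recall@0.3 26.7 Precision@0.5 8.9 Recall@0.5 9.9 "
    "Precision@0.7 2.1 Recall@0.7 2.9 Precision@0.9 0.4 Recall@0.9 0.6 "
    "Precision_Mean 9.5 Recall_Mean 10.0 F1_Score 8.7 SODA_c_2 0.9 n_preds 7.6"
)
REPRODUCED = (
    "Para_CIDER 2.1 Para_METEOR 8.1 CIDER 2.8 METEOR 1.0 "
    "Precision@0.3 31.1 Recall@0.3 43.5 Precision@0.5 11.0 Recall@0.5 17.9 "
    "Precision@0.7 3.4 Recall@0.7 6.3 Precision@0.9 0.4 Recall@0.9 0.8 "
    "Precision_Mean 11.5 Recall_Mean 17.1 F1_Score 12.4 SODA_c_2 1.2 n_preds 11.0"
)


def parse_metrics(log: str) -> dict:
    """Extract 'name value' pairs from a flattened evaluation log."""
    out = {}
    for name in METRICS:
        # The lookbehind keeps "CIDER" from matching inside "Para_CIDER".
        pattern = r"(?<![A-Za-z_])" + re.escape(name) + r"\s+(-?\d+(?:\.\d+)?)"
        match = re.search(pattern, log)
        if match:
            out[name] = float(match.group(1))
    return out


def compare(baseline: dict, other: dict) -> dict:
    """Return other - baseline for every metric present in both runs."""
    return {k: round(other[k] - baseline[k], 1) for k in baseline if k in other}


provided = parse_metrics(PROVIDED)
reproduced = parse_metrics(REPRODUCED)
for name, delta in compare(provided, reproduced).items():
    print(f"{name:15s} {provided[name]:6.1f} -> {reproduced[name]:6.1f} ({delta:+.1f})")
```

A positive difference means the reproduced checkpoint_2.pth scored higher than the provided timechat_7b.pth on that metric.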
Hi, thanks for your interest.
The zero-shot performance of timechat_7b.pth on YouCook2 (as shown in the following image) should be higher than the results in our paper:
If not, please check whether you correctly converted the videos to "youcook2_6fps_224"
(see https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#compressing-videos).
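As a rough way to verify the conversion, the sketch below reads a video's frame rate and resolution with `ffprobe` (it must be installed and on the PATH). It assumes that "youcook2_6fps_224" means 6 frames per second with a short side of 224 pixels; that reading of the folder name is an assumption, so check DATA.md for the exact settings. The function names are hypothetical and not part of the TimeChat code.

```python
import json
import subprocess
from fractions import Fraction


def probe_video(path: str) -> str:
    """Return ffprobe's JSON description of the first video stream."""
    cmd = [
        "ffprobe", "-v", "error", "-select_streams", "v:0",
        "-show_entries", "stream=width,height,avg_frame_rate",
        "-of", "json", path,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def summarize(probe_json: str):
    """Extract (width, height, fps) from ffprobe JSON output."""
    stream = json.loads(probe_json)["streams"][0]
    fps = float(Fraction(stream["avg_frame_rate"]))  # e.g. "6/1" -> 6.0
    return stream["width"], stream["height"], fps


def looks_converted(probe_json: str, fps: float = 6.0, short_side: int = 224) -> bool:
    """Check the assumed target: 6 fps and a 224-pixel short side."""
    width, height, actual_fps = summarize(probe_json)
    return abs(actual_fps - fps) < 0.01 and min(width, height) == short_side


# Usage (the path is a placeholder):
#   print(looks_converted(probe_video("path/to/video.mp4")))
```

Running the check over every file in the converted folder should quickly reveal any videos that were skipped or left at their original frame rate.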
Which dataset did you use for fine-tuning? TimeIT? If so, I think your results are comparable to the results in our paper.