showlab/EgoVLP

EPIC-Kitchens MIR Finetuning parameters

thechargedneutron opened this issue · 6 comments

Hi,

Thank you for your good work!

I am trying to finetune your egovlp.pth checkpoint to reproduce the numbers reported in your paper. With this checkpoint I am able to get similar numbers for the zero-shot MIR experiment. However, I am not able to finetune the model to reach the corresponding numbers reported in the paper. Could you please provide the hyperparameters you used to get those results? In particular, the number of nodes (and GPUs), the batch size per GPU, and the learning rate.

For reference, I am training on one node with eight GPUs (batch size 4 per GPU) with lr=3e-5, and I get the following values for the first 9 epochs:

[mir_metrics]EpicKitchens_MIR epoch 1, nDCG_V2T: 53.056, nDCG_T2V: 49.662, nDCG_AVG: 51.359,, mAP_V2T: 40.498, mAP_T2V: 32.386, mAP_AVG: 36.442
[mir_metrics]EpicKitchens_MIR epoch 2, nDCG_V2T: 50.703, nDCG_T2V: 48.978, nDCG_AVG: 49.841,, mAP_V2T: 37.345, mAP_T2V: 30.398, mAP_AVG: 33.871
[mir_metrics]EpicKitchens_MIR epoch 3, nDCG_V2T: 52.399, nDCG_T2V: 50.459, nDCG_AVG: 51.429,, mAP_V2T: 37.134, mAP_T2V: 30.674, mAP_AVG: 33.904
[mir_metrics]EpicKitchens_MIR epoch 4, nDCG_V2T: 52.612, nDCG_T2V: 51.101, nDCG_AVG: 51.856,, mAP_V2T: 37.135, mAP_T2V: 30.797, mAP_AVG: 33.966
[mir_metrics]EpicKitchens_MIR epoch 5, nDCG_V2T: 49.996, nDCG_T2V: 49.994, nDCG_AVG: 49.995,, mAP_V2T: 36.629, mAP_T2V: 30.758, mAP_AVG: 33.694
[mir_metrics]EpicKitchens_MIR epoch 6, nDCG_V2T: 53.716, nDCG_T2V: 51.267, nDCG_AVG: 52.492,, mAP_V2T: 39.028, mAP_T2V: 31.126, mAP_AVG: 35.077
[mir_metrics]EpicKitchens_MIR epoch 7, nDCG_V2T: 51.784, nDCG_T2V: 50.258, nDCG_AVG: 51.021,, mAP_V2T: 37.062, mAP_T2V: 29.698, mAP_AVG: 33.380
[mir_metrics]EpicKitchens_MIR epoch 8, nDCG_V2T: 54.027, nDCG_T2V: 51.747, nDCG_AVG: 52.887,, mAP_V2T: 39.233, mAP_T2V: 31.393, mAP_AVG: 35.313
[mir_metrics]EpicKitchens_MIR epoch 9, nDCG_V2T: 54.211, nDCG_T2V: 51.660, nDCG_AVG: 52.935,, mAP_V2T: 40.166, mAP_T2V: 31.139, mAP_AVG: 35.653

TIA

Hi @thechargedneutron, thanks for your interest.

I want to confirm: are you using this JSON to reproduce the result?

The MaxMarginRankingLoss fine-tuning objective is crucial for reproducing the results; I recommend a margin of 0.2.
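For concreteness, here is a minimal sketch of such a bidirectional max-margin ranking loss over a video-text similarity matrix (an illustrative reimplementation, not the exact code in the repo; the default margin matches the recommended 0.2):

```python
import torch
import torch.nn as nn

class MaxMarginRankingLoss(nn.Module):
    """Bidirectional max-margin ranking loss over an (N, N) video-text
    similarity matrix, where sim[i, i] is the matched (positive) pair."""

    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        n = sim.size(0)
        pos = sim.diag().view(n, 1)
        # Hinge every negative against its positive in both retrieval
        # directions (along rows and along columns of the matrix).
        cost_rows = (self.margin + sim - pos).clamp(min=0)
        cost_cols = (self.margin + sim - pos.t()).clamp(min=0)
        # Mask out the diagonal (positives) before averaging over negatives.
        off_diag = 1.0 - torch.eye(n, device=sim.device)
        return ((cost_rows + cost_cols) * off_diag).sum() / max(2 * n * (n - 1), 1)
```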

I also use 8 GPUs on a single node to reproduce the result.

EgoClip_EPIC_16f_best_rel_01_margin_02.zip

I have attached our training log and the corresponding config above for your reference.

Thanks for the config file and the logs. I will try a similar run and observe the results. One question: what is the starting checkpoint for this experiment? Is it not this checkpoint? Should I start from a different checkpoint? The config file says /apdcephfs/private_qinghonglin/video_codebase/frozen-in-time-main/results/EgoClip_M_EgoNCE_N_V_Neg_Seg_60/models/0510_10/checkpoint-epoch1.pth.

Yes, they are the same checkpoint.
By checking the log, you can see:
2022-05-12 01:07:33,544 - trainer - INFO - [epic_kitchens]EpicKitchens epoch -1, nDCG_V2T: 23.810, nDCG_T2V: 21.854, nDCG_AVG: 22.832,, mAP_V2T: 19.037, mAP_T2V: 13.858, mAP_AVG: 16.448
which is the zero-shot MIR result with 16 frames.
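That epoch -1 line is simply an evaluation of the loaded checkpoint before any fine-tuning update, so it doubles as a check that training started from the intended weights. Schematically (every name below is a placeholder for the repo's actual utilities, not its real API):

```python
import torch

# Placeholders: `model`, `evaluate`, `train_one_epoch`, and the loaders
# stand in for the codebase's actual training utilities.
ckpt = torch.load("checkpoint-epoch1.pth", map_location="cpu")
model.load_state_dict(ckpt.get("state_dict", ckpt))

# One evaluation before any update: logged as "epoch -1", i.e. zero-shot.
print("epoch -1:", evaluate(model, val_loader))

for epoch in range(1, num_epochs + 1):
    train_one_epoch(model, train_loader)
    print(f"epoch {epoch}:", evaluate(model, val_loader))
```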

@QinghongLin Thanks for providing the log file and the hyperparameter config. I am now getting numbers very close to the reported results. Thank you for your help. One more question: I noticed your training is quite fast -- one epoch takes ~5 minutes for you, whereas it takes around an hour for me. Are you using an SSD? Or is there anything else I may be missing?

@thechargedneutron
Yes, I run the experiments on a single node with 8 A100 GPUs and an SSD.
Are you using pre-extracted image frames as input (with the same data loader I implemented)? I also suggest checking the number of data loader workers (it may be too small).
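For example, something along these lines (the values are illustrative, not the exact ones from the attached config; `dataset` stands in for the EPIC-Kitchens frame dataset):

```python
from torch.utils.data import DataLoader

# `dataset` is a placeholder for the EPIC-Kitchens frame dataset instance.
loader = DataLoader(
    dataset,
    batch_size=4,             # per-GPU batch size, as in the discussion above
    num_workers=8,            # too few workers starves the GPU on frame decoding/IO
    pin_memory=True,          # faster host-to-device copies
    shuffle=True,
    persistent_workers=True,  # avoid re-spawning workers every epoch
)
```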