showlab/EgoVLP

About NLQ results.

takfate opened this issue · 6 comments

Hello.
Thanks for such nice work!
We have a question and would appreciate your help.
We use your EgoVLP_PT_BEST checkpoint to extract the video features.
We then train VSLNet with these features and the BERT checkpoint from EgoVLP_PT_BEST.
However, we cannot seem to reach the results you report; we only get about 7~8 R1@0.3.

Thanks,
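
For readers unfamiliar with the metric, here is a minimal sketch of how R1@0.3 is commonly computed for temporal grounding tasks such as NLQ: the percentage of queries whose top-1 predicted segment has a temporal IoU of at least 0.3 with the ground-truth segment. The function names `temporal_iou` and `recall_at_1` are hypothetical and are not taken from the EgoVLP or VSLNet code; the official evaluation script may differ in details.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, ground_truths, iou_threshold=0.3):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(ground_truths):
        raise ValueError("need exactly one top-1 prediction per ground truth")
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)


# Toy data (made up for illustration, not real NLQ predictions).
preds = [(0.0, 10.0), (5.0, 6.0), (22.0, 30.0)]
gts = [(0.0, 8.0), (0.0, 4.0), (24.0, 40.0)]
print(recall_at_1(preds, gts, iou_threshold=0.3))
print(recall_at_1(preds, gts, iou_threshold=0.5))
```

Under this reading, "7~8 R1@0.3" means that roughly 7 to 8 percent of queries have a top-1 prediction overlapping the ground truth with IoU of at least 0.3.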

We get similar results (~8 R1@0.3) with the default settings,
and we further boosted the performance with some parameter tuning (e.g., learning rate, batch size).

I have attached our config.json and the log of our best results here: model.zip. I hope it helps you reproduce the results.

Please reach out if you have new updates.

Thanks for your response.
For feature extraction, does the model contain the video projection (video_dim->256) and the text projection (text_dim->256)?
Do the video features and text features both have 256 channels?

Yes, during the feature extraction, the model contains video_proj and text_proj, and the channels of video and text features are 256.

Is args.token set to True when extracting text features?
We find that the extracted text feature is 1x256 by default.

> Is args.token set to True when extracting text features? We find that the extracted text feature is 1x256 by default.

In our experiments, using Lx256 features and using 1x256 features give similar performance, but both are weaker than using Lx768. With Lx768 we obtain performance similar to your results, although there is still a gap of about 0.4.
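
To make the three shapes concrete, here is a rough sketch of how they could relate to each other. It assumes that Lx768 means the per-token hidden states before the text projection, Lx256 means the same tokens after the projection, and 1x256 means a single projected vector taken from the first token. These are assumptions for illustration only, not a description of the EgoVLP extraction script; `linear`, `shape`, and `text_feature_variants` are hypothetical helpers, and the weights are random stand-ins rather than checkpoint values.

```python
import random


def linear(vector, weight):
    """Bias-free linear map; weight is a list of out_dim rows of length in_dim."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def shape(matrix):
    """(rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))


def text_feature_variants(token_states, text_proj):
    """Return the three text feature layouts discussed in the thread."""
    return {
        # Token-level states before projection (assumed meaning of Lx768).
        "Lx768": token_states,
        # Token-level states after projection (assumed meaning of Lx256).
        "Lx256": [linear(tok, text_proj) for tok in token_states],
        # One projected vector from the first token (assumed meaning of 1x256).
        "1x256": [linear(token_states[0], text_proj)],
    }


rng = random.Random(0)
num_tokens, text_dim, proj_dim = 4, 768, 256  # L = 4 in this toy example
token_states = [
    [rng.gauss(0.0, 1.0) for _ in range(text_dim)] for _ in range(num_tokens)
]
text_proj = [
    [rng.gauss(0.0, 0.02) for _ in range(text_dim)] for _ in range(proj_dim)
]

for name, feature in text_feature_variants(token_states, text_proj).items():
    print(name, shape(feature))
```

If these assumptions hold, the Lx768 variant keeps the full hidden width per token, while both 256-dimensional variants pass through the same projection bottleneck, which would be consistent with the two 256-dimensional settings behaving similarly.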

@takfate

Hi, the NLQ results were produced by my collaborator Mattia, so I may be misremembering some details. I have attached our VSLNet implementation here so that you can check the relevant details.

NLQ.zip