Yui010206/SeViLA

The result of inference cannot be found in the paper

fake-warrior8 opened this issue · 5 comments

Hi, I reran the inference code (sh run_scripts/sevila/inference/nextqa_infer.sh) and got the result {'agg_metrics': 0.649119295436349, 'total': 4996, 'DC': 60.32608695652174, 'CH': 64.62809917355372, 'CW': 63.91959798994975, 'TN': 57.279236276849645, 'TC': 65.26458616010855, 'DL': 89.84962406015038, 'DO': 73.74631268436578, 'TP': 51.35135135135135}. However, I cannot find an accuracy of 64.9 in your paper. Which setting do the results of nextqa_infer.sh correspond to, and why does 64.9 not appear in the paper?

I used the NExT-QA videos and annotations from the original authors' GitHub, the preprocessing code you provided, and the checkpoint you released.

Hello, the GPU we use for our model is an A6000; what type of GPU did you use? The model used in the script is built from a zero-shot answerer and a pre-trained localizer (corresponding to the NExT-QA result of 63.6% in Table 2 of the paper).

I used an A100. Thank you for your reply.

I reran the fine-tuning code on NExT-QA and got the result {'agg_metrics': 0.8364691753402722, 'total': 4996, 'DC': 75.0, 'CW': 84.92462311557789, 'TN': 78.52028639618138, 'DL': 97.36842105263158, 'CH': 81.81818181818183, 'TC': 81.9538670284939, 'TP': 78.37837837837837, 'DO': 90.2654867256637}, which is much higher than the 73.8 accuracy reported in your paper.

I used the following script:

python -m torch.distributed.run --nproc_per_node=4 --master_port 29503 train.py \
--cfg-path ./lavis/projects/sevila/train/nextqa.yaml \
--options run.output_dir=${result_dir}${exp_name} \
model.frame_num=4 \
datasets.nextqa.build_info.annotations.train.storage=${train_path} \
datasets.nextqa.build_info.annotations.val.storage=${val_path} \
datasets.nextqa.build_info.annotations.test.storage=${val_path} \
datasets.nextqa.build_info.videos.storage=${video_path} \
datasets.nextqa.vis_processor.train.n_frms=32 \
datasets.nextqa.vis_processor.eval.n_frms=32 \
run.batch_size_train=8 \
run.batch_size_eval=8 \
run.init_lr=3e-5 \
run.max_epoch=10 \
run.warmup_steps=1000 \
run.accum_grad_iters=2 \
model.task='qvh_freeze_loc_train_qa_with_loc_train_qa_vid' \
model.finetuned=${ckpt} \
run.task='videoqa'

The train/val meta files were processed with your script, the checkpoint is the one you released, and the videos were downloaded from NExT-QA.

Oh, I just checked the data preprocessing scripts and found the bug.
The NExT-QA val split was actually the NExT-QA train split in the previous scripts. I have fixed this, and you can double-check it by confirming that NExT-QA val contains 4996 examples.
Sorry for this bug, and thanks for pointing it out.
Could you please re-try the zero-shot setting with the corrected NExT-QA val split? The result should be similar to what our paper reports.
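
If it helps, here is a minimal sketch of that check, assuming the regenerated val annotation is a JSON list of QA entries (the path below is a placeholder for wherever your preprocessing wrote the file):

import json

# Placeholder path; point this at the val annotation produced by the preprocessing script.
val_path = "nextqa/val.json"

with open(val_path) as f:
    val_anno = json.load(f)

# The corrected NExT-QA val split should contain 4996 QA examples.
print(len(val_anno))  # expected: 4996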

As a note, if you want to re-test the fine-tuned model, you should combine the saved checkpoint with the released checkpoint to fill in the missing keys, since the LAVIS framework does not save the frozen parts of the model.
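
A minimal sketch of that merge, assuming both files are standard PyTorch checkpoints whose weights live under a 'model' key (the paths and key layout are illustrative, not necessarily what LAVIS writes in your setup):

import torch

# Illustrative paths; replace with your fine-tuned output and the released checkpoint.
finetuned = torch.load("result/checkpoint_best.pth", map_location="cpu")
released = torch.load("sevila_checkpoints/sevila_pretrained.pth", map_location="cpu")

# Start from the released weights (which include the frozen parts),
# then overwrite them with the fine-tuned weights that were actually saved.
merged = dict(released["model"])
merged.update(finetuned["model"])

finetuned["model"] = merged
torch.save(finetuned, "merged_checkpoint.pth")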