The details of the experiments look very solid. Has anyone reproduced them successfully?
Arsiuuu opened this issue · 5 comments
I tried several times, but got poor results.
+1
I only tried MSR-VTT for video-text retrieval and also got poor results. @BinhuiXie
@BinhuiXie Hi there. We ran a large number of ablation experiments, and due to storage limits we regrettably did not keep every checkpoint. However, we did find the Flickr30K results from one version of the ablation experiments, and you can see that R@1 does not drop much. If your R@1 drops substantially (by more than 10 points), we suspect the model simply has not converged.
@Arsiuuu Hi there. We follow the same settings as UniAdapter, so please check that your pre-trained .pth file and dataset are consistent with UniAdapter's. Also, since the ITC loss uses a global feature queue, you should use roughly the same batch size and number of GPUs.
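For context, here is a minimal sketch of the BLIP-style global queue behind the ITC loss (illustrative names, not the repo's actual code). The queue is a fixed-size ring buffer filled with features gathered from every GPU at each step, which is why batch size and GPU count change the pool of negatives:

```python
import torch
import torch.nn.functional as F

class ITCQueue:
    """Ring buffer of negative features for an ITC-style contrastive loss.

    Sketch under assumptions: a BLIP/ALBEF-like momentum queue; names and
    sizes here are illustrative, not taken from the Aurora/UniAdapter code.
    """

    def __init__(self, embed_dim: int = 256, queue_size: int = 57600):
        # queue_size must be divisible by the global batch size
        # (batch_size * num_gpus), or the enqueue below wraps incorrectly.
        self.queue = F.normalize(torch.randn(embed_dim, queue_size), dim=0)
        self.queue_size = queue_size
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, feats: torch.Tensor) -> None:
        # feats: [global_batch, embed_dim], gathered from all GPUs each step
        n = feats.shape[0]
        assert self.queue_size % n == 0, "queue_size % global_batch != 0"
        self.queue[:, self.ptr:self.ptr + n] = feats.T
        self.ptr = (self.ptr + n) % self.queue_size
```

With a different global batch (batch size × nGPUs), the queue refreshes at a different rate and the negative distribution shifts, which can noticeably move retrieval numbers even when everything else is identical.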
@xinlong-yang thank you for your response.
Actually, the large drop was caused by an incorrect command.
Original:
```bash
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_flickr.yaml --output_dir output/flickr --evaluate
```
this will load the BLIP pre-trained parameters 🤣
The fine-tuned Aurora parameters can be loaded as follows.
Correct:
```bash
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py --config ./configs/retrieval_flickr.yaml --output_dir output/flickr --evaluate --pretrained output/flickr/checkpoint_3.pth
```
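For anyone else hitting this, here is a hedged sketch of the flag/config interplay (names are illustrative; check the actual train_retrieval.py): the `--pretrained` argument overrides the checkpoint path in the YAML config, and without it the config's default BLIP weights get evaluated.

```python
import argparse
import yaml

# Illustrative sketch, NOT the repo's actual code: how a BLIP-style
# train_retrieval.py typically decides which weights to evaluate.
def resolve_checkpoint(config_path: str, cli_pretrained: str = "") -> str:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    if cli_pretrained:
        # --pretrained on the command line overrides the config entry,
        # pointing evaluation at the fine-tuned Aurora checkpoint.
        config["pretrained"] = cli_pretrained
    # Without --pretrained, this is still the BLIP pre-trained path
    # from the YAML, which is why the numbers looked wrong.
    return config["pretrained"]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="./configs/retrieval_flickr.yaml")
    parser.add_argument("--pretrained", default="")
    parser.add_argument("--evaluate", action="store_true")
    args = parser.parse_args()
    print("evaluating weights from:", resolve_checkpoint(args.config, args.pretrained))
```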
thanks again! keep up the fantastic work 🚀