jiawen-zhu/ViPT

The reproduced model fails to achieve the accuracy reported in the paper


Thank you for the open-source code. I noticed that the model supports two backbone types: vit_ce and vit. I therefore downloaded the checkpoints provided by OSTrack, vitb_256_mae_ce_32x4_ep300 and vitb_256_mae_32x4_ep300, and placed them in the ./pretrained directory as OSTrack_ce_ep0300.pth.tar and OSTrack_ep0300.pth.tar. By modifying the YAML file, I tried four combinations of pre-trained weights and backbones on LasHeR:
- OSTrack_ep0300.pth + vit_base_patch16_224_prompt (log: vipt-deep_rgbt_os300+vit.log)
- OSTrack_ep0300.pth + vit_base_patch16_224_ce_prompt (log: vipt-deep_rgbt_os300+vitce.log)
- OSTrack_ce_ep0300.pth + vit_base_patch16_224_ce_prompt (log: vipt-deep_rgbt_os300ce+vitce.log)
- OSTrack_ce_ep0300.pth + vit_base_patch16_224_prompt (log: vipt-deep_rgbt_os300ce+vit.log)

However, none of these combinations reaches the accuracy reported in the vipt-deep_rgbt.log that you provided. Could you please advise on what might be wrong with my reproduction? (Hardware: 2× RTX 4090.)
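For reference, here is roughly how I switched each combination in the experiment YAML. The key names below follow the OSTrack-style config layout; the exact file path and field names in this repo may differ, so treat them as assumptions:

```yaml
# Illustrative fragment of the experiment YAML (key names assumed
# to follow the OSTrack-style config that ViPT builds on).
MODEL:
  PRETRAIN_FILE: "OSTrack_ce_ep0300.pth.tar"   # or OSTrack_ep0300.pth.tar
  BACKBONE:
    TYPE: vit_base_patch16_224_ce_prompt       # or vit_base_patch16_224_prompt
```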
By the way, to resolve an issue where images could not be found during data loading, I made the following change:

[Attached screenshot of the change: 微信图片_20240331224006]
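In case it helps diagnose the path issue, here is a small sanity-check sketch I would use to verify the dataset layout before training. The directory structure (one folder per sequence, with `visible` and `infrared` subfolders) is an assumption about how LasHeR is organized, and `check_lasher_layout` is a hypothetical helper, not part of the ViPT codebase:

```python
import os

def check_lasher_layout(root, modalities=("visible", "infrared")):
    """Return (sequence, modality) pairs whose folder is missing.

    Assumes the layout root/<sequence>/<modality>/*.jpg, which is
    how LasHeR is commonly unpacked; adjust if your copy differs.
    """
    missing = []
    for seq in sorted(os.listdir(root)):
        seq_dir = os.path.join(root, seq)
        if not os.path.isdir(seq_dir):
            continue  # skip stray files such as annotation lists
        for modality in modalities:
            if not os.path.isdir(os.path.join(seq_dir, modality)):
                missing.append((seq, modality))
    return missing

if __name__ == "__main__":
    # Point this at your LasHeR root; an empty result means the
    # expected modality folders are all present.
    print(check_lasher_layout("./data/lasher/trainingset"))
```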