Metrics of ClipCap's Original Performance
chmorfop opened this issue · 2 comments
Hello,
thank you very much for your work.
In my experiments, I used the transformer mapping network with the default settings, but I failed to reproduce the metrics reported in the paper.
In more detail, I used K=10 constant tokens, a prefix length of 10, and 8 multi-head self-attention layers with 8 heads each, training for 10 epochs with a batch size of 40 and the AdamW optimizer. The learning rate and warm-up steps are the defaults (2e-5, 5000).
The image encoder and the decoder are the default (ViT-B/32 and GPT2).
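For reference, the default schedule above (peak learning rate 2e-5, 5000 warm-up steps) corresponds to a linear warm-up followed by linear decay, in the style of HuggingFace's `get_linear_schedule_with_warmup`, which ClipCap's training script uses. A minimal sketch in plain Python, assuming a placeholder `total_steps` (in practice it is `epochs * len(dataloader)`):

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=5000, total_steps=50000):
    """Linear warm-up to base_lr over warmup_steps, then linear decay to 0.

    Mirrors the shape of HuggingFace's get_linear_schedule_with_warmup;
    total_steps is a hypothetical placeholder, not ClipCap's actual value.
    """
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at warmup_steps down to 0 at total_steps.
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

Note that with a different batch size or number of epochs, `total_steps` changes while `warmup_steps` stays fixed at 5000, so the effective schedule shape differs from the original runs; this is one setting worth checking when results diverge.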
The metrics reported in the paper (on the COCO dataset, with the transformer mapping network) are
(B4: 33.53%, METEOR: 27.45%, CIDEr: 113%),
in contrast to my metrics, which are (B4: 71.72%, METEOR: 24.89%, CIDEr: 90.91%)
and are significantly lower than the original.
Lastly, I should mention that the above experiment was trained on a single GPU and validated on the COCO dataset.
The evaluation metrics were calculated with the pycocoevalcap repository.
Any ideas on how to reach the original model's performance?