rmokady/CLIP_prefix_caption

Metrics of ClipCap's Original Performance

chmorfop opened this issue

Hello,
thank you very much for your work.

In my experiments I used the transformer mapping network with the default settings, but I failed to reproduce the metrics reported in the paper.

In more detail, I used K = 10 constant tokens, a prefix length of 10, and 8 multi-head self-attention layers with 8 heads each, trained for 10 epochs with a batch size of 40 and the AdamW optimizer. The learning rate and warm-up steps are the defaults (2e-5 and 5000).
The image encoder and the decoder are also the defaults (ViT-B/32 and GPT-2).
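
For reference, this is roughly the train.py invocation I used (the flag names and data path are as I understand them from the repo's README and train.py, so please correct me if any of them are off):

```
python train.py --only_prefix --data ./data/coco/oscar_split_ViT-B_32_train.pkl --out_dir ./coco_train/ \
    --mapping_type transformer --num_layers 8 --prefix_length 10 --prefix_length_clip 10 \
    --bs 40 --epochs 10
```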

The metrics reported in the paper (for the COCO dataset and the transformer mapping network) are
(B4: 33.53%, METEOR: 27.45%, CIDEr: 113%),
whereas my results are (B4: 71,72%, METEOR: 24.89%, CIDEr: 90.91%),
which are significantly lower than the original.

Lastly, I should mention that the above experiment was trained on a single GPU and validated on the COCO validation set.
The evaluation metrics are computed with the pycocoevalcap repository.
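
The evaluation itself follows the standard pycocoevalcap usage, roughly as sketched below (file paths are placeholders for my local setup):

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth COCO validation annotations and my generated captions,
# saved in the standard [{"image_id": ..., "caption": ...}] result format.
coco = COCO("annotations/captions_val2014.json")
coco_res = coco.loadRes("results/clipcap_transformer_val.json")

coco_eval = COCOEvalCap(coco, coco_res)
# Restrict evaluation to the images I actually generated captions for.
coco_eval.params["image_id"] = coco_res.getImgIds()
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print(f"{metric}: {score:.4f}")  # Bleu_4, METEOR, CIDEr, ...
```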

Any ideas on how to reach the original model's performance?