Pre-trained visual encoder, 4M pretraining?
m-bain opened this issue · 5 comments
Hi, thanks for this valuable work, it's a good insight into fusing vision and text, with impressive results :).
Regarding the initialisation of the visual encoder: pretrained CLIP, which was trained on 400M images, is used for the visual encoder?
It's somewhat confusing because the table states that 4M images were used for pre-training.
Thanks for your question!
The "4M" pretraining data means we pretrain our proposed Bridge-Tower in 4M data. We follow the common practice in VL pretraining field without including the pretrained visual backbone. Pls refer to METER paper.
But anyway, this is a good suggestion, as we will make it more clear in our next version.
Hi, I'm not sure that the METER paper provides any evidence in that regard. In their "Pretraining" section, they state that they only use Adam with these 4 datasets (<4M images), and there's no mention anywhere of any other kind of pre-training for their visual backbones.
Even if they did, it wouldn't be a good precedent to follow, as it would render all the statements about pretraining false (since more than 4M images were used before the training stage).
Hi, @m-bain and @jotaf98. Thanks for your suggestions!
About "Pre-trained visual encoder, 4M pretraining?": The statistics of the pre-training datasets are shown in METER Paper Appendix A Table 11 and Bridge-Tower Paper Appendix A.5 Table 9. It is a common practice to adopt the pre-trained visual and/or textual backbone and then continue to pre-train with the 4M images. Almost all the baselines (except SimVLM) in Table 4 of our paper adopt the pre-trained visual and/or textual backbone. Considering the pre-trained visual and/or textual backbones used by these baselines are various, for brevity, VL papers often do not show the data used by the pre-trained visual and/or textual backbone in the result tables. (You can check these details in VL papers.) But anyway, as confusing as it may seem, we will make it more clear in our next version.
About "METER Pre-training": In Section 4.5, METER adopts the pre-trained CLIP-ViT(or Swin Transformer) and RoBERTa as their visual/textual backbone, then pre-trains their model with the 4M data. As shown in Section 4.1(Ours), we also use the pre-trained CLIP-ViT and RoBERTa as backbones and then pre-train them with the same 4 datasets(actually, fewer data could be downloaded). The comparison between METER and Bridge-Tower is fair and proves bridges can help Vision-Language Representation Learning.
We hope the above statements resolve your confusion.
Thanks again for your questions and suggestions! We will make this clearer in the next version based on our discussion.
Just to respond to #1:
- ALBEF uses a DeiT visual backbone trained on ImageNet.
- VLMo [77] uses a self-supervised, image-only initialisation.
Some other works extract Faster R-CNN region features, so yes, the backbones vary.
But comparing against these works (and listing them all under the same 4M column) while pre-training your visual encoder on 400M image-text pairs is not a fair comparison. It's possible your SOTA performance is due to the CLIP backbone (you can see this from the big boost METER gets from Swin -> CLIP); since you don't provide any ablation on this, there is no way to tell. E.g., what would ALBEF's 4M performance be if it also used CLIP initialisation?
A helpful fix would be to add a "visual backbone" column, like we did in our Frozen in Time paper.
Bridge-Tower and METER both use CLIP-ViT as the visual backbone, which means the comparison between them is fair. The performance improvement over METER demonstrates the effectiveness of the introduced bridges. In addition, METER with the CLIP backbone does not achieve SOTA performance on all VL downstream tasks. For example, METER's 570.7 vs. ALBEF's 571.4 (RSUM on Flickr30K) also shows that the CLIP backbone alone is not enough to achieve SOTA performance.
As stated in Section 3:
our goal is to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder. Our goal is not to develop new encoders; in principle, one can apply any visual, textual, or cross-modal encoder in the proposed Bridge-Tower architecture.
In a word, the key to our paper is not whether we use ALBEF, VLMo, or METER as the base architecture, nor what we use as the visual backbone. The key is that the introduced bridges can improve performance on various VL downstream tasks.
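For readers following the thread, here is a minimal illustrative sketch of the bridge idea described in the quote above. It is an assumption-laden example, not the authors' implementation: the bridge is modeled as a simple add & LayerNorm that injects the k-th uni-modal layer representation into the k-th cross-modal layer, and all names, shapes, and the surrounding loop are hypothetical.

```python
import torch.nn as nn

class Bridge(nn.Module):
    """Toy bridge: fuse a uni-modal layer's output into the cross-modal stream."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, cross_modal_states, uni_modal_states):
        # Combine the previous cross-modal output with the matching uni-modal layer,
        # then normalize, so every cross-modal layer sees layer-wise uni-modal detail.
        return self.norm(cross_modal_states + uni_modal_states)

# Hypothetical usage inside a cross-modal encoder:
# for k, layer in enumerate(cross_modal_layers):
#     visual_in = visual_bridges[k](visual_cross, visual_uni_layers[k])
#     text_in = text_bridges[k](text_cross, text_uni_layers[k])
#     visual_cross, text_cross = layer(visual_in, text_in)
```

The point of the sketch is only to show why the mechanism is backbone-agnostic: the bridge operates on per-layer hidden states, so in principle any visual or textual encoder that exposes them can be plugged in.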
Thanks for mentioning your paper "Frozen in Time". Its Table 4 is a helpful reference (the "Vis Enc. Init.", "Visual-Text PT", and "#pairs PT" columns). We will also consider adding columns such as "visual backbone", "textual backbone", etc.
Thanks again for the helpful discussion!