Question about Semantic Augmentation
luoluo123123123123 opened this issue · 7 comments
Hello,
The goal of
We used 1 A100 GPU with 40GB VRAM in all the experiments.
I would like to know: what is the difference between directly subtracting the words 'rain' and 'sunny' as an augmentation and translating them into Aj? I appreciate your assistance on this matter.
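To make the "direct subtraction" option in the question concrete, here is a minimal sketch. It assumes the image and text features live in one shared, normalized embedding space. The 4-d vectors are made-up stand-ins for encoder outputs, and the names (`t_sunny`, `t_rain`, `f_img_sunny`) are hypothetical. It does not show the translation into Aj, which is not spelled out in this thread.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy stand-ins for text embeddings of 'sunny' and 'rain' (made-up values).
t_sunny = normalize([0.9, 0.1, 0.3, 0.0])
t_rain = normalize([0.1, 0.9, 0.3, 0.0])

# Toy stand-in for the feature of a sunny source image (made-up values).
f_img_sunny = normalize([0.8, 0.2, 0.4, 0.1])

# Direct subtraction: the text difference is used as a shift direction.
direction = [r - s for r, s in zip(t_rain, t_sunny)]

# Shift the image feature along that direction and re-normalize.
f_aug = normalize([f + d for f, d in zip(f_img_sunny, direction)])

before = cosine(f_img_sunny, t_rain)
after = cosine(f_aug, t_rain)
print(f"similarity to 'rain' before: {before:.3f}, after: {after:.3f}")
```

In this toy setup the shifted feature ends up closer to 'rain' and further from 'sunny', which is the whole effect of the direct-subtraction variant.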
so the output of
Thank you for your explanation; I completely understand now! One last question: if I use the ViT-B/16 pre-trained CLIP model as the text encoder and the ImageNet pre-trained RN101 model as the image encoder, would this approach still be effective? Or is this method only applicable when the image encoder and the text encoder come from the same source?
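From my understanding, the underlying CLIP text encoder architecture (i.e., a GPT-2-style Transformer) is the same whether the image encoder is ViT-B/16 or RN101. In any case, I think consistency might be needed, since a text-derived direction is only meaningful for image features that live in the same embedding space as the text encoder.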
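A toy illustration of why consistency might matter, under strong assumptions: the "foreign" encoder below is just the aligned space with its axes shuffled, a hypothetical stand-in for an image encoder trained independently of the text encoder. All vectors are made up, and a real ImageNet-pretrained model would differ in more ways than this.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def foreign(v):
    """Hypothetical unrelated encoder: same content, axes shuffled."""
    return [v[2], v[3], v[0], v[1]]

# Made-up text embeddings and aligned image features.
t_sunny = normalize([0.9, 0.1, 0.3, 0.0])
t_rain = normalize([0.1, 0.9, 0.3, 0.0])
img_sunny = normalize([0.8, 0.2, 0.4, 0.1])
img_rain = normalize([0.2, 0.8, 0.4, 0.1])

# Text-derived shift direction ('sunny' -> 'rain').
direction = [r - s for r, s in zip(t_rain, t_sunny)]

def shift(feature):
    """Add the text direction to an image feature and re-normalize."""
    return normalize([f + d for f, d in zip(feature, direction)])

# Change in similarity to the rainy image after the shift, in each space.
aligned_gain = cosine(shift(img_sunny), img_rain) - cosine(img_sunny, img_rain)
foreign_gain = (cosine(shift(foreign(img_sunny)), foreign(img_rain))
                - cosine(foreign(img_sunny), foreign(img_rain)))

print(f"aligned space gain: {aligned_gain:+.3f}")
print(f"foreign space gain: {foreign_gain:+.3f}")
```

In the aligned space the shift moves the sunny feature toward the rainy one. In the shuffled space the same direction points along unrelated axes and moves it away.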
Thank you for the detailed explanation. Your insights are very helpful and enlightening!