vidit09/domaingen

Question about Semantic Augmentation

luoluo123123123123 opened this issue · 7 comments

I'm confused: if we can consider [image] as the target image embedding, why do we need to train $A_j$?

Why not just add [image] to the training step?
By the way, how much memory is needed to run this experiment in your setting? I would greatly appreciate any assistance you can provide.

Hello,
The goal of $A_j$ is to meaningfully translate the feature embeddings. We wanted all the trainable blocks, i.e. the RPN and the final bbox classifier and regressor, to use these translated features.
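Purely as an illustration of where such a translation would plug in (this is a minimal sketch, not the repo's actual code; the class name, the per-channel parameterization of $A_j$, and the feature shape are assumptions):

```python
import torch
import torch.nn as nn

class SemanticAug(nn.Module):
    """Hypothetical sketch: A_j as a trainable translation applied to the
    backbone (V^a) feature map, so that the RPN and the final box
    classifier/regressor consume translated features during training."""

    def __init__(self, channels: int = 1024):
        super().__init__()
        # one learnable offset per channel; the real parameterization of A_j
        # may differ, this only shows where the translation is applied
        self.A_j = nn.Parameter(torch.zeros(channels))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) output of the backbone V^a
        return feats + self.A_j.view(1, -1, 1, 1)

# during training, the detection heads would then see
#   heads(SemanticAug(1024)(backbone(images)))
# instead of heads(backbone(images)).
```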

We used 1 A100 GPU with 40GB VRAM in all the experiments.

[image]
Does that mean we cannot do augmentation like this?

I would like to know: what is the difference between directly subtracting the word embeddings of 'rain' and 'sunny' as the augmentation, and translating them with $A_j$? I appreciate your assistance on this matter.

The output of $\mathcal{V}^b$ corresponds to the final 512-dimensional CLIP embedding, whereas $\mathcal{V}^a$ outputs an $h \times w \times 1024$ matrix, so it is not straightforward to map the latter to the CLIP embedding space.
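As a shape-only sketch of that point (the dimensions follow the comment above; the concrete $h$, $w$ values are placeholders), the 'rain' minus 'sunny' offset lives in the 512-d CLIP space that $\mathcal{V}^b$ maps into, so it cannot simply be added to the $\mathcal{V}^a$ feature map:

```python
import torch

text_offset = torch.randn(512)            # e("rain") - e("sunny"), CLIP text space
feat_vb     = torch.randn(512)            # V^b output: pooled 512-d CLIP embedding
feat_va     = torch.randn(32, 32, 1024)   # V^a output: h x w x 1024 feature map (h, w arbitrary)

feat_vb_aug = feat_vb + text_offset       # fine: both are 512-d CLIP embeddings

try:
    feat_va + text_offset                 # fails: 1024 channels vs 512-d offset
except RuntimeError as err:
    print("direct augmentation on V^a does not line up:", err)
```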

Thank you for your explanation; I completely understand now! One last question: if I use the ViT-B/16 pre-trained CLIP model as the text encoder and the ImageNet-pretrained RN101 model as the image encoder, would this approach still be effective? Or is the method only applicable when the image encoder and text encoder come from the same source?

From my understanding, the underlying CLIP text encoder architecture (i.e. GPT-2) is the same whether the image encoder is ViT-B/16 or RN101. In any case, I think consistency might be needed.

Thank you for the detailed explanation. Your insights are very helpful and enlightening!