vidit09/domaingen

Question about Semantic Augmentation

luoluo123123123123 opened this issue · 7 comments

I'm confused: if we can consider [image] as the target image embedding, why do we need to train $A_j$?

Why not just add [image] to the training step?
By the way, how much memory is needed to run this experiment in your setting? I would greatly appreciate any assistance you can provide.

Hello,
The goal of $A_j$ is to meaningfully translate the feature embeddings. We wanted all the trainable blocks, i.e. the RPN and the final bbox classifier and regressor, to use these translated features.
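Purely as an illustration of where such a translation would plug in (this is a minimal sketch, not the repo's actual code; the class name, the per-channel parameterization of $A_j$, and the feature shape are assumptions):

```python
import torch
import torch.nn as nn

class SemanticAug(nn.Module):
    """Hypothetical sketch: A_j as a trainable translation applied to the
    backbone (V^a) feature map, so that the RPN and the final box
    classifier/regressor consume translated features during training."""

    def __init__(self, channels: int = 1024):
        super().__init__()
        # one learnable offset per channel; the real parameterization of A_j
        # may differ, this only shows where the translation is applied
        self.A_j = nn.Parameter(torch.zeros(channels))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) output of the backbone V^a
        return feats + self.A_j.view(1, -1, 1, 1)

# during training, the detection heads would then see
#   heads(SemanticAug(1024)(backbone(images)))
# instead of heads(backbone(images)).
```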

We used 1 A100 GPU with 40GB VRAM in all the experiments.

[image]
Does that mean we cannot do augmentation like this?

I would like to know: what is the difference between directly subtracting the word embeddings of 'rain' and 'sunny' as the augmentation, and translating them with $A_j$? I appreciate your assistance on this matter.

The output of $\mathcal{V}^b$ corresponds to the final 512-dimensional CLIP embedding, whereas $\mathcal{V}^a$ outputs an $h \times w \times 1024$ matrix, so it is not straightforward to map the latter to the CLIP embedding space.
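As a shape-only sketch of that point (the dimensions follow the comment above; the concrete $h$, $w$ values are placeholders), the 'rain' minus 'sunny' offset lives in the 512-d CLIP space that $\mathcal{V}^b$ maps into, so it cannot simply be added to the $\mathcal{V}^a$ feature map:

```python
import torch

text_offset = torch.randn(512)            # e("rain") - e("sunny"), CLIP text space
feat_vb     = torch.randn(512)            # V^b output: pooled 512-d CLIP embedding
feat_va     = torch.randn(32, 32, 1024)   # V^a output: h x w x 1024 feature map (h, w arbitrary)

feat_vb_aug = feat_vb + text_offset       # fine: both are 512-d CLIP embeddings

try:
    feat_va + text_offset                 # fails: 1024 channels vs 512-d offset
except RuntimeError as err:
    print("direct augmentation on V^a does not line up:", err)
```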

Thank you for your explanation; I completely understand now! One last question: if I use the ViT-B/16 pre-trained CLIP model as the text encoder and the ImageNet-pretrained RN101 model as the image encoder, would this approach still be effective? Or is the method only applicable when the image encoder and text encoder come from the same source?

From my understanding, the underlying CLIP text encoder architecture (i.e. GPT-2) is the same whether the image encoder is ViT-B/16 or RN101. In any case, I think consistency might be needed.

Thank you for the detailed explanation. Your insights are very helpful and enlightening!