Michel-liu/FatFormer

Question about the architecture analysis of the paper

Closed this issue · 1 comment

Hello, thank you for your excellent work! I have a question: since FatFormer is built on the CLIP framework, which text encoder is used in Table 4 for the Swin Transformer and MAE models? If the CLIP text encoder is used here, the pre-trained image features and text prompts are not initially aligned.

[Image attachment: Table 4 from the paper]

Hi~ Thank you for your interest in our work. For the Swin and pure ViT models, we simply add the proposed FAA module without any extra text branch, since this table aims to verify the concept of forgery adaptation across other architectures and pre-training methods. However, the idea of integrating unaligned text information into these vision models is interesting and worth exploring.
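
For readers wondering what "add the FAA module without any extra text branch" might look like in practice, here is a minimal PyTorch sketch. The names (`ForgeryAwareAdapter`, `VisionOnlyDetector`, `DummyBackbone`) and the adapter internals are hypothetical stand-ins for illustration, not the actual FatFormer implementation; the real FAA module and training setup live in this repository.

```python
import torch
import torch.nn as nn


class ForgeryAwareAdapter(nn.Module):
    """Hypothetical stand-in for the FAA module: a lightweight residual
    adapter over backbone token features. The real FAA design differs;
    see the FatFormer code for the actual module."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Residual adaptation of the (frozen) backbone features.
        return tokens + self.adapter(tokens)


class VisionOnlyDetector(nn.Module):
    """A vision backbone (e.g. Swin / ViT / MAE encoder) plus an FAA-style
    adapter and a plain linear head -- no CLIP text branch, matching the
    setup described in the reply above."""

    def __init__(self, backbone: nn.Module, dim: int, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # keep the pre-trained encoder frozen
            p.requires_grad = False
        self.faa = ForgeryAwareAdapter(dim)
        self.head = nn.Linear(dim, num_classes)  # replaces text-prompt logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone(x)              # (B, N, dim) patch tokens
        tokens = self.faa(tokens)              # forgery adaptation
        return self.head(tokens.mean(dim=1))   # pooled tokens -> real/fake logits


class DummyBackbone(nn.Module):
    """Toy encoder emitting (B, 196, 768) tokens, standing in for a
    pre-trained Swin/ViT/MAE encoder so the sketch runs end to end."""

    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.patch_embed(x).flatten(2).transpose(1, 2)


model = VisionOnlyDetector(DummyBackbone(), dim=768)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 2])
```

The key design point from the reply is that a plain classification head stands in for the text-prompt similarity used in the full CLIP-based FatFormer, so no image-text alignment is required for these backbones.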