Question about the architecture analysis of the paper
Closed this issue · 1 comment
zhliuworks commented
Hello, thank you for your excellent work! I have a question: since FatFormer is built on the CLIP framework, what text encoder is used in Table 4 for the Swin Transformer and MAE? If the CLIP text encoder is used there, the pre-trained image features and text prompts are not initially aligned.

Michel-liu commented
Hi~ thank you for your interest in our work. For Swin and pure ViT models, we simply add the proposed FAA module without any extra text branch, since that table aims to verify the concept of forgery adaptation across other architectures and pre-training methods. That said, the idea of integrating unaligned text information into these vision models is interesting and worth exploring.
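For readers wondering what "adding the FAA module without any extra text branch" might look like structurally, here is a minimal PyTorch sketch. This is not the paper's implementation: the `FAAAdapter` internals, the placeholder backbone blocks, and all names (`VisionOnlyDetector`, `backbone_dim`, etc.) are hypothetical, and the adapter is modeled as a generic residual bottleneck purely to show where such a module would sit in a vision-only pipeline.

```python
import torch
import torch.nn as nn


class FAAAdapter(nn.Module):
    """Hypothetical stand-in for a forgery-aware adapter.

    The real FAA module's internals are not reproduced here; this is a
    lightweight residual bottleneck inserted after each backbone block,
    just to illustrate the vision-only ablation setup."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)  # project down
        self.up = nn.Linear(hidden, dim)    # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the backbone features intact.
        return x + self.up(self.act(self.down(x)))


class VisionOnlyDetector(nn.Module):
    """Vision backbone + adapter modules + linear head, with no text
    branch -- mirroring the Swin/ViT ablation described above."""

    def __init__(self, backbone_dim: int = 768, num_blocks: int = 4):
        super().__init__()
        # Placeholder blocks; in practice these would be (frozen)
        # Swin / ViT / MAE encoder blocks.
        self.blocks = nn.ModuleList(
            nn.Linear(backbone_dim, backbone_dim) for _ in range(num_blocks)
        )
        self.adapters = nn.ModuleList(
            FAAAdapter(backbone_dim) for _ in range(num_blocks)
        )
        self.head = nn.Linear(backbone_dim, 2)  # real-vs-fake logits

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        for blk, ada in zip(self.blocks, self.adapters):
            tokens = ada(blk(tokens))
        return self.head(tokens.mean(dim=1))  # pool tokens, then classify


model = VisionOnlyDetector()
logits = model(torch.randn(2, 16, 768))  # (batch=2, tokens=16, dim=768)
print(logits.shape)  # torch.Size([2, 2])
```

The key point the sketch tries to capture is that the classifier head consumes pooled image tokens directly, so no text encoder (aligned or not) is involved in this configuration.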