gwang-kim/DiffusionCLIP

Text editing in non-isolated images

tsaxena opened this issue · 1 comment

Hi,
Thanks for your work. I am trying the pretrained models on a few test images to see what the results look like, starting with tennis_baseball_t500.pth. It works well when the tennis ball is well isolated, but not so well when the ball is part of a larger scene. The paper says fine-tuning needs about 30 images. Were those images well isolated? If I replace them with images where the tennis ball is a small part of the image, will the performance improve?

Hi @tsaxena, thanks for your interest.
Yes, I think fine-tuning the model on more images, including both isolated images and images where the tennis ball is a small part of the scene, should help it generalize better to the cases you mentioned. In my opinion, though, the CLIP image encoder has limited ability to localize objects, so the improvement will also be limited.
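
For anyone who wants to try the mixed set suggested above, here is a minimal sketch of one way to assemble it. It is not part of the DiffusionCLIP code: the function name `build_mixed_finetune_set`, the `scene_fraction` parameter, and the example paths are all hypothetical, and the 50/50 split is only an assumption to experiment with, not a recommended setting.

```python
import random


def build_mixed_finetune_set(isolated, in_scene, total=30, scene_fraction=0.5, seed=0):
    """Pick `total` image paths, mixing isolated and in-scene examples.

    Hypothetical helper: `isolated` and `in_scene` are lists of image paths,
    and `scene_fraction` is the assumed share of in-scene images in the mix.
    """
    if not 0.0 <= scene_fraction <= 1.0:
        raise ValueError("scene_fraction must be between 0 and 1")

    n_scene = round(total * scene_fraction)
    n_isolated = total - n_scene
    if n_scene > len(in_scene) or n_isolated > len(isolated):
        raise ValueError("not enough images to build the requested mix")

    # A seeded generator keeps the selection reproducible across runs.
    rng = random.Random(seed)
    chosen = rng.sample(isolated, n_isolated) + rng.sample(in_scene, n_scene)
    rng.shuffle(chosen)
    return chosen


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    isolated = [f"isolated/{i}.png" for i in range(40)]
    in_scene = [f"scene/{i}.png" for i in range(40)]
    mixed = build_mixed_finetune_set(isolated, in_scene)
    print(len(mixed), "images selected")
```

The resulting list would still need to be passed to whatever fine-tuning data loading you use. Varying `scene_fraction` is a simple way to check how much in-scene images help before the localization limit of the CLIP image encoder dominates.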