About the choice of visual models
zhiyuanyou opened this issue · 3 comments
Hi~
Thanks for your great work!
I have read your paper and gone through this script in detail (https://github.com/OpenGVLab/InternGPT/blob/main/iGPT/controllers/ConversationBot.py).
I noticed that the visual model to use is determined by certain keywords, e.g., remove and erase trigger LDMInpainting, while describe and introduce trigger HuskyVQA. This is a direct and effective approach.
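For readers skimming the thread, the routing described above can be sketched roughly as follows. This is a minimal illustration, not the actual InternGPT code; the table contents and function names here are my own assumptions.

```python
# Hypothetical sketch of keyword-based model routing, similar in spirit to
# the dispatch in ConversationBot.py. Names are illustrative only.

KEYWORD_TO_MODEL = {
    ("remove", "erase"): "LDMInpainting",    # inpainting model for object removal
    ("describe", "introduce"): "HuskyVQA",   # VQA model for image description
}

def route(user_text: str):
    """Return the model whose trigger keyword appears in the input, else None."""
    text = user_text.lower()
    for keywords, model in KEYWORD_TO_MODEL.items():
        if any(kw in text for kw in keywords):
            return model
    return None  # no keyword matched: the request falls through unhandled

print(route("please remove the dog"))   # LDMInpainting
print(route("describe this image"))     # HuskyVQA
print(route("take out some objects"))   # None
```

The last call shows the failure mode raised in this issue: a paraphrase that avoids the trigger words matches nothing.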
However, I wonder what happens if the user does not use such keywords. For example, the user could type "take out some objects" instead of "remove some objects" when asking for object removal.
Thanks in advance.
Thank you for your attention!
Your idea is quite interesting. To be frank, we are working on this issue. However, natural language is often ambiguous. For example, "take out some objects" could mean that you want to remove the objects from the image, or that you want to extract the regions of those objects. Interestingly, "the region of an object" is itself ambiguous: it could be a bounding box produced by a detector or a mask produced by a segmentor. So, as you can see, this problem is quite intractable.
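To make the ambiguity concrete, a single phrase can plausibly map to several candidate handlers, so a keyword router would have to disambiguate (or ask the user). The mapping and names below are hypothetical, not InternGPT's actual API:

```python
# Illustrative only: one phrase, multiple plausible actions.
CANDIDATES = {
    "take out": [
        "LDMInpainting",    # erase the object from the image
        "ObjectDetector",   # return a bounding box for the object
        "Segmentor",        # return a pixel mask for the object
    ],
}

def candidate_actions(user_text: str):
    """Collect every model that could plausibly handle the input."""
    text = user_text.lower()
    matches = []
    for phrase, models in CANDIDATES.items():
        if phrase in text:
            matches.extend(models)
    return matches

print(candidate_actions("take out the cat"))
# ['LDMInpainting', 'ObjectDetector', 'Segmentor'] -> needs disambiguation
```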
Good news: we will release InternGPT v2 as soon as possible. We believe the upgraded version will partly address your issue.
Please stay tuned for our work. 🍻🍻🍻🍻🍻🍻
Thanks for your response.
I understand that if we delegate the choice of visual model entirely to GPT, the processing results can be somewhat unsatisfying.
Currently, it is still a tradeoff between generalization and quality.
Yep! I couldn't agree with you more. The next version of iGPT will address this issue.