About the choice of visual models
zhiyuanyou opened this issue · 3 comments
Hi~
Thanks for your great work!
I have read your paper and gone through this script in detail (https://github.com/OpenGVLab/InternGPT/blob/main/iGPT/controllers/ConversationBot.py).
I noticed that the visual model to use is determined by certain keywords, e.g., remove and erase trigger LDMInpainting, while describe and introduce trigger HuskyVQA. This is a direct and effective approach.
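For readers skimming the thread, the routing described above can be sketched roughly as follows. This is a minimal illustration, not the actual InternGPT code; the table contents and function names here are my own assumptions.

```python
# Hypothetical sketch of keyword-based model routing, similar in spirit to
# the dispatch in ConversationBot.py. Names are illustrative only.

KEYWORD_TO_MODEL = {
    ("remove", "erase"): "LDMInpainting",    # inpainting model for object removal
    ("describe", "introduce"): "HuskyVQA",   # VQA model for image description
}

def route(user_text: str):
    """Return the model whose trigger keyword appears in the input, else None."""
    text = user_text.lower()
    for keywords, model in KEYWORD_TO_MODEL.items():
        if any(kw in text for kw in keywords):
            return model
    return None  # no keyword matched: the request falls through unhandled

print(route("please remove the dog"))   # LDMInpainting
print(route("describe this image"))     # HuskyVQA
print(route("take out some objects"))   # None
```

The last call shows the failure mode raised in this issue: a paraphrase that avoids the trigger words matches nothing.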
However, I wonder what happens if the user does not use such keywords. For example, the user could type "take out some objects" instead of "remove some objects" when asking for object removal.
Thanks in advance.
Thank you for your attention!
Your idea is quite interesting. To be frank, we are working on this issue. However, natural language is often ambiguous. For example, "take out some objects" could mean that you want to remove the objects from the image, or that you want to extract the regions of those objects. Interestingly, "the region of an object" is itself ambiguous: it could be a bounding box produced by a detector or a mask produced by a segmentor. So, as you can see, this problem is quite intractable.
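To make the ambiguity concrete, a single phrase can plausibly map to several candidate handlers, so a keyword router would have to disambiguate (or ask the user). The mapping and names below are hypothetical, not InternGPT's actual API:

```python
# Illustrative only: one phrase, multiple plausible actions.
CANDIDATES = {
    "take out": [
        "LDMInpainting",    # erase the object from the image
        "ObjectDetector",   # return a bounding box for the object
        "Segmentor",        # return a pixel mask for the object
    ],
}

def candidate_actions(user_text: str):
    """Collect every model that could plausibly handle the input."""
    text = user_text.lower()
    matches = []
    for phrase, models in CANDIDATES.items():
        if phrase in text:
            matches.extend(models)
    return matches

print(candidate_actions("take out the cat"))
# ['LDMInpainting', 'ObjectDetector', 'Segmentor'] -> needs disambiguation
```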
Good news: we will release InternGPT v2 as soon as possible. We believe the upgraded version will partly address your issue.
Please stay tuned for our work. 🍻🍻🍻🍻🍻🍻
Thanks for your response.
I understand that if we delegate the choice of visual model entirely to GPT, the processing results can be somewhat unsatisfying.
Currently, it is still a tradeoff between generalization and quality.
Yep! I couldn't agree with you more. The next version of iGPT will address this issue.