mazzzystar/Queryable

think about image-to-text

nonacosa opened this issue · 4 comments

This project is amazing. I have an idea: what if we reverse the process and perform image-to-text analysis on a single image? For example:

a dog sitting next to a car

like this model: https://huggingface.co/nlpconnect/vit-gpt2-image-captioning

Would that be feasible? I don't have much experience with large models and would like to ask for some design advice. Thank you :>
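For reference, the linked captioning model can be run off-device with the Hugging Face `transformers` library; a minimal sketch, assuming `transformers`, `torch`, and `Pillow` are installed (the model weights are downloaded on first use):

```python
from transformers import pipeline

def caption(image_path: str) -> str:
    # "image-to-text" pipeline: ViT image encoder + GPT-2 text decoder
    captioner = pipeline(
        "image-to-text", model="nlpconnect/vit-gpt2-image-captioning"
    )
    # the pipeline returns a list of dicts like {"generated_text": "..."}
    return captioner(image_path)[0]["generated_text"]
```

This is server-side Python, not something that runs inside Queryable's iOS app; shipping it on-device would require a Core ML conversion of both the encoder and decoder.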

Does this project meet your requirements? https://github.com/apple/ml-stable-diffusion

> Does this project meet your requirements? https://github.com/apple/ml-stable-diffusion

No, I understand that this project is text-to-image, but I'm interested in image-to-text. Given an image, I want to get a natural language explanation of the image, like this:

Select an image 👇🏻

[image]

Obtain the caption of the image 👇🏻

a soccer player kicking a soccer ball

Can this be done with CLIP, as in this project? I was inspired by this project and would like to build this feature, so I'm wondering whether CLIP is enough to implement it. I'd appreciate any advice. Thank you :>

Oh, sorry, I misunderstood your intention. This could be done with another model called BLIP.

I'm afraid that I cannot add this feature to Queryable, as the BLIP model is too large. Currently, Queryable is around 300MB; it can't be any larger. However, I think this feature would be useful for the blind.
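One size-friendly alternative worth noting (not a Queryable feature, just a sketch): CLIP cannot generate free-form text, but since the app already ships a CLIP model, you could do caption *retrieval*: embed a fixed list of candidate captions with CLIP's text encoder once, then return the caption whose embedding is most similar to the image embedding. The ranking step, shown here with made-up 3-D "embeddings" in place of real CLIP vectors:

```python
import numpy as np

def rank_captions(image_emb: np.ndarray, caption_embs: np.ndarray) -> np.ndarray:
    """Return caption indices sorted best-first by cosine similarity.

    image_emb: (d,) image embedding; caption_embs: (n, d) text embeddings.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = txt @ img          # cosine similarity of each caption to the image
    return np.argsort(-sims)  # highest similarity first

# Toy example: caption 0 points near the image vector, caption 1 is orthogonal.
image = np.array([1.0, 0.0, 0.0])
captions = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
order = rank_captions(image, captions)  # caption 0 should rank first
```

The obvious trade-off is that retrieval can only return captions from the predefined list, whereas BLIP or ViT-GPT2 generate novel sentences.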

Thank you very much, your answer is very helpful to me.