think about image-to-text
nonacosa opened this issue · 4 comments
This project is amazing. I have an idea: what if we reverse the process and perform image-to-text analysis on a single image? For example:
a dog sitting next to a car
like this model: https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
Would that be feasible? I don't have much experience with large models and would like some design advice. Thank you :>
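For concreteness, here is roughly what the linked model does; this is a minimal sketch using the Hugging Face `transformers` pipeline, and `dog.jpg` is a placeholder path I made up for illustration:

```python
from transformers import pipeline

# Load the ViT-GPT2 captioning model from the Hugging Face Hub.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# "dog.jpg" is a placeholder; any local image path works here.
result = captioner("dog.jpg")
print(result[0]["generated_text"])  # e.g. "a dog sitting next to a car"
```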
Does this project meet your requirements? https://github.com/apple/ml-stable-diffusion
No, I understand that this project is text-to-image, but I'm interested in image-to-text. Given an image, I want to get a natural-language description of it, like this:
Select an image 👇🏻
Obtain the caption of the image 👇🏻
a soccer player kicking a soccer ball
Could this be done with CLIP, like in this project? Queryable inspired me, and I'd like to build this small feature myself. So I'm wondering whether CLIP can be used to implement image captioning. I'd appreciate any advice. Thank you :>
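To make the question concrete: as far as I can tell, CLIP scores image-text similarity rather than generating text, so the only CLIP-based approach I can picture is ranking a fixed list of candidate captions. A rough sketch, where the checkpoint name, candidate list, and `photo.jpg` path are just assumptions for illustration:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP cannot generate captions; it can only score how well each
# candidate caption matches the image.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidates = [
    "a soccer player kicking a soccer ball",
    "a dog sitting next to a car",
    "a bowl of fruit on a table",
]
image = Image.open("photo.jpg")  # placeholder path

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_candidates)

# Print the candidate caption CLIP considers the best match.
print(candidates[logits.argmax().item()])
```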
Oh, sorry, I misunderstood your intention. This could be done with another model called BLIP.
I'm afraid I cannot add this feature to Queryable, as the BLIP model is too large: Queryable is currently around 300MB, and it can't get any larger. That said, I think this feature would be useful for blind users.
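If you want to prototype it outside the app, here is a minimal captioning sketch with a BLIP checkpoint from `transformers`; the checkpoint name and image path are just examples:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# BLIP base weights are on the order of 1GB, far larger than the
# ~300MB Queryable app bundle, which is why it can't ship in-app.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("photo.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```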
Thank you very much, your answer is very helpful to me.