mazzzystar/Queryable

think about image-to-text

nonacosa opened this issue · 4 comments

This project is amazing. I have an idea: what if we reverse the process and perform image-to-text analysis on a single image? For example:

a dog sitting next to a car

like this model: https://huggingface.co/nlpconnect/vit-gpt2-image-captioning

Would that be feasible? I don't have much experience with large models and would like to ask for some design advice. Thank you :>
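For reference, the linked captioning model can be run off-device with the Hugging Face `transformers` library; a minimal sketch, assuming `transformers`, `torch`, and `Pillow` are installed (the model weights are downloaded on first use):

```python
from transformers import pipeline

def caption(image_path: str) -> str:
    # "image-to-text" pipeline: ViT image encoder + GPT-2 text decoder
    captioner = pipeline(
        "image-to-text", model="nlpconnect/vit-gpt2-image-captioning"
    )
    # the pipeline returns a list of dicts like {"generated_text": "..."}
    return captioner(image_path)[0]["generated_text"]
```

This is server-side Python, not something that runs inside Queryable's iOS app; shipping it on-device would require a Core ML conversion of both the encoder and decoder.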

Does this project meet your requirements? https://github.com/apple/ml-stable-diffusion

> Does this project meet your requirements? https://github.com/apple/ml-stable-diffusion

No, I understand that this project is text-to-image, but I'm interested in image-to-text. Given an image, I want to get a natural language explanation of the image, like this:

Select an image 👇🏻

[image]

Obtain the caption of the image 👇🏻

a soccer player kicking a soccer ball

Can this be done with CLIP, as in this project? I was inspired by this project and would like to build this feature, so I'm wondering whether CLIP is enough to implement it. I'd appreciate any advice. Thank you :>

Oh, sorry, I misunderstood your intention. This could be done with another model called BLIP.

I'm afraid that I cannot add this feature to Queryable, as the BLIP model is too large. Currently, Queryable is around 300MB; it can't be any larger. However, I think this feature would be useful for the blind.
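One size-friendly alternative worth noting (not a Queryable feature, just a sketch): CLIP cannot generate free-form text, but since the app already ships a CLIP model, you could do caption *retrieval*: embed a fixed list of candidate captions with CLIP's text encoder once, then return the caption whose embedding is most similar to the image embedding. The ranking step, shown here with made-up 3-D "embeddings" in place of real CLIP vectors:

```python
import numpy as np

def rank_captions(image_emb: np.ndarray, caption_embs: np.ndarray) -> np.ndarray:
    """Return caption indices sorted best-first by cosine similarity.

    image_emb: (d,) image embedding; caption_embs: (n, d) text embeddings.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = txt @ img          # cosine similarity of each caption to the image
    return np.argsort(-sims)  # highest similarity first

# Toy example: caption 0 points near the image vector, caption 1 is orthogonal.
image = np.array([1.0, 0.0, 0.0])
captions = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
order = rank_captions(image, captions)  # caption 0 should rank first
```

The obvious trade-off is that retrieval can only return captions from the predefined list, whereas BLIP or ViT-GPT2 generate novel sentences.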

Thank you very much, your answer is very helpful to me.