dwyl/image-classifier

Epic: Image Classifier

nelsonic opened this issue · 9 comments

Once we are uploading images dwyl/imgup#51
We want to classify the images and suggest meta tags to describe the images so that they become "searchable".
That means pulling any text out of images using OCR.
And attempting to find any detail in images that can be useful.

We aren't going to build our own models from scratch, but we are going to ...

Todo

  • Research the available models and services/APIs we can send an image to for classification.

  • Research available OCR services or models.

    • If there is an Open Source OCR model we can run on our own infra, e.g. for €20/month on Fly.io, please share!!
  • Images that are uploaded from a Camera or Smart Phone contain metadata including camera type/model, location (where the photo was taken), ISO, Shutter, Focal Length, Original Resolution, etc. We want to capture this and feed it into the classifier. #3

  • The objective of the classifier is to attempt to describe the image and return a few keywords.

  • If it makes more sense to have this as a standalone app (separate from imgup) then feel free to create a new repo! Then just send the data to the standalone app and receive JSON data in response. 💭
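One of the to-dos above is capturing camera metadata (camera model, ISO, shutter, etc.) from uploaded photos. A minimal sketch of that step, assuming we use Pillow (the function name `read_exif` is just illustrative):

```python
# Sketch: reading camera metadata (EXIF) from an uploaded image with Pillow.
# Which tags are present depends entirely on the camera/phone that took the photo.
from PIL import Image
from PIL.ExifTags import TAGS


def read_exif(path):
    """Return a {tag_name: value} dict of the EXIF data stored in an image file."""
    with Image.open(path) as img:
        exif = img.getexif()
        # TAGS maps numeric EXIF tag IDs to readable names like "Model" or "ISOSpeedRatings".
        return {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
```

Note that GPS coordinates live in a separate GPS sub-IFD rather than the top-level EXIF block, so extracting the location needs one extra step.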

@LuchoTurtle please leave comments with your research. 🙏

Context

We want to be able to upload images in our App and have them become an item of content.
i.e. I take a photo of a messy kitchen and it becomes "Tidy The Kitchen" with a small thumbnail of the image.
If I tap on the thumbnail I see the full-screen. But the Text is the important part.

The reason we want to have a "Visual Todo List" is that it becomes easy for people who don't yet read (think toddlers) or people who don't read well (people who only have basic literacy) to follow instructions.

Stumbled upon these two, which might be relevant to revisit at a later stage:
https://github.com/bentoml/OpenLLM
https://github.com/showlab/Image2Paragraph

Yeah, saw OpenLLM on HN this morning:
(screenshot: OpenLLM on the Hacker News front page)
https://news.ycombinator.com/item?id=36388219
Looks good. BentoML is what OpenAI could have been but they chose to go closed (MSFT) ... 🙄

I've thought about what would be the best way of doing this and I've found a fair share of resources that I think may help get something close to what we want.

Image Captioning models

Most common LLMs, such as Llama 2 or Claude 2, only accept text input. I took a gander at https://github.com/bentoml/OpenLLM, as I stated in the comment above. However, it's not really useful to us, as these LLMs do not understand image inputs (though some may be able to work with vector representations of images). Therefore, we have to forgo these more "mainstream" LLMs for this use case.

There are, however, computer-vision models we can definitely use. I started my dive at https://github.com/salesforce/LAVIS#image-captioning, which led me to discover BLIP-2, a zero-shot image-to-text generation model that we can use for image captioning.

I'm not going to explain how BLIP-2 works, but you can find more info about it at https://huggingface.co/blog/blip-2. The good thing is that it's available in Hugging Face Transformers, so we can easily download and run BLIP-2 as a pre-trained model, even if it's just for testing purposes.

You can find a demo at https://huggingface.co/spaces/Salesforce/BLIP2.
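Since BLIP-2 ships in Transformers, a minimal captioning sketch could look like the following. The checkpoint name is the real Salesforce one, but it weighs several GB, so this is for local experimentation only; `max_new_tokens=30` is an arbitrary choice:

```python
def caption_with_blip2(image_path, model_name="Salesforce/blip2-opt-2.7b"):
    """Caption an image with BLIP-2. Downloads the checkpoint on first call."""
    # Imports are deferred so merely defining this function costs nothing.
    from transformers import Blip2Processor, Blip2ForConditionalGeneration
    from PIL import Image

    processor = Blip2Processor.from_pretrained(model_name)
    model = Blip2ForConditionalGeneration.from_pretrained(model_name)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    generated_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

This follows the pattern shown in the Hugging Face BLIP-2 blog post linked above.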

Langchain 🦜

I'd been hearing about Langchain for a few months: it makes it easy to create LLM-based applications and to chain different models together to produce a given output for whatever use case. And the fact that you can easily deploy it to fly.io is a big plus.

I was thinking of using BLIP-2 and chaining it to an open-source LLM like Llama 2 to get a more descriptive caption of the image, so we could extract keywords afterwards.
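The chaining idea can be sketched without any framework at all. Both steps below are hypothetical stand-ins (in practice the first would be BLIP-2's output and the second a real LLM call), and the keyword pass is a deliberately naive placeholder:

```python
# Sketch of the chain: caption -> LLM prompt -> keywords. All logic here is a
# stand-in; the real pipeline would call BLIP-2 and then an LLM such as Llama 2.

def expand_caption(caption: str) -> str:
    """Stand-in for the LLM step: build the prompt we would send to the model."""
    # A real implementation would return llm(prompt); we return the prompt
    # itself so the shape of the chain is visible.
    return f"Describe this scene in more detail and list keywords: {caption}"


def extract_keywords(text: str, max_keywords: int = 5) -> list[str]:
    """Naive keyword pass: keep the longest distinct words as a placeholder
    for whatever the LLM would actually return."""
    words = {w.strip(".,:").lower() for w in text.split() if len(w.strip(".,:")) > 4}
    return sorted(words, key=len, reverse=True)[:max_keywords]


caption = "a cluttered kitchen counter with dirty dishes"
keywords = extract_keywords(expand_caption(caption))
```

The point is only the shape: model A's text output becomes model B's input, and the final step reduces a paragraph to a handful of searchable tags.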

Image2Paragraph

However, I realised I was re-creating Image2Paragraph, which does exactly this but adds the capabilities of two more models: GRIT and Segment Anything, which provide contextual descriptions of images. The outputs of all three models (BLIP-2, GRIT, Segment Anything) are then fed to an LLM (GPT, in this case) to generate a text paragraph describing the image.

Here's how the pipeline works:

(diagram: the Image2Paragraph pipeline, with BLIP-2, GRIT and Segment Anything outputs feeding an LLM)
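The fan-in described above can be sketched as plain orchestration code. All four callables here are hypothetical stubs standing in for the real models; only the wiring is the point:

```python
# Sketch of the Image2Paragraph fan-in: three vision models each describe the
# image, and their outputs are merged into a single prompt for the paragraph LLM.

def blip2_caption(image):
    """Stub for BLIP-2: one global caption for the whole image."""
    return "a messy kitchen counter"


def grit_regions(image):
    """Stub for GRIT: dense, region-level descriptions."""
    return ["dirty plates on the left", "a sponge near the sink"]


def sam_segments(image):
    """Stub for Segment Anything: labels for the segmented objects."""
    return ["counter", "plates", "sink"]


def build_llm_prompt(image) -> str:
    """Merge all three model outputs into one prompt for the paragraph LLM."""
    return (
        f"Caption: {blip2_caption(image)}\n"
        f"Regions: {'; '.join(grit_regions(image))}\n"
        f"Objects: {', '.join(sam_segments(image))}\n"
        "Write one paragraph describing this image."
    )


prompt = build_llm_prompt("kitchen.jpg")  # path is illustrative; stubs ignore it
```

Swapping the stubs for real model calls (and sending `prompt` to GPT or an open-source LLM) recovers the Image2Paragraph architecture.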

So what to use?

You should give Image2Paragraph a whirl (I already tried the Hugging Face Space but it's not working: https://huggingface.co/spaces/Awiny/Image2Paragraph), but I don't see a clear way of deploying it to fly.io so it can receive an image URL and output the paragraph. If I can only have this on localhost, there's no point in pursuing it.

So I wonder whether using only BLIP-2, or the vit-gpt2-image-captioning model from Hugging Face, is easier and more "doable" for what we want.

(The latter seems like a highly plausible option using transformers. See https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/).

Good research/summary. Thanks. 👌

As @nelsonic suggested, we can give https://github.com/elixir-image/image a whirl, as well.

@LuchoTurtle I've lowered the priority on this issue to reflect the fact that it's a very "nice to have" feature but isn't "core" to the experience of our App for the time being. We need to focus on the WYSIWYG editor and getting the "core" functionality done and then shipping the Flutter App to the App Store ASAP. ⏳

Ref: dwyl/product-roadmap#40 we need to work on the Flutter App as our exclusive focus until we have feature parity with the Elixir/Phoenix MVP. I want to be using the Flutter App on my phone ASAP. 🙏

Having said that, when you take "breaks" from the Flutter work and want to do research for image classifying, please do it. I know that AI/ML is an area of interest/focus for you so definitely research and capture what you learn. 🔍 🧑‍💻 ✍️ ✅

It will be an awesome enhancement to add image recognition to the images people upload in the Flutter App.
But if we don't yet have a Flutter App deployed to the App Store dwyl/app#342 or Google Play dwyl/app#346 we are a "Default Dead" company.

@LuchoTurtle given that we are BLOCKED on both iOS App Store dwyl/app#342 (comment) and Google Play dwyl/app#346, both assigned to @iteles 🔥
Please take a look at this issue today.
We should create a new repo for it: https://github.com/dwyl/image-classifier 🆕 ✅
Feel free to use Python for it if you think you can do it faster. 🐍
Otherwise, if you can use Elixir, it will be easier for us to maintain longer-term. 💧