Epic: Image Classifier
nelsonic opened this issue · 9 comments
Once we are uploading images (dwyl/imgup#51),
we want to classify the images and suggest meta tags to describe them so that they become "searchable".
That means pulling any text out of images using OCR,
and attempting to extract any detail from the images that could be useful.
We aren't going to build our own models from scratch, but we are going to ...
Todo
- Research the available models and services/APIs we can use to send an image and have it classified.
- Research available OCR services or models. (See the Tesseract sketch below this list.)
  - If there is an Open Source OCR model we can run on our own infra, e.g. for €20/month on Fly.io, share!!
- Images that are uploaded from a Camera or Smart Phone contain metadata including `camera type/model`, `location` (where the photo was taken), `ISO`, `Shutter`, `Focal Length`, `Original Resolution`, etc. We want to capture this and feed it into the classifier (see the EXIF sketch below this list). #3
- The objective of the classifier is to attempt to describe the image and return a few keywords.
- If it makes more sense to have this as a standalone `app` (separate from `imgup`) then feel free to create a `new` repo! Then just send the data to the standalone app and receive `JSON` data in response.
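On the OCR item: one Open Source option we could run on our own infra is Tesseract. A minimal sketch using the pytesseract wrapper (an assumption for illustration, not a decision; the image path is illustrative and the tesseract binary must be installed):

```python
# Minimal sketch: pull text out of an image with Tesseract via pytesseract.
# Assumes the `tesseract` binary is installed (e.g. apt install tesseract-ocr).
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("photo.jpg"))  # illustrative path
print(text)
```

And for the metadata item, a minimal sketch of reading EXIF data with Pillow (field names vary by device; ISO, shutter speed and GPS live in sub-IFDs reachable via `exif.get_ifd()`):

```python
# Minimal sketch: read EXIF metadata from an uploaded photo with Pillow.
from PIL import Image, ExifTags

def exif_metadata(path: str) -> dict:
    # Map numeric EXIF tag ids to human-readable names.
    exif = Image.open(path).getexif()
    return {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

meta = exif_metadata("photo.jpg")  # illustrative path
print(meta.get("Model"))     # e.g. "iPhone 12"
print(meta.get("DateTime"))  # e.g. "2023:06:21 10:30:00"
```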
@LuchoTurtle please leave comments with your research.
Context
We want to be able to upload images in our `App` and have them become an `item` of content.
i.e. I take a photo of a messy kitchen and it becomes "Tidy The Kitchen" with a small thumbnail of the image.
If I tap on the thumbnail I see the full-screen image. But the Text is the important part.
The reason we want to have a "Visual Todo List" is that it makes it easy for people who can't yet read (think toddlers) or who don't read well (people with only basic literacy) to follow instructions.
Stumbled upon these two, which might be relevant to revisit at a later stage:
https://github.com/bentoml/OpenLLM
https://github.com/showlab/Image2Paragraph
Yeah, saw `OpenLLM` on HN this morning:
https://news.ycombinator.com/item?id=36388219
Looks good. `BentoML` is what `OpenAI` could have been, but they chose to go closed (MSFT) ...
I've thought about what would be the best way of doing this and found a fair share of resources that I think may help us get something close to what we want.
Image Captioning models
Most common LLMs, such as Llama 2 or Claude 2, only receive text input. I took a gander at https://github.com/bentoml/OpenLLM, as I stated in the comment above. However, it's not really useful to us, as these LLMs do not understand image inputs (though maybe some of them can work with vector representations of images). Therefore, we have to forgo these more "mainstream" LLMs for this use case.
There are, however, computer-vision models we can definitely use. I started my dive at https://github.com/salesforce/LAVIS#image-captioning, which led me to discover `BLIP-2`, a zero-shot image-to-text generation model that we can use for image captioning.
I'm not going to explain how `BLIP-2` works, but you can find more info about it at https://huggingface.co/blog/blip-2. The good thing about it is that it's available in Hugging Face Transformers, so we can download and run `BLIP-2` as a pre-trained model quite easily, even if it's just for testing purposes.
You can find a demo at https://huggingface.co/spaces/Salesforce/BLIP2.
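For reference, here's a minimal sketch of running `BLIP-2` with Hugging Face Transformers, following the blog post above (the checkpoint and image path are illustrative; the opt-2.7b checkpoint wants a GPU with several GB of VRAM):

```python
# Minimal sketch: zero-shot image captioning with BLIP-2 via Transformers.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)  # on CPU you may need torch.float32 instead of float16

image = Image.open("photo.jpg").convert("RGB")  # illustrative path
inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)  # e.g. "a kitchen counter covered in dirty dishes"
```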
Langchain
I had heard about Langchain several times over the past few months: it makes it easy to create LLM-based applications and to chain different models together to yield a given output for whatever use case. And the fact that you can easily deploy it to `fly.io` is a big plus.
I was thinking of using `BLIP-2` and chaining it to an open-source LLM like Llama 2, to get a more descriptive caption of the image from which we could extract keywords afterwards.
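As a rough illustration of that chain: `caption_image` is a hypothetical helper wrapping the `BLIP-2` snippet above, and this uses the classic `LLMChain`/`PromptTemplate` interface, which shifts between Langchain versions. It calls OpenAI only because that's the shortest Langchain example; an open-source model could be swapped in via a self-hosted endpoint.

```python
# Rough sketch: feed a BLIP-2 caption into an LLM to extract keywords.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["caption"],
    template=(
        "Here is a caption describing an image:\n{caption}\n"
        "Return 3-5 comma-separated keywords that describe the image."
    ),
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

caption = caption_image("photo.jpg")  # hypothetical helper (BLIP-2 snippet above)
print(chain.run(caption=caption))     # e.g. "kitchen, dishes, sink, clutter"
```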
Image2Paragraph
However, I realised that I was doing something similar to `Image2Paragraph`, which already does this, but with the added capabilities of two more models: `GRIT` and `Segment Anything`, which provide contextual descriptions of images. The outputs of all three models (`BLIP-2`, `GRIT`, `Segment Anything`) are later fed to an LLM (GPT, in this case) to generate a text paragraph describing the image.
Here's how the pipeline works:
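Roughly, in illustrative pseudocode (these helpers are hypothetical stand-ins for the three models, not `Image2Paragraph`'s actual API; there's a pipeline diagram in the `Image2Paragraph` README):

```python
# Illustrative pseudocode of the Image2Paragraph-style pipeline.
def image_to_paragraph(path: str) -> str:
    caption = blip2_caption(path)            # global caption (BLIP-2)
    regions = grit_dense_captions(path)      # region-level captions (GRIT)
    segments = segment_anything_masks(path)  # object masks/locations (Segment Anything)
    prompt = (
        f"Overall caption: {caption}\n"
        f"Region captions: {regions}\n"
        f"Segment locations: {segments}\n"
        "Combine these into one coherent paragraph describing the image."
    )
    return gpt_complete(prompt)              # the LLM fuses the three signals
```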
So what to use?
You should give `Image2Paragraph` a whirl (I already tried it on Hugging Face but it's not working: https://huggingface.co/spaces/Awiny/Image2Paragraph), but I don't see a clear way of using it to receive an image URL, output the paragraph, and be deployed on `fly.io`. If I can only have this on `localhost`, there's no point in pursuing it.
So I wonder if only using `BLIP-2` or the `vit-gpt2-image-captioning` model from Hugging Face is easier and more "doable" for what we want.
(The latter seems like a highly plausible option using Transformers. See https://ankur3107.github.io/blogs/the-illustrated-image-captioning-using-transformers/.)
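A minimal sketch of the latter via the Transformers pipeline API (the image path is illustrative; this model is much smaller than `BLIP-2`, so it runs fine on CPU, though its captions are shorter and less detailed):

```python
# Minimal sketch: captioning with nlpconnect/vit-gpt2-image-captioning.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("photo.jpg")  # illustrative path; image URLs also work
print(result[0]["generated_text"])  # e.g. "a kitchen with a sink and a window"
```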
Good research/summary. Thanks.
As @nelsonic suggested, we can give https://github.com/elixir-image/image a whirl, as well.
@LuchoTurtle I've lowered the priority on this issue to reflect the fact that it's a very "nice to have" feature but isn't "core" to the experience of our `App` for the time being. We need to focus on the `WYSIWYG` editor, get the "core" functionality done, and then ship the `Flutter` App to the `App Store` ASAP.
Ref: dwyl/product-roadmap#40. We need to work on the `Flutter` App as our exclusive focus until we have feature parity with the `Elixir`/`Phoenix` `MVP`. I want to be using the `Flutter` App on my phone ASAP.
Having said that, when you take "breaks" from the `Flutter` work and want to do research on image classifying, please do it. I know that AI/ML is an area of interest/focus for you, so definitely research and capture what you learn.
It will be an awesome enhancement to add image recognition to the images people upload in the `Flutter` App.
But if we don't yet have a `Flutter` App deployed to the `App Store` dwyl/app#342 or `Google Play` dwyl/app#346, we are a "Default Dead" company.
@LuchoTurtle given that we are `BLOCKED` on both the iOS `App Store` dwyl/app#342 (comment) and `Google Play` dwyl/app#346, both assigned to @iteles, please take a look at this issue today.
We should create a `new` repo for it: https://github.com/dwyl/image-classifier
Feel free to use `Python` for it if you think you can do it faster.
Otherwise, if you can use `Elixir`, it will be easier for us to maintain longer-term.