Feat: Comparing Pre-trained Image Classification Models
nelsonic opened this issue · 9 comments
@LuchoTurtle as you've noted in the README.md > "What about other models?" section:

> The bigger the model, the more resources consumed and the slower the result ...

This is your opportunity to do some actual Software Engineering and write up the findings!
Todo

- [ ] Create a new file: `model-comparison.md`
- [ ] Give a brief intro to why someone would want to use a smaller/bigger model (e.g. for a low-stakes demo)
- [ ] Deploy all three models: ResNet-50, BLIP (base) and BLIP (large)
- [ ] Compare the response time and classification string for the same input images (see the timing sketch after this list)
  - Pick 7 sample images, e.g. a Pet, Vehicle, Food, Person, Landscape, Wild Animal, Random (your choice)
- [ ] Create a Summary Table with the columns:
  - Model Name
  - Size in GB
  - RAM required to run it
  - Approximate monthly cost on Fly.io (other platforms may be cheaper/more expensive; we're just using Fly.io for illustration purposes ...)
  - Machine startup time from "paused" (cold boot), just to show the initial page
  - Approximate response time for an image (put the average in the summary table)
  - Sample classification string
- [ ] Create a Detail Table with the columns:
  - Model Name
  - Image thumbnail
  - Image description returned by the classifier/model
  - Response time in ms

Each row in the Detail Table should be an entry for a given model.
Cluster the results together, e.g. the Cat/Kitten pick for each model should be grouped to facilitate comparison.
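As a rough starting point for the timing comparison, here's a minimal sketch, assuming Python with the Hugging Face `transformers` library; the hub ids and image file names below are placeholders/assumptions, not decisions made in this issue:

```python
# Minimal sketch: time each candidate model on the same set of sample images.
# Model ids and image file names are assumptions/placeholders.
import time
from transformers import pipeline

MODELS = {
    "ResNet-50":    ("image-classification", "microsoft/resnet-50"),
    "BLIP (base)":  ("image-to-text", "Salesforce/blip-image-captioning-base"),
    "BLIP (large)": ("image-to-text", "Salesforce/blip-image-captioning-large"),
}

SAMPLE_IMAGES = [  # one per category from the Todo list
    "pet.jpg", "vehicle.jpg", "food.jpg", "person.jpg",
    "landscape.jpg", "wild_animal.jpg", "random.jpg",
]

for name, (task, model_id) in MODELS.items():
    classify = pipeline(task, model=model_id)  # downloads weights on first run
    for image in SAMPLE_IMAGES:
        start = time.perf_counter()
        result = classify(image)
        elapsed_ms = (time.perf_counter() - start) * 1_000
        print(f"{name} | {image} | {elapsed_ms:.0f} ms | {result}")
```

Averaging the per-image timings would fill the "approximate response time" column; size, RAM, cold-boot time and monthly cost would still need to be measured on Fly.io itself.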
Note
Looks like https://huggingface.co/Salesforce/blip-image-captioning-base/tree/main was last updated 11 months ago
...
Can we try https://github.com/salesforce/LAVIS ?
@LuchoTurtle you asked on Standup if we should compare "just" these 3 models.
I think "small", "medium" and "large" is a good starting point.
But if we get feedback from people on HN (once you post the link) that they want more models compared, then more can easily be added.
While it's true that there isn't a de facto leaderboard for image captioning tasks (part of computer vision) the way MTEB is for text embeddings, there's a reason for it.
From what I've seen, the most regarded benchmark comparison that puts different models side by side is https://paperswithcode.com/sota/image-classification-on-imagenet.
It doesn't, however, cover multimodal models (models that can receive multiple types of input), which BLIP is.
I can try to get a small benchmark going, but I'm afraid I don't know how to make it "data sciency" and compare accuracy between the models you've suggested.
There are already tools that compare different one-shot models, like https://huggingface.co/spaces/nielsr/comparing-captioning-models.
What I'm thinking is:
- either comparing the distance between the embeddings of a dataset (ImageNet) image and what a model like BLIP yields, to check for accuracy (see the sketch below);
- or checking https://huggingface.co/docs/transformers/v4.35.0/en/tasks/image_captioning#evaluate.
I'll see to it.
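For what it's worth, one way to read the embedding idea above is to embed a reference caption and the model's generated caption and compare them. A minimal sketch, assuming the `sentence-transformers` library; the embedding model and both captions are illustrative assumptions, not something decided here:

```python
# Minimal sketch: score caption "accuracy" as cosine similarity between the
# embedding of a reference caption and the embedding of the model's output.
# The embedding model and both captions are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

reference_caption = "a tabby kitten sitting on a wooden floor"  # ground truth
model_caption = "a small cat sitting on the floor"              # e.g. BLIP output

ref_emb, out_emb = embedder.encode([reference_caption, model_caption])
similarity = util.cos_sim(ref_emb, out_emb).item()  # 1.0 = same meaning
print(f"caption similarity: {similarity:.2f}")
```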
The only thing we want is a real-world comparison.
i.e. We wanted to use an existing model to classify images.
We compared these 3 models along 3 dimensions: Quality, Speed & Cost.
This is far more interesting to a decision-maker than a synthetic benchmark/leaderboard.
The Massive Text Embedding Benchmark (MTEB) Leaderboard is interesting for Embeddings ...
But your average person has no clue what all the columns in the tables mean.
Is a bigger number better or worse? In some cases the "best" model has a worse score than others.
How is the ranking calculated?
Anyway, we just want to compare the models that are available to us for the purposes of classifying images.
The table will be useful to us and interesting to several thousand other people on HN.
@nelsonic do you mean to load or to get a description?
If it takes time to load, it's probably because the machine was "asleep" and had to boot again / be "woken" (we've set machine instances to sleep after a period of inactivity to save costs). This is perfectly normal. In fact, I've just opened the link and it loaded instantly.
If there's a problem with the time to load the app from a machine that's asleep, that's another issue entirely. Even then, by caching the models, it takes seconds at most, instead of the minutes that would otherwise be wasted re-downloading the models on every app startup.
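For context, "caching the models" here just means the weights are downloaded once (e.g. at image build time) so a cold boot only loads them from disk. The app's actual caching may be implemented differently; a minimal sketch of the idea using `huggingface_hub`:

```python
# Minimal sketch: pre-fetch model weights at build time so app startup only
# reads them from the local cache instead of re-downloading from the Hub.
# Model ids are the ones assumed in the timing sketch above.
from huggingface_hub import snapshot_download

for model_id in (
    "microsoft/resnet-50",
    "Salesforce/blip-image-captioning-base",
    "Salesforce/blip-image-captioning-large",
):
    snapshot_download(repo_id=model_id)  # files land in the local HF cache
```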
Yeah, agree that Fly.io machine wake time is a separate issue that isn't really under our control.
You've done a good job of caching the model.
We just need to trigger the "wake from sleep" when someone views the README.md, as noted in #11.
Meanwhile the descriptions are much better!
I suppose you know you can set `min_machines_running = 1` in the fly.toml; it depends on whether you want this.
Yeah, when we "productionise" this feature, we will set it to be "always on" (min=1), but for now we just want to focus on cold startup time.