CLIP for image and text embeddings
txhno opened this issue · 8 comments
Link to the documentation pages (if available)
https://github.com/patrickjohncyh/fashion-clip
https://huggingface.co/patrickjohncyh/fashion-clip
How could the documentation be improved?
It's a fine-tune of CLIP trained on a 500K-item fashion dataset, and I would like to use it through the Hugging Face API. Either that or their wrapper package.
If it can be done, please let me know how. Thanks! :)
Interesting, it seems to use the CLIPModel architecture. @DevinTDHa, can we do that currently, or should we put it on the roadmap?
Seems like it should work, if the underlying model doesn't have any architectural changes.
I'll try it out and report back!
The model works no problem in Spark NLP! Just follow this notebook to import the model properly:
If you change the model name to patrickjohncyh/fashion-clip, it should work. Let me know if you have any other questions.
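For anyone following along, here is a minimal sketch of that import flow, assuming the usual ONNX export path used in the Spark NLP import notebooks (the paths and the `assets` layout are illustrative, so adjust them to your setup):

```python
# Sketch of the usual Spark NLP import flow for a Hugging Face CLIP model.
# Assumes the model was first exported to ONNX, e.g. with:
#   optimum-cli export onnx --model patrickjohncyh/fashion-clip ./fashion-clip-onnx
# and that the tokenizer/preprocessor files were moved into an `assets`
# subfolder, as the import notebook does. Paths here are illustrative.

import sparknlp
from sparknlp.annotator import CLIPForZeroShotClassification

spark = sparknlp.start()

EXPORT_PATH = "./fashion-clip-onnx"  # directory produced by the export step

clip = (
    CLIPForZeroShotClassification.loadSavedModel(EXPORT_PATH, spark)
    .setInputCols(["image_assembler"])
    .setOutputCol("label")
)

# Save it once in Spark NLP's own format so later runs can simply .load() it
clip.write().overwrite().save("./fashion_clip_spark_nlp")
```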
@DevinTDHa @maziyarpanahi Thanks a lot! I will definitely try it out. I'll contact you if something pops up. :)
@DevinTDHa @maziyarpanahi I have a different question. Is it possible to use Spark NLP to compute CLIP embeddings directly, instead of just using the ZeroShotClassification? My use case is taking a folder of images, using Spark NLP to compute embeddings for all the images in the folder, and storing them in a vector store for later retrieval or similarity-search tasks.
Could I do something like this?
```python
image_assembler = (
    ImageAssembler()
    .setInputCol("image")
    .setOutputCol("image_assembler")
)

CLIP = (
    CLIPForZeroShotClassification.loadSavedModel(f"{EXPORT_PATH}", spark)
    .setInputCols(["image_assembler"])
    .setOutputCol("embedding")
)
```
And could I do it without setting labels, i.e. without calling CLIP.setCandidateLabels?
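For reference, this is roughly the computation meant here, sketched with the plain transformers API rather than Spark NLP (`get_image_features` / `get_text_features` are standard `CLIPModel` methods; the folder path, query text, and the similarity step at the end are just illustrative):

```python
# Sketch of the desired embedding computation with plain Hugging Face
# transformers. The vector-store step is replaced by a simple in-memory
# cosine-similarity search for illustration.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

# Embed every image in a folder (illustrative path)
paths = sorted(Path("images/").glob("*.jpg"))
images = [Image.open(p) for p in paths]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_embs = model.get_image_features(**inputs)
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)

# Embed a text query and retrieve the most similar image
text_inputs = processor(text=["a red floral dress"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (image_embs @ text_emb.T).squeeze(-1)
best = scores.argmax().item()
print(paths[best], scores[best].item())
```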
Hi @txhno,
Sadly, this is currently not possible; we would have to add it as a new feature. I don't think it would take much time, since most of it is already implemented. @maziyarpanahi, perhaps we could fit this into one of the next releases?
Thanks @txhno and @DevinTDHa. In fact, the idea was always to continue with CLIP and add one annotator that converts images to embeddings and another that converts text to embeddings.
We will add these to our roadmap, and I'll change this into a feature request ticket.
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 5 days