New alchemy forms - CLIP image feature extraction, CLIP text encode
There are use cases for being able to do client-side manipulation of the various intermediate results of the CLIP interrogation process.
To compare an image to text via CLIP, the following happens (a code sketch follows this list):
1. The text is encoded into features. `open_clip` uses `clip_model.encode_text(text_tokens)`; this returns a `tensor`.
2. The image "features" are extracted using the CLIP model. `open_clip` uses `clip_model.encode_image(...)`; this returns a `tensor`.
3. The tensors are normalized.
4. The image features and the text features are compared.
5. A similarity score is assigned and returned.
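For reference, a minimal sketch of that pipeline with `open_clip`; the model and checkpoint names here are illustrative, not necessarily what the horde runs:

```python
import open_clip
import torch
from PIL import Image

# Illustrative model choice; the horde's actual CLIP checkpoints may differ.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

image = preprocess(Image.open("example.png")).unsqueeze(0)
text_tokens = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    text_features = model.encode_text(text_tokens)  # step 1: returns a tensor
    image_features = model.encode_image(image)      # step 2: returns a tensor

    # step 3: normalize both tensors
    text_features /= text_features.norm(dim=-1, keepdim=True)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # steps 4-5: compare features and produce similarity scores
    similarity = image_features @ text_features.T

print(similarity)
```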
This feature request would allow the results of steps 1 and 2 to be returned independently, either as part of a regular interrogate request or separately on their own. Clients could then perform the math pertinent to their use case without needing to load a CLIP model locally, even in slow or RAM-limited environments. Certain types of image-searching/database schemes could benefit from this.
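To make that client-side math concrete: once the embeddings exist as `.safetensors` files, the comparison itself needs no CLIP model at all. A minimal sketch, assuming the returned files store the tensor under a `features` key (an assumed convention, not a spec):

```python
import torch
from safetensors.torch import load_file

# Hypothetical file names; these stand in for the .safetensors files the
# proposed forms would return. The "features" key is an assumed convention.
text_emb = load_file("text.safetensors")["features"]
image_emb = load_file("image.safetensors")["features"]

# Normalize and compare: no CLIP model in memory, just two small tensors.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).item())
```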
I propose the following forms be added:
- `encode_text`
  - Accepts a list of strings and the value of a supported CLIP model.
  - For each string, returns a `.safetensors` file containing the encoded text tensor and which model was used to encode it.
- `encode_image`
  - Accepts a `source_image` and the value of a supported CLIP model.
  - Returns a `.safetensors` file containing the encoded image features and which model was used to encode it.
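As a sketch of what each returned file could contain, the safetensors metadata header is a natural place to record which model produced the tensor. The `features` key and `clip_model` metadata field below are assumptions, not a spec:

```python
import torch
from safetensors.torch import save_file

# Stand-in for an encoded text or image tensor produced by the worker.
features = torch.randn(1, 768)

# Hypothetical payload layout: the tensor under "features", with the model
# name recorded in the (string-only) safetensors metadata header.
save_file(
    {"features": features},
    "embedding.safetensors",
    metadata={"clip_model": "ViT-L-14/openai"},
)
```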
This proposal has the obvious wrinkle of needing to support the upload of `.safetensors` files. The size of these files is on the order of single-digit kilobytes (for example, a 768-dimensional float32 embedding, as ViT-L/14 produces, is 768 × 4 bytes ≈ 3 KB plus a small header).
Related to Haidra-Org/horde-worker-reGen#9.
A useful feature might be to opt into including the resulting image embeddings with an image generation request.
I.e., in the `/generate/status/` endpoint, each generation result would include an R2 URL pointing to that image’s calculated embedding `.safetensors` file.
That being said, it’s easily avoidable by just doing the alchemy request separately, and I imagine this request would be more difficult to set up.
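For illustration, client handling of such a result might look like this; the `embedding` field name and the URLs are hypothetical:

```python
import requests
from safetensors.torch import load

# Hypothetical /generate/status/ generation result with an added embedding URL.
generation = {
    "id": "abc-123",
    "img": "https://r2.example/abc-123.webp",
    "embedding": "https://r2.example/abc-123.safetensors",  # assumed field
}

# safetensors.torch.load parses raw bytes, so no temp file is needed.
tensors = load(requests.get(generation["embedding"]).content)
print(tensors["features"].shape)  # "features" key is an assumed convention
```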
I think we might avoid using R2 here and just b64 the safetensors in the DB. A couple of KB per file shouldn't be a terrible amount, and if bandwidth starts being choked due to these I can always switch to R2 later.
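A rough sketch of that alternative, reusing the assumed `features` key from above: `safetensors.torch.save` serializes to bytes, and base64 turns those bytes into DB-friendly text.

```python
import base64
import torch
from safetensors.torch import save, load

# Serialize the tensor dict to bytes, then base64 for storage in a text column.
payload = save({"features": torch.randn(1, 768)})
b64 = base64.b64encode(payload).decode("ascii")
print(len(b64))  # a few KB of text, consistent with the size estimate above

# Decoding on the way out is the reverse.
tensors = load(base64.b64decode(b64))
```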