How to generate an image from Image+Text?

Hi.
Thanks for the great work you have provided.
In the readme I saw that there are several supported tasks:

Audio to Image
Audio+Text to Image
Audio+Image to Image
Image to Image
Text to Image
Thermal to Image
Depth to Image: Coming soon.

I am new to this type of application, so I was wondering whether it is possible to generate an image from image + text. For example, given an image of a dog and the text "pink flowers", I would like to generate an image that contains a dog and pink flowers.
If so, could you provide example code? I was looking at the code in api.py and I am a bit confused about the use of prompt and text. Moreover, do I need to normalize the image and text embeddings before summing them, or should I normalize the summed embedding instead?

I greatly appreciate your help.
Thanks.

I don't have time to implement it now, but you could refer to

Anything2Image/anything2image/api.py, line 76 in 681958d:

        elif audio is not None and text is not None:

to implement it yourself. The normalization is already handled. In a nutshell, the text and image embeddings should not be normalized; the audio embedding should.

The stable-diffusion-unclip model we use takes two conditions: (1) a prompt and (2) a CLIP image embedding.

When we replace the CLIP image embedding with an ImageBind embedding, we get anything2image.

The prompt in api.py refers to the prompt mentioned above. The text refers to the ImageBind text embedding, which replaces the image embedding and is fed into the diffusion model.
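
For anyone attempting the image + text case, here is a minimal, untested sketch of how the two embeddings might be combined. It is not code from the repository: `blend_image_text` and `image_weight` are hypothetical names, the weighted sum is an assumption, and it also assumes that `load_and_transform_text` and `ModalityType.TEXT` are exposed under the same `imagebind` namespace as the vision helpers. Both embeddings are left unnormalized, following the note above.

```python
def blend_image_text(model, imagebind, image_path, text, device, image_weight=0.5):
    """Hypothetical helper: weighted sum of ImageBind image and text embeddings.

    `model` and `imagebind` are passed in so the sketch does not assume how
    they are imported or loaded; `image_weight` is an illustrative parameter.
    """
    # Image embedding, requested without normalization.
    vision = model.forward({
        imagebind.ModalityType.VISION:
            imagebind.load_and_transform_vision_data([image_path], device),
    }, normalize=False)[imagebind.ModalityType.VISION]

    # Text embedding, also left unnormalized (assumed text helper, see above).
    text_emb = model.forward({
        imagebind.ModalityType.TEXT:
            imagebind.load_and_transform_text([text], device),
    }, normalize=False)[imagebind.ModalityType.TEXT]

    # Weighted sum; image_weight=0.5 gives a plain average of the two.
    return image_weight * vision + (1.0 - image_weight) * text_emb
```

The result would then take the place of the image embedding passed to the diffusion model, alongside the prompt, in the same way as the existing branches in api.py. The weight that works best would probably have to be found by experiment.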

Thanks!

Sorry for bothering you again.
I was going through the original ImageBind code, and it looks like the image embeddings are L2-normalized:
https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#L421

modality_postprocessors[ModalityType.VISION] = Normalize(dim=-1)

but not temperature scaled.
Is there a reason why you skip normalization in your implementation?

        if image is not None:
            Image.fromarray(image).save('tmp.png')
            embeddings = model.forward({
                imagebind.ModalityType.VISION: imagebind.load_and_transform_vision_data(['tmp.png'], device),
            }, normalize=False)
            image_embeddings = embeddings[imagebind.ModalityType.VISION]
            os.remove('tmp.png')
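
To make the normalization question concrete, here is a small pure-Python sketch with toy 2-D vectors (not real ImageBind embeddings). It shows what L2 normalization does to a single vector and how summing raw embeddings differs from summing normalized ones. The helper names are illustrative only, and the sketch says nothing about which variant gives better images.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (no guard against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

image_emb = [3.0, 4.0]  # toy "image" embedding with norm 5
text_emb = [0.0, 1.0]   # toy "text" embedding with norm 1

# Summing raw embeddings: the larger-norm vector dominates the direction.
raw_sum = add(image_emb, text_emb)

# Normalizing each embedding first gives both equal influence on the direction.
normalized_then_summed = add(l2_normalize(image_emb), l2_normalize(text_emb))

# Normalizing after the sum rescales raw_sum but keeps its direction.
summed_then_normalized = l2_normalize(raw_sum)

print(raw_sum, normalized_then_summed, summed_then_normalized)
```

With these toy values the raw sum points along (3, 5) while the normalize-then-sum result points along (0.6, 1.8), so the order of operations changes the direction of the combined embedding, not only its length.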

Thank you for your time!

It was found by trial and error. I didn't dive into the theory much due to limited time.

Oh, I see.
Thanks.