Image Editing
egeozsoy opened this issue · 13 comments
Do you have any suggestions for generating images conditioned not only on text, but also on a masked image, as OpenAI describes in their blog https://openai.com/dall-e-2/?
You would need to train it with an inpainting task. In particular, the decoder UNet needs to take in a mask input, concatenated with the masked image on the channel dimension, to predict the original image. Right now I think this feature is not implemented in this repo.
I think this would be a nice addition at some point, if I do anything on this regard will let you know :)
It would be good to train an all-in-one model that inpaints as needed, or does full image generation by simply giving an all-zero mask.
Agreed, it would be a complementary task, so training on both tasks at the same time should likely not hurt overall performance.
Correction to the above, the masked image and mask would also need to be concatenated to x (aka the noised image).
So if one model is trained for both tasks at the same time, we would feed noised_image + empty masked image + empty mask during normal training, and noised_image + masked_original_image + mask during inpainting training (where + denotes channel concatenation); a rough sketch is below.
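For illustration, a minimal sketch of assembling that conditioning (assumptions: `build_unet_input` is just a hypothetical helper, not anything in this repo, and the mask convention of 1 = region to regenerate is my own choice):

```python
import torch

def build_unet_input(noised_image, original_image=None, mask=None):
    # noised_image: (b, c, h, w) -- the x_t currently being denoised
    # mask: (b, 1, h, w), 1 where content should be regenerated, 0 where it is known
    b, c, h, w = noised_image.shape
    if mask is None:
        # full image generation: all-zero mask and all-zero masked image
        mask = torch.zeros(b, 1, h, w, device=noised_image.device)
        masked_image = torch.zeros_like(noised_image)
    else:
        # inpainting: keep the known region, zero out the region to be filled in
        masked_image = original_image * (1. - mask)
    # concatenate along the channel dimension -> (b, 2c + 1, h, w)
    return torch.cat((noised_image, masked_image, mask), dim=1)
```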
Do we have to add noise to the entire image, or is it enough to add noise only to the masked part? Not exactly the most scientific resource, but check the video at https://openai.com/dall-e-2/, timestamp 2:37 ("monkey paying taxes"). It seems like they input an image where only the masked part is noised.
During training, only x is noised and denoised; the masked image and mask are used directly, if I understand the various pieces of OpenAI's GLIDE and DDPM code correctly.
Taken from the GLIDE paper: "Most previous work that uses diffusion models for inpainting has not trained diffusion models explicitly for this task (Sohl-Dickstein et al., 2015; Song et al., 2020b; Meng et al., 2021). In particular, diffusion model inpainting can be performed by sampling from the diffusion model as usual, but replacing the known region of the image with a sample from q(xt|x0) after each sampling step."
So maybe a short-term solution could be to adapt the sampling logic to allow inpainting in this fashion?
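Something like this, perhaps (a hedged sketch of that GLIDE-style trick; `q_sample` and `p_sample` are placeholders for whatever forward/reverse diffusion helpers are in use, not this repo's API):

```python
import torch

@torch.no_grad()
def inpaint_sample(model, x0_known, mask, timesteps, q_sample, p_sample):
    # x0_known: original image whose unmasked region should be preserved
    # mask: 1 where new content is generated, 0 where x0_known is kept
    # q_sample(x0, t): draws x_t ~ q(x_t | x_0) (forward diffusion)
    # p_sample(model, x_t, t): one reverse denoising step x_t -> x_{t-1}
    img = torch.randn_like(x0_known)
    for t in reversed(range(timesteps)):
        img = p_sample(model, img, t)
        # replace the known region with a forward-diffused version of the
        # original image at the matching noise level, as in the GLIDE paper
        known = q_sample(x0_known, t - 1) if t > 0 else x0_known
        img = img * mask + known * (1. - mask)
    return img
```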
yea, depending on some circumstances next week, i could build this, let's leave this open
It would be good to train an all-in-one model that inpaints as needed, or does full image generation by simply giving an all-zero mask.
yup, this is the most ideal case :)
Alternatively, can you not just finetune the generation model for inpainting? You would only have to change the input layer; the rest of the weights in the network should carry over.
i think i'm going to aim for integrating this technique: https://github.com/andreas128/RePaint. it is a pretty recent paper, but the results look good. can use this resampling technique for both dalle2 and imagen
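For context, the core of RePaint's resampling loop is roughly the following (a sketch under assumptions: `q_sample`, `p_sample`, and `undo_step` are hypothetical helpers standing in for the usual diffusion utilities, not this repo's API, and the mask convention matches the sketches above):

```python
import torch

@torch.no_grad()
def repaint_sample(model, x0_known, mask, timesteps, resample_steps,
                   q_sample, p_sample, undo_step):
    # mask: 1 where content is generated, 0 where x0_known is kept
    # q_sample(x0, t): forward-diffuses the original image to noise level t
    # p_sample(model, x_t, t): one reverse denoising step x_t -> x_{t-1}
    # undo_step(x, t): re-noises x_{t-1} back to x_t (one forward step)
    img = torch.randn_like(x0_known)
    for t in reversed(range(timesteps)):
        for r in range(resample_steps):
            # denoise the unknown region, forward-diffuse the known region
            x_unknown = p_sample(model, img, t)
            x_known = q_sample(x0_known, t - 1) if t > 0 else x0_known
            img = x_unknown * mask + x_known * (1. - mask)
            # resampling: jump one step back up the diffusion so the two
            # regions can be harmonized on the next pass
            if r < resample_steps - 1 and t > 0:
                img = undo_step(img, t)
    return img
```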
ok it is done https://github.com/lucidrains/dalle2-pytorch#inpainting