Using Stable diffusion as a photo editor

Motivation

Hey!

The original motivation behind doing this work was to make a little prank on whoever got my PhD position's application, so if you're here because of that, I hope you didn't mind it, enjoyed it and have fun reading this. And if you're just here because you found this repository in the internet sea, welcome too!

First of all, let me explain what the prank was. As I was planning to apply for PhD positions, I realized that a lot of researchers look for "creative and motivated" people to joing their teams, so I asked myself how can I show I have those qualities and at the same time give a little taste of my abilities. With this in mind I had the idea of using a generated image as my cv picture and just mention at the end of it that the person in the photo was not me, that it had been artificially produced and that if they were interested in how I did it, they can check my github out.

The problem was I didn't have the time or resources to make a new generative model from scratch but more importantly, that would not have been creative at all. We all have heard about Midjourney or played with Stable diffusion, so asking for a PhD position because I know how to search something in the internet and input some text in an already trained network which wasn't mine seemed a bit unlikely to work, to say the least.

However, there had been some pieces of work in which people used generative networks to modify artificial images, such as in FamilyGan where they used StyleGan to make predictions about how a couple's child would look like. I started reading about the topic and I was very surprised to find out people were finding vectors in the network's latent space which were allowing them to do all kind of stuff with generated faces like change their age, physiology or how big their smile is.

And that's how I had the idea for what I would do: I would take a picture of me and use one of this text-to-image generative models to see how I could modify it so it would look like a photo I can put on my cv. It's worth mentioning it was never my intention to make the resulting image look like me because if you judge the result by that, you will get disappointed and at the end of the day if I wanted that, I could just use my cellphone's camera. No, what I actually wanted to do was to see if I could change the purpose of this networks and use it to, given that I input a picture, change it however I like.

So without further ado, let's get to it!

Methodology

Choosing the network I would work with was an easy pick since keras already has a page dedicated to Stable Diffusion where it's shown how the conditional diffusion done with the prompts has a vector-space nature in the sense that, given that you use the same Gaussian distributed noise patch and you enter two prompts,you can interpolate their encodded representation and the resulting images would be interpolated versions of your prompts themselves, which is good news because then I knew I will be working in a latent space where its vector space properties are more or less consistent.

That's great but, to which extent does this work? I mean, in their page Keras show this very cute example in which a Golden Retriever becomes a bowl of fruit, so if I call the embedded versions of this prompts $p_1$ and $p_2$ respectively, then the interpolation process would be nothing but $(1-t)p_1 + tp_2$ with $t$ going from 0 to 1. After rearranging, we have that we can understand the interpolation as $p_1 + t(p_2-p_1)$ which looks nice because if $p_2-p1$ has some meaning as interpolator vector, it would mean that other dogs can be turned into bowls of fruit just by adding that vector to their representation, maybe even some other animals.

Testing this idea out I took as my new prompts:

prompt_3 = "A watercolor painting of a Grayhound at the beach"
prompt_4 = "A Golden Retriever at the beach"
prompt_5 = "A watercolor painting of a cat at the beach"

from which I got the following results using the interpolator vector I got before.

prompt's image	prompt plus interpolator vector	prompt's image	prompt plus interpolator vector
prompt 1	prompt 1 + interpolator	prompt 3	prompt 3 + interpolator
prompt 4	prompt 4 + interpolator	prompt 5	prompt 5 + interpolator

It can be appreciated how despite still getting reasonable outputs, they have more modifications than expected since the bowl of fruit changed its background and content and in one case, the fruit even got wormholes. Nonetheless, this suggests interpolators can be generalized, but I would need to change the methodology in order to get what I'm looking for.

Switching to humans

Now that I have a bit more idea of what I can do with this (although I haven't actually talked about modifiying a given picture), I questioned myself which was the best way to obtain this interpolator vectors for, namely, make someone look older. It was clear to me that expecting a single instace to be a good generalization I could use in a broader variety of cases was not going to work and that I had to incorporate the stochastic nature of it.

So let's assume $P$ and $I$ are the distributions of prompts discribing individual persons and of images showing a person in any situation respectively, $f$ is the text encoder and $\Omega$ is Stable diffusion's latent space. Then, what I'm looking for is a vector $\omega \in \Omega$ such that if $\pi_0$ is a prompt of any given person in some situation when they were $t_0$ years old and $\pi_1$ is a prompt of the exact same person in the exact same situation being $t_1$ years old, then $f(\pi_0) + (t_1-t_0)\omega = f(\pi_1)$ and thus $I(f(\pi_0) + (t_1-t_0)\omega)=I(f(\pi_1))$.

Hence, practically speaking and keeping the notation, one way to go further is to get a batch of prompts $x\sim P$ such that the context remains the same, but what changes is the age of the person and apply PCA on $f[x]$, from which I will obtain $\omega$ as the principal component. I tried to implement this idea with prompts literally just discribing a "person", but Stable diffusion proved to be not as good at generating good images of them, so taking it as a sign of the latent space not being as nice around that word, I created prompts for "woman" and "man", using the former to obtain $\omega$ and testing it on both sets.

20 years old (Original prompt)	35 years old	50 years old	65 years old	80 years old

At this point you may be asking a very valid question: why don't generate an ordered batch in which the a person of age $a_1$ is in different situations and another batch with the same order in which the person has another age $a_2$ in the same situations as before, obtain the parwise difference between corresponding elements and take the interpolation vector as its mean? Well, I have two answers for that. The first reason is because of PCA's ability to deal with noise removal better. You can see below the outcome of applying the previously pca-obtained interpolation vector to the promp "A happy 20 years old man" vs. the one obtained by getting the mean.

20 years old PCA	35 years old PCA	50 years old PCA	65 years old PCA	80 years old PCA

20 years old difference vector	35 years old PCA difference vector	50 years old PCA difference vector	65 years old PCA difference vector	80 years old PCA difference vector

It can be appreciated how in both cases the person aged, but that in the mean version the person changed the shirt's position, got a hat or a wig, got glasses and then lost them and changed the background, whereas for the one generated in the pca case, they kept the glasses once they got them, the shirt remained the same and the only change in the background was a door popping up.

The second reason I can offer to use pca over averaging is because we can obtain several principal components and this will come on handy later.

On another note, this last prompts were very easy to generate becase age is a quality we describe with a number, but what about the cases in which the feature isnot expressed this way? For this I generated promp batches, this time of men and women wearing shirts of different colors, again obtaining my interpolator vector from the women dataset and testing it in men's and in a new batch of women wearing skirts.

It may be worth mentioning that even if colors can be represented as numbers, we are working with a network trained on natural language and I cannot recall the last time I heard somebody praising the 570 to 590 nm wavelength in van Gogh's paintings or asking for a B60017 apple and thus the not only discrete nature of the colors but also the impossibility of embedding them continuously into $\mathbb{N}$.

blue skirt	1st color change	2nd color change	3rd color change	gray skirt

But there may be even harder cases in which we are working with a quality with only few classes, for example having short, medium or long hair or wearing glasses or not. Since I'm using PCA, applying it to a three prompts dataset ("person short/medium/long hair") doesn't make too much sense, but I've mentioned that obtaining averaging creates a lot of undesired changes on our images. Because of this and since I can obtain more than just one principal component, I created prompts in which I was not only changing the description of the hair length but also on the place the person was, which gave me the vector to create the following images.

short hair	long hair	even longer hair

short hair	medium hair	long hair

Before I go on, let me say something about an extra meaning we can assign to the principal component vector I'm using. Since I'm getting the principal component $v$ and using it to move around the latent space, I could build a function $\phi:\Omega\rightarrow\mathbb{R}$ such that $\phi(x) = x\cdot v$. Thus, if $c\in\mathbb{R}$ is parametrizing the movement on the direction $v$, $\phi(x+cv) = x\cdot v + c||v||^2 = x\cdot v+c$ since $v$ is normalized. Hence, we can understand $c$ as the amount of units we'll be moving in the desired direction, i.e. how many years the person will age, how many color units the clothing will change or how many length units the hair will grow. It also give us some insight into the way Stable diffusion understands the properties we are talking about. For instance it shows how it sees age and colors as a continuium, but hair length as two clusters.

$\phi$ for age	$\phi$ for color in woman's clothing vs. color in men's clothing	$\phi$ for hair length woman vs. man

Just to prove this was not only something from the human-related encodings, I tested the same technique to make the hours or months pass in a Central Park's image. The method proved to be successful and I generated some gifs to show how it not only distinguishes hours from months but also how smooth the transition is.

Hours passing by at Central Park	Months passing by at Central Park

Modifying a picture

Now that I have a better understanding on how the latent space works, I can finally tackle the problem on how to modify a photo I input and not just generate them.

As we all know, Stable diffusion begins with a random noise patch Gaussianly distributed and it's through several loops in which the text encoding works as a conditional diffusion that we get the image we're looking for. Knowing this, the first thing we would be tempted to do is to take our input photo, encode it and use it as the noise patch, but sadly this won't work because the network is expecting the patch to have a Gaussian distribution, which is very different form the distribution an usual picture would have.

Thus, we want to build a patch out of our input such that its distribution is similar to that of a $\mathcal{N}(0,1)$, but which still has some information of the original photo, so we are not just getting whatever randomly generated image. For doing this, I computed the intensity distribution $\varphi$ from my input photo and the solution $\gamma\in\Pi(\varphi,\mathcal{N}(0,1))$ from the Optimal Transport problem (in the Kantorovich version) between it and $\mathcal{N}(0,1)$.

By definition, we have that if $\Pi_z:\Omega\times\Omega\rightarrow\Omega$ is the projection into the $z$ coordinate, then $\Pi_x\star\gamma = \varphi$ and $\Pi_y\star\gamma = \mathcal{N}(0,1)$ where $\star$ is the pushforward operator. Thus, I interpolated between my picture and a Gaussian distribution using the optimal map, getting a transition as the one you can see in the plots below.

Photo distribution	1st interpolation	2nd interpolation	3rd interpolation	4th interpolation	5th interpolation

6th interpolation	7st interpolation	8nd interpolation	9rd interpolation	10th interpolation	Gaussian distribution

I chose empirically an interpolation step which would produce good images, meaning that Stable diffusion was understanding it as a valid initial patch while still have some information about my original distribution.

Editing my photo

Finally I got to the point in which I edit my own picture. What I will do is take one photo of me as a child, send it to the latent space, make the interpolation for its distribution to be closer to a Gaussian one, compute vectors to make me older, to wear glasses and to make my skin darker (this last one because as you may have noticed in the examples above, this network has the tendency to generate people of lighter skin) and apply them in the patch I obtained to get a photo to put on my cv. As I mentioned before, the goal of this work ain't producing an avatar identical to me, but to proof I can repurpose this architectures to modify my photo in order for it to still be a realisitc picture someone would put on their cv.

One photo I liked, not just because of the content but because my face appears clearly, was one they took at the hospital when I sister was born. You can see the original photo and the copping I kept below.

Original photo	Cropped photo

Then, having "A little boy of 5 years old at the park posing for a photo for his cv" as prompt, I input my picture into the network to generate a Stable-diffusion-version of me in the park, from which I got the following examples.

Original photo	Example 1	Example 2	Example 3

Realizing those kids actually look like me as a child, I applied the vectors as explained before and this is what I got.

Input photo	A photo of me at a park	Generated photo

Last but not least, if you want to see the trajectory from the generated child to the final photo, you have the next gif.

Evolution of my avatar