Hey!
The original motivation behind this work was to play a little prank on whoever received my PhD application, so if you're here because of that, I hope you didn't mind it, enjoyed it and have fun reading this. And if you're just here because you found this repository in the internet sea, welcome too!
First of all, let me explain what the prank was. As I was planning to apply for PhD positions, I realized that a lot of researchers look for "creative and motivated" people to join their teams, so I asked myself how I could show I have those qualities and, at the same time, give a little taste of my abilities. With this in mind I had the idea of using a generated image as my cv picture and simply mentioning at the end of the cv that the person in the photo was not me, that it had been artificially produced, and that if they were interested in how I did it, they could check out my github.
The problem was that I didn't have the time or resources to build a new generative model from scratch, and more importantly, that would not have been creative at all. We have all heard about Midjourney or played with Stable Diffusion, so asking for a PhD position because I know how to search for something on the internet and type some text into an already trained network that wasn't mine seemed a bit unlikely to work, to say the least.
However, there had been some work in which people used generative networks to modify artificial images, such as FamilyGan, where StyleGan was used to predict what a couple's child would look like. I started reading about the topic and I was very surprised to find out people were finding vectors in the network's latent space that allowed them to do all kinds of things with generated faces, like changing their age, physiology or how big their smile is.
And that's how I had the idea for what I would do: I would take a picture of myself and use one of these text-to-image generative models to see how I could modify it so it would look like a photo I could put on my cv. It's worth mentioning it was never my intention to make the resulting image look like me, because if you judge the result by that, you will be disappointed, and at the end of the day, if I wanted that, I could just use my cellphone's camera. No, what I actually wanted was to see if I could repurpose these networks so that, given an input picture, I could change it however I liked.
So without further ado, let's get to it!
Choosing the network I would work with was an easy pick, since Keras already has a page dedicated to Stable Diffusion showing that the conditional diffusion done with the prompts has a vector-space nature: if you use the same Gaussian-distributed noise patch and enter two prompts, you can interpolate their encoded representations and the resulting images will be interpolated versions of the prompts themselves. This was good news, because it meant I would be working in a latent space whose vector-space properties are more or less consistent.
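For reference, here is a minimal sketch of that interpolation, following the API from the Keras tutorial (`encode_text` and `generate_image` in keras_cv); the variable names are mine and the exact arguments may need adjusting depending on your keras_cv version:

```python
import tensorflow as tf
import keras_cv

# Load the KerasCV Stable Diffusion wrapper (weights download on first use).
model = keras_cv.models.StableDiffusion(img_height=512, img_width=512, jit_compile=True)

prompt_1 = "A watercolor painting of a Golden Retriever at the beach"
prompt_2 = "A still life DSLR photo of a bowl of fruit"

# Encode both prompts into the text-embedding space (shape (77, 768) after squeezing).
e1 = tf.squeeze(model.encode_text(prompt_1))
e2 = tf.squeeze(model.encode_text(prompt_2))

# Linearly interpolate between the two encodings.
steps = 5
encodings = tf.linspace(e1, e2, steps)

# Use the *same* Gaussian noise patch for every step, so only the text
# conditioning changes from one image to the next.
noise = tf.random.normal((512 // 8, 512 // 8, 4), seed=42)

images = model.generate_image(
    encodings,
    batch_size=steps,
    diffusion_noise=noise,
)
```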
That's great, but to what extent does this work? I mean, in their page Keras shows this very cute example in which a Golden Retriever becomes a bowl of fruit, so if I call the embedded versions of these prompts $e_1$ and $e_2$, their difference $v = e_2 - e_1$ works as an interpolator vector I can add to the encoding of other prompts.
Testing this idea out, I took as my new prompts:
- prompt_3 = "A watercolor painting of a Grayhound at the beach"
- prompt_4 = "A Golden Retriever at the beach"
- prompt_5 = "A watercolor painting of a cat at the beach"
from which I got the following results by applying the interpolator vector $v$ from before.
It can be appreciated how, despite still getting reasonable outputs, they show more modifications than expected: the bowl of fruit changed its background and content, and in one case the fruit even got worm holes. Nonetheless, this suggests interpolators can be generalized, though I would need to change the methodology to get exactly what I'm looking for.
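In code, the transfer step amounts to something like the following sketch (reusing `model`, the encodings and the noise patch from the previous snippet):

```python
# The interpolator vector between the two original prompts.
v = e2 - e1

prompt_3 = "A watercolor painting of a Grayhound at the beach"
e3 = tf.squeeze(model.encode_text(prompt_3))

# Pushing the new prompt's encoding along v should turn the Greyhound painting
# into (something like) a bowl of fruit.
image = model.generate_image(
    tf.expand_dims(e3 + v, axis=0),
    batch_size=1,
    diffusion_noise=noise,
)
```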
Now that I had a bit more of an idea of what I could do with this (although I haven't actually talked about modifying a given picture yet), I asked myself what the best way was to obtain these interpolator vectors to, say, make someone look older. It was clear to me that expecting a single instance to generalize well to a broader variety of cases was not going to work, and that I had to take the stochastic nature of the encodings into account.
So let's assume the encoding we get from a single prompt is just a noisy sample of the concept it describes: two phrasings of "a 20 years old person" will not land on exactly the same point of the latent space, so a single difference vector inherits all of that noise.
Hence, practically speaking and keeping that idea in mind, one way to go further is to get a batch of prompt pairs describing the same kind of person at a young age and at an older age, compute the difference vector of each pair, and use their mean as the interpolator. Applying that mean vector to the original prompt's encoding gives the progression below.
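Concretely, something along these lines; the prompt wordings here are illustrative placeholders, not necessarily the ones I used:

```python
import numpy as np

# Paraphrased prompts describing the same kind of person at two ages.
young_prompts = [
    "A portrait photo of a 20 year old man",
    "A picture of a twenty year old man posing for a photo",
    "A photograph of a young man in his twenties",
]
old_prompts = [
    "A portrait photo of a 50 year old man",
    "A picture of a fifty year old man posing for a photo",
    "A photograph of a man in his fifties",
]

young = np.stack([tf.squeeze(model.encode_text(p)).numpy() for p in young_prompts])
old = np.stack([tf.squeeze(model.encode_text(p)).numpy() for p in old_prompts])

# One difference vector per pair, then their mean as the "aging" interpolator.
diffs = old - young                # shape (n_pairs, 77, 768)
aging_vector = diffs.mean(axis=0)

# Apply it (possibly scaled) to the original prompt's encoding.
base = tf.squeeze(model.encode_text("A portrait photo of a 20 year old man"))
aged = model.generate_image(
    tf.expand_dims(base + 1.0 * aging_vector, axis=0),
    batch_size=1,
    diffusion_noise=noise,
)
```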
20 years old (Original prompt) | 35 years old | 50 years old | 65 years old | 80 years old |
---|---|---|---|---|
At this point you may be asking a very valid question: why not generate an ordered batch in which a person of age 20, 35, 50, 65 and 80 is described, and apply PCA to the resulting vectors instead of just averaging the differences? The next image grids compare both approaches.
20 years old PCA | 35 years old PCA | 50 years old PCA | 65 years old PCA | 80 years old PCA |
---|---|---|---|---|
20 years old PCA difference vector | 35 years old PCA difference vector | 50 years old PCA difference vector | 65 years old PCA difference vector | 80 years old PCA difference vector |
---|---|---|---|---|
It can be appreciated how in both cases the person aged, but in the mean version the person's shirt shifted position, they got a hat or a wig, got glasses and then lost them, and the background changed, whereas in the PCA case they kept the glasses once they got them, the shirt remained the same, and the only change in the background was a door popping up.
The second reason I can offer to use PCA over averaging is that we can obtain several principal components, and this will come in handy later.
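One way to set this up is with scikit-learn's PCA on the flattened difference vectors; this is only a sketch of the idea (sklearn centers the data and returns unit-norm components, so both the scale and the sign of the component have to be tuned by looking at the outputs):

```python
from sklearn.decomposition import PCA

# Flatten each (77, 768) difference vector so PCA sees one row per prompt pair.
flat_diffs = diffs.reshape(len(diffs), -1)

# Keep a couple of components: the first one acts as the main "aging" direction,
# and the extra ones will come in handy later.
pca = PCA(n_components=2)
pca.fit(flat_diffs)

# Components come back unit-norm and with an arbitrary sign, so the multiplier
# below is chosen by eye.
aging_pc = pca.components_[0].reshape(77, 768).astype("float32")

aged_pca = model.generate_image(
    tf.expand_dims(base + 4.0 * aging_pc, axis=0),
    batch_size=1,
    diffusion_noise=noise,
)
```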
On another note, these last prompts were very easy to generate because age is a quality we describe with a number, but what about the cases in which the feature is not expressed this way? For this I generated prompt batches, this time of men and women wearing shirts of different colors, again obtaining my interpolator vector from the women's dataset and testing it on the men's and on a new batch of women wearing skirts.
It may be worth mentioning that even if colors can be represented as numbers, we are working with a network trained on natural language, and I cannot recall the last time I heard somebody praising the 570 to 590 nm wavelengths in van Gogh's paintings or asking for a B60017 apple. Thus, not only the discrete nature of color names but also the impossibility of embedding them continuously into the prompts makes this a harder case, as the results below show.
blue skirt | 1st color change | 2nd color change | 3rd color change | gray skirt |
---|---|---|---|---|
But there may be even harder cases in which we are working with a quality that has only a few classes, for example having short, medium or long hair, or wearing glasses or not. Since I'm using PCA, applying it to a three-prompt dataset ("person with short/medium/long hair") doesn't make too much sense, but I've already mentioned that averaging creates a lot of undesired changes in our images. Because of this, and since I can obtain more than just one principal component, I created prompts in which I was changing not only the description of the hair length but also the place the person was in, which gave me the vector to create the following images (a sketch of how one could pick the right component follows the image grids below).
short hair | long hair | even longer hair |
---|---|---|
short hair | medium hair | long hair |
---|---|---|
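A minimal sketch of how this component selection could look; the prompts and the selection criterion here are illustrative, not necessarily exactly what I ran:

```python
# Prompts that vary *two* things at once: hair length and location.
hair_prompts = [
    "A photo of a person with short hair at the park",
    "A photo of a person with long hair at the park",
    "A photo of a person with short hair at the beach",
    "A photo of a person with long hair at the beach",
    "A photo of a person with short hair in a library",
    "A photo of a person with long hair in a library",
]
enc = np.stack(
    [tf.squeeze(model.encode_text(p)).numpy().reshape(-1) for p in hair_prompts]
)

pca2 = PCA(n_components=2)
pca2.fit(enc)

# Project a known short -> long difference onto each component and keep the one
# it aligns with best; that component should encode hair length rather than the
# location of the scene.
probe = enc[1] - enc[0]
alignments = np.abs(pca2.components_ @ probe)
hair_pc = pca2.components_[np.argmax(alignments)].reshape(77, 768).astype("float32")
```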
Before I go on, let me say something about an extra meaning we can assign to the principal component vector I'm using. Since I'm getting the principal component of a set of difference vectors, what I really have is the direction along which those encodings vary the most, so instead of a single fixed offset I get an axis I can move along in both directions and with whatever magnitude I want.
Just to prove this was not only something about human-related encodings, I tested the same technique to make the hours or the months pass in an image of Central Park. The method proved to be successful and I generated some gifs to show not only how it distinguishes hours from months but also how smooth the transition is.
Hours passing by at Central Park | Months passing by at Central Park |
---|---|
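Generating the gifs is just a matter of walking along the corresponding vector in small steps and stitching the frames together; a rough sketch (the `hours_vector` name, the step range and the frame count are placeholders):

```python
import imageio

# Walk along the "hours passing" direction in small steps and save the frames.
# hours_vector is assumed to have been computed with the PCA recipe above.
base_ny = tf.squeeze(model.encode_text("A photo of Central Park"))

frames = []
for t in np.linspace(0.0, 6.0, 24):
    img = model.generate_image(
        tf.expand_dims(base_ny + float(t) * hours_vector, axis=0),
        batch_size=1,
        num_steps=25,
        diffusion_noise=noise,
    )
    frames.append(img[0])

imageio.mimsave("hours_at_central_park.gif", frames, duration=0.15)
```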
Now that I have a better understanding of how the latent space works, I can finally tackle the problem of modifying a photo I input, and not just generating new ones.
As we all know, Stable Diffusion begins with a Gaussian-distributed random noise patch, and it's through several loops in which the text encoding acts as a conditional diffusion signal that we get the image we're looking for. Knowing this, the first thing we would be tempted to do is take our input photo, encode it and use it as the noise patch, but sadly this won't work, because the network expects the patch to follow a Gaussian distribution, which is very different from the distribution a usual picture has.
Thus, we want to build a patch out of our input whose distribution is similar to that of a Gaussian noise patch. By definition, if the encoded photo has mean $\mu$ and standard deviation $\sigma$, then subtracting $\mu$ and dividing by $\sigma$ gives us a patch with mean $0$ and standard deviation $1$, and from there we can linearly interpolate it with an actual Gaussian noise patch to push its distribution as close to a Gaussian as we need. Below you can see how the patch's distribution evolves along that interpolation.
Photo distribution | 1st interpolation | 2nd interpolation | 3rd interpolation | 4th interpolation | 5th interpolation |
---|---|---|---|---|---|
6th interpolation | 7th interpolation | 8th interpolation | 9th interpolation | 10th interpolation | Gaussian distribution |
---|---|---|---|---|---|
I empirically chose an interpolation step that produced good images, meaning that Stable Diffusion understood it as a valid initial patch while it still kept some information about my original distribution.
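Here is a minimal sketch of that patch construction. I'm assuming the keras_cv wrapper exposes the VAE encoder as `model.image_encoder` and that it maps a 512x512 image to a 64x64x4 latent; the attribute name, the preprocessing and the interpolation step `t` may all need adjusting:

```python
from tensorflow import keras

# Load the input photo at the resolution the VAE encoder expects, scaled to [-1, 1].
photo = keras.utils.load_img("me_as_a_child.jpg", target_size=(512, 512))
photo = np.asarray(photo, dtype=np.float32) / 127.5 - 1.0

# Encode it into the 64x64x4 latent space (attribute name and preprocessing
# may differ between keras_cv versions).
latent = model.image_encoder.predict(photo[None, ...])[0]

# Standardize the latent so it has zero mean and unit standard deviation...
latent = (latent - latent.mean()) / latent.std()

# ...and interpolate it with an actual Gaussian patch. The step t was chosen by
# eye: large enough for Stable Diffusion to accept it as a starting patch, small
# enough to keep some of the photo's structure.
gauss = tf.random.normal(latent.shape, seed=42).numpy()
t = 0.7
patch = (1.0 - t) * latent + t * gauss   # used as diffusion_noise later on
```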
Finally I got to the point where I could edit my own picture. What I will do is take one photo of me as a child, send it to the latent space, interpolate its distribution so it gets closer to a Gaussian one, compute vectors to make me older, to add glasses and to make my skin darker (this last one because, as you may have noticed in the examples above, this network has a tendency to generate people with lighter skin), and apply them to the patch I obtained in order to get a photo I can put on my cv. As I mentioned before, the goal of this work isn't to produce an avatar identical to me, but to prove I can repurpose these architectures to modify my photo so that it is still a realistic picture someone would put on their cv.
One photo I liked, not just because of the content but because my face appears clearly, was one they took at the hospital when my sister was born. You can see the original photo and the cropping I kept below.
Original photo | Cropped photo |
---|---|
Then, having "A little boy of 5 years old at the park posing for a photo for his cv" as the prompt, I input my picture into the network to generate a Stable Diffusion version of me in the park, from which I got the following examples.
Original photo | Example 1 | Example 2 | Example 3 |
---|---|---|---|
Realizing those kids actually look like me as a child, I applied the vectors as explained before and this is what I got.
Input photo | A photo of me at a park | Generated photo |
---|---|---|
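Putting all of the pieces together, the final generation boils down to something like this; the attribute vectors and their scales are illustrative (in practice I tuned them by looking at the outputs), and `patch` is the photo-derived noise patch built above:

```python
# Scene prompt, photo-derived starting patch and attribute vectors.
# aging_vector, glasses_vector and skin_vector are assumed to have been
# computed as described earlier; the multipliers are placeholders.
prompt = "A little boy of 5 years old at the park posing for a photo for his cv"
base_cv = tf.squeeze(model.encode_text(prompt))

edited = base_cv + 3.0 * aging_vector + 1.5 * glasses_vector + 1.5 * skin_vector

cv_photo = model.generate_image(
    tf.expand_dims(edited, axis=0),
    batch_size=1,
    num_steps=50,
    diffusion_noise=tf.constant(patch),
)
```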
Last but not least, if you want to see the trajectory from the generated child to the final photo, here is a gif of it.
Evolution of my avatar
I hope you liked it!