/stable-diffusion

Artist collaborations, explanatory notes and code implementations.


What is Stable Diffusion?

Stable Diffusion is a computer program that can turn text into images, a kind of program referred to as a machine learning or deep learning model. It was released in 2022 as an open source project, meaning anyone can read the source code and make their own version of it. This was not the case for its predecessors.

Below are notes on stable diffusion and related machine learning concepts taken while working through a Practical Deep Learning course. They are in no particular order.

⚠️ These notes are a work in progress!


Machine Learning Models

A computer program that is trained rather than written manually. Models are generally trained to accomplish a specific task using training data where the inputs and desired outputs are known. The internal logic of the trained program is treated like a black box.

more...
In its simplest form, Machine Learning seems to just be a particular strategy for getting a computer to do a complex task. Say you want a computer to be able to do something like play chess or tell you whether an image has a blue horse in it. A programmer tries to write some code that does this and realizes it's impractical to manually code for every possible situation. So they pivot to a Machine Learning approach: they set up a training scenario that allows some code to effectively write itself, making educated guesses about how to produce the desired output across a wide variety of inputs and reinforcing itself in small increments when it succeeds, until it has a working version of itself.

This working version is often called a model and can consistently do what it was trained to do, like telling you whether a random input image has a blue horse in it. Models usually aren't perfect, maybe our blue horse identification model is accurate 92% of the time, but it works consistently across literally any range of input. If the programmer who set up the training scenario looked at the code inside the model, they wouldn't necessarily have any idea why it was doing any individual thing, because they wrote the training, not the model. This covers what machine learning is in a very general sense, but is by no means a perfect explanation. Head to Wikipedia for more, obviously.


Generative Models

Machine learning models that are trained to produce complex outputs like writing and art. An image generation model might take an input like 'an astronaut riding a horse' and be able to produce thousands of convincing, realistic images of this fictional scene.

more...
Generative Models seem to be an extremely academic subject (https://en.wikipedia.org/wiki/Generative_model) that might lean more towards math than programming. But in practical programming terms they can be thought of as a kind of reversal of identification models. For instance, if a programmer trained a machine learning model to identify whether there was a blue horse in an image, it would spit out a simple `yes, there's a blue horse in this image` or `no, there's no blue horse in this image`. If we wanted to keep anyone from posting images of blue horses on a website, or to return a bunch of images of blue horses in search engine results, this would be an extremely practical tool. But if we wanted to do something weirder, a programmer could potentially reverse that model so that it took `yes, there's a blue horse in this image` as the input and then output a completely made-up image with a blue horse in it. The output would be based purely on the model's unintelligible internal code and would probably look a little strange, since it wasn't directly based on anything real. This doesn't have as many obvious practical applications, but it is clearly a very powerful tool.

GPT-3

GPT-3 is a text-generation model that can write essay-length text from very short prompts. The name stands for Generative Pre-trained Transformer. It was released in 2020 by OpenAI in San Francisco and is now licensed exclusively to Microsoft. It is not open source but can be used for free in limited amounts.

more...
GPT-3 (https://en.wikipedia.org/wiki/GPT-3) is a text-generation model that can write essay-length text from very short prompts like 'Write a story about a secret forest full of unicorns.' The GPT stands for Generative Pre-trained Transformer. It's widely regarded as capable of producing writing that is indistinguishable or nearly indistinguishable from an actual person's writing, and has been identified as having dangerous societal implications. The New York Times has described GPT-3's capabilities as being able to write original prose with fluency equivalent to that of a human.

It's the third-generation model of the GPT-n series created by OpenAI in San Francisco, and Microsoft has held an exclusive license to it since 2020. Individuals can still use it in limited amounts, but the code itself is not publicly accessible. Internally it uses a Transformer architecture, which seems to distinguish itself from other Machine Learning architectures by processing all the separate parts of its input simultaneously rather than sequentially, like processing all three words in the input `colorless green ideas` at once rather than processing `colorless`, then `green`, then `ideas` in order. What that actually means is difficult to get at from a non-academic perspective, which may have something to do with the fact that the source code is not publicly available.
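As a hedged, toy illustration of that parallelism (this is not GPT-3's actual code, which is unpublished), here is the core of self-attention in PyTorch: every token's interaction with every other token is computed in one matrix operation rather than in a left-to-right loop.

```python
import torch
import torch.nn.functional as F

# Three stand-in token embeddings for 'colorless green ideas', as random 8-dim vectors.
tokens = torch.randn(3, 8)

# Score every token against every other token at once: a single 3x3 matrix
# of attention scores instead of a sequential pass through the sequence.
scores = tokens @ tokens.T / (8 ** 0.5)
weights = F.softmax(scores, dim=-1)
output = weights @ tokens  # each output row mixes information from all three tokens
print(output.shape)        # torch.Size([3, 8])
```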


DALL-E

DALL-E is a text-to-image generative model, also developed by OpenAI in San Francisco and first released in 2021. The code is not publicly available but is free for anyone to use in limited amounts. It's named after WALL-E, the Pixar character, and Salvador Dalí.

more...
DALL-E / DALL-E 2 (https://en.wikipedia.org/wiki/DALL-E) is a text-to-image generative model also developed by OpenAI in San Francisco, first released to limited public access in 2021. It uses a modified version of GPT-3 under the hood to produce images from written prompts. As with GPT-3, the code is not publicly available, but anyone can make a certain number of free uses and then pay for additional ones. The name is a combination of `WALL-E`, the robot Pixar character, and `Salvador Dalí`, the 20th century painter. As a side note, `WALL-E` is itself an abbreviation of `Waste Allocation Load Lifter: Earth-Class`, which may or may not mean anything here. Like GPT-3 it's based on a Transformer architecture, which seems to refer generally to processing the individual parts of the input in parallel.

Midjourney

Midjourney is a text-to-image generation model made by a San Francisco company also called Midjourney. It was released in 2022 via Discord and is free to use up to a limited point. It is not open source.

more...
Midjourney (https://en.wikipedia.org/wiki/Midjourney) is a company whose primary product / focus is a text-to-image generation model of the same name. They are based in San Francisco and entered an open beta phase in 2022 that allows people to make text-based requests to its image generation model via Discord for free up to a limited point, after which they need to pay for additional requests.

As a potential point of interest, Midjourney's founder David Holz has stated that he sees visual artists as customers rather than competitors: artists might use text-to-image generation tools like Midjourney to rapidly prototype potential works for clients (or themselves) before committing to fully actualizing an artwork.


Craiyon

An open-source text-to-image generation model / platform / tool created by Boris Dayma and hosted on Hugging Face. It was previously named DALL-E mini, but the name was changed after a request to do so by the DALL-E team.

Hugging Face

A New York-based company focused on developing open-source AI resources. As of October 2022 they provide the most accessible downloadable form of Stable Diffusion.

Stable Diffusion

Stable Diffusion is an open source text-to-image generation model developed by the CompVis group at LMU Munich in Germany. Its architecture and source code are public and will be explored in more detail in these notes.

Diffusers

The library of Stable Diffusion tools freely provided by Hugging Face. As of 2022 this is the easiest way to download a working version of Stable Diffusion.

Pipeline

Sometimes referred to as a learner, a pipeline contains a bunch of automated processes: models, inference logic, etc. You can save a pipeline to Hugging Face's library, which they refer to as the hub, and you can also download pre-trained pipelines from them.
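For example, here is a minimal sketch of downloading and running a pre-trained Stable Diffusion pipeline with the Diffusers library (the model id and prompt are just examples, and a CUDA-capable GPU is assumed):

```python
import torch
from diffusers import StableDiffusionPipeline

# Download a pre-trained pipeline (unet, vae, text encoder, scheduler) from the hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # move the bundled models to the GPU

image = pipe("an astronaut riding a horse").images[0]
image.save("astronaut.png")
```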

Unet

Identifies the noise in a noisy latent of an image and returns that noise as output, which can then be subtracted to produce a less noisy latent. It's a type of neural net that forms the first key component of stable diffusion.
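A rough sketch of how the unet's output is used over repeated denoising steps, assuming the `unet`, `scheduler`, `latents` and `text_embeddings` objects from a Diffusers pipeline:

```python
import torch

# The unet predicts the noise at each timestep; the scheduler removes
# an appropriate fraction of it, gradually cleaning up the latent.
scheduler.set_timesteps(50)
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```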

VAE

A decoder that takes a small latent as input and outputs a larger, uncompressed version of that image. It forms the complement to the CLIP text encoder and is responsible for creating our final images by decoding the denoised latents that the process produces from our original input text.
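A minimal sketch of that decoding step with a Diffusers-style `vae` (assumed to exist); the 0.18215 scaling factor is the constant Stable Diffusion v1 uses to normalize its latents:

```python
import torch

with torch.no_grad():
    # Undo the latent scaling used during training, then decode
    # a 4x64x64 latent back up to a 3x512x512 image tensor.
    image = vae.decode(latents / 0.18215).sample
```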

Decoder

A process that takes a compressed / latent version of an image and uncompresses it back to its original state. It's made up of a series of inverse-convolution layers that each uncompress the image in stages.
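To make those stages concrete, here is a hedged toy decoder in PyTorch (not Stable Diffusion's actual decoder, which is far deeper) where each inverse-convolution layer doubles the spatial size:

```python
import torch
import torch.nn as nn

# Each ConvTranspose2d stage doubles height and width: 64 -> 128 -> 256 -> 512.
toy_decoder = nn.Sequential(
    nn.ConvTranspose2d(4, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
)

latent = torch.randn(1, 4, 64, 64)  # a batch of one latent
image = toy_decoder(latent)
print(image.shape)                  # torch.Size([1, 3, 512, 512])
```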

Latents

Compressed versions of images after they are run through an encoder. Using latents is not fundamentally necessary for machine learning, but it makes the process far more efficient.
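For a sense of scale: a 512×512 RGB image holds 512 · 512 · 3 = 786,432 values, while its 4×64×64 latent holds 16,384, a 48× compression. A hedged sketch of the encoding direction, again assuming a Diffusers-style `vae` and an `image_tensor` scaled to [-1, 1]:

```python
import torch

with torch.no_grad():
    # Encode a 1x3x512x512 image batch down to a 1x4x64x64 latent,
    # applying the same 0.18215 scaling factor used elsewhere in SD v1.
    latents = vae.encode(image_tensor).latent_dist.sample() * 0.18215
print(latents.shape)  # torch.Size([1, 4, 64, 64])
```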

Guidance

An indication of what we're trying to remove noise from, like the number three in a noisy image with a three somewhere in it.
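In Stable Diffusion this usually takes the form of classifier-free guidance: the unet predicts the noise twice, once conditioned on the prompt and once on an empty prompt, and the difference between the two predictions is amplified to pull the result toward the prompt. A rough sketch, assuming the variables from the unet example above (7.5 is a typical guidance scale):

```python
# Two noise predictions: one conditioned on the prompt, one unconditioned.
noise_cond = unet(latents, t, encoder_hidden_states=text_embeddings).sample
noise_uncond = unet(latents, t, encoder_hidden_states=empty_embeddings).sample

# Exaggerate the direction the prompt pulls in.
guidance_scale = 7.5
noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```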

Text Encoder

A black box function that takes text as input and outputs a compressed / latent representation of that text. We don't care about the internal architecture, but it is a neural net containing weights, with inputs, outputs and a loss function.

Image Encoder

A black box function that takes an image as input and outputs a compressed / latent representation of that image. Again we don't care about the internal architecture, but it is also a neural net containing weights, with inputs, outputs and a loss function.

Vectors

Another term for the latent or compressed version of an input to a neural net encoder. It seems to always refer to a three-dimensional array, meaning a data structure with three indexable dimensions (for image latents, something like channels × height × width).

CLIP Text Encoder

An encoder that takes text as input and produces encoded embeddings / vectors / latents where similar text inputs produce similar embeddings.

more...
More generally a CLIP also seems to be referred to as a multi-modal model set. In the context of stable diffusion, the CLIP refers to a paired image encoder and text encoder that produce similar compressed versions of their inputs, like the text 'horse' and an image of a horse both compressing to similar vectors / latents. Paired together these allow us to turn text prompts into images.
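A minimal sketch of producing such paired embeddings with the transformers library (the checkpoint named here is the CLIP family used by Stable Diffusion v1's text encoder, but treat the details as illustrative; `horse.png` is a stand-in file):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

inputs = processor(
    text=["a horse", "a bowl of soup"],
    images=Image.open("horse.png"),
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)

# Similarity between the image and each text; the matching pair scores higher.
print(outputs.logits_per_image)
```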

Dot Product

The sum of the element-wise products of two compressed versions of input images or text (also called latents / vectors / embeddings). If a dot product is high, it indicates that the two inputs are closely related.
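A tiny worked example (the three-dimensional embeddings here are made up; real embeddings have hundreds of dimensions):

```python
import torch

text_emb = torch.tensor([0.9, 0.1, 0.4])   # stand-in embedding for the text 'horse'
image_emb = torch.tensor([0.8, 0.2, 0.5])  # stand-in embedding for a horse image

# Normalize first so the result lands between -1 and 1.
similarity = torch.dot(text_emb / text_emb.norm(), image_emb / image_emb.norm())
print(similarity)  # ~0.98, suggesting the two inputs are closely related
```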

Contrastive Loss

Seems to refer to a table where we can compare the respective latents / vectors from images and text to determine when they are related or unrelated. The C in CLIP stands for Contrastive: CLIP is short for Contrastive Language-Image Pre-training.
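A hedged sketch of that table for a batch of matching image / text pairs: compute every pairwise dot product, then train the encoders so the diagonal (the true pairs) scores highest in each row and column. This mirrors the CLIP training objective in spirit, though the real version also learns a temperature parameter:

```python
import torch
import torch.nn.functional as F

# Stand-in batches of 8 matching image and text embeddings (random here).
img_embs = F.normalize(torch.randn(8, 512), dim=-1)
txt_embs = F.normalize(torch.randn(8, 512), dim=-1)

logits = img_embs @ txt_embs.T   # the 8x8 table of similarities
targets = torch.arange(8)        # image i should match text i (the diagonal)

loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.T, targets)) / 2
```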

Distillation / Teacher / Student Networks

When a working model that requires many steps is consolidated down to a smaller number of steps by training a student network / unet to skip or combine steps. This can be performed repeatedly to minimize the number of steps.
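A rough, illustrative sketch of the idea (not a specific published method; `teacher`, `student`, `latents`, `t` and `t_next` are assumed names): the student is trained to reproduce in one step what the teacher produces in two.

```python
import torch
import torch.nn.functional as F

# The teacher takes two small denoising steps...
with torch.no_grad():
    halfway = teacher(latents, t)
    target = teacher(halfway, t_next)

# ...and the student learns to match the result in a single step.
prediction = student(latents, t)
loss = F.mse_loss(prediction, target)
loss.backward()
```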


Notes on Alignment

Alignment / Misalignment

An aligned AI is one that effectively pursues the goals intended by its designers / engineers. A misaligned AI is one that might effectively pursue a goal, just not the one intended by the humans who created it. The United Nations has called for research and policies to ensure that AI systems are aligned with human values.

AI Alignment Research

AI Alignment is a subset within the field of AI Safety in the larger field of Artificial Intelligence research. The goal of AI Alignment as a professional focus is to steer / align AI systems towards the goals and interests of the designers and engineers creating them.

Proxy Goals

Reward Hacking / Goodhart's Law

Instrumental Behaviors

Emergent Goals

Capability Control

Explainable AI / Interpretable AI

AI whose decisions can be understood by human beings. Contrasted with the idea of black-box AI, where individual decisions cannot be understood or traced back by the engineers who designed the model and facilitated its training.

Norbert Wiener / The Alignment Problem

"If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively ... we had better be quite sure that the purpose put into the machine is the purpose which we really desire." (1960)

Stuart Russell

Recommender Algorithms

Stanford researchers have observed that the recommender algorithms used by social media platforms are misaligned with their users because they optimize for simple engagement metrics rather than social and user well-being, which is harder to measure.

Asimov's Laws

Asimov's Three Laws of Robotics have been suggested as an inspiration for possible hard constraints that could ensure AI alignment. But Russell and Norvig argue that this oversimplifies the ways a misaligned AI could pervert the constraints and goals of its engineers, and that humans are not realistically capable of predicting all the concrete ways in which a misaligned AI could harm human interests. Regardless, the laws, which originally appeared in the short story "Runaround", are:

  1. A robot may not injure a human, or through inaction, allow a human to come to harm.
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

Systemic Risks

Companies and governments have already established a track record of deploying misaligned AIs due to competitive pressure, the largest example being social media's unintended effects of mass addiction, social alienation and political polarization.

Emergent Power-Seeking / Convergent Instrumental Goals

Current AI models still lack long-term planning ability and strategic awareness, which are thought to pose the most serious risks to humanity. Systems with those abilities might, as an emergent behavior, seek to maximize control over their environment, since doing so maximizes their ability to achieve a wide variety of smaller practical goals. This is referred to as power-seeking, or developing Convergent Instrumental Goals, and has already been observed in limited scenarios.

The Future of Humanity Institute