What Is Stable Diffusion and How Does It Work?
Over the past few years, revolutionary models have appeared in the field of AI image generation. Stable Diffusion is a deep learning text-to-image model published in 2022. It can create images conditioned on textual descriptions. Simply put, the text we write in the prompt is converted into an image! How is this possible?
Stable Diffusion is a type of latent diffusion model. It works in a latent space to take advantage of a low-dimensional representation of the data, and it uses a diffusion process of adding and removing noise to generate an image based on the text. In the following sections, I will describe latent spaces and how diffusion models work in more detail, and I will show an interesting example of an image the model can generate from a given text.
Latent space
A latent (“hidden”) space is, simply put, a representation of compressed data. Data compression is the process of encoding information using fewer bits than the original representation. Imagine that we have to represent a 20-dimensional vector with a 10-dimensional one. By reducing dimensionality we lose data; however, in this case that is not a bad thing. Reducing dimensionality allows us to filter out less important information and keep only what matters most.
In short, suppose we want to train a model that classifies images using a convolutional neural network. When we say that the model is learning, we mean that it is learning specific features at each layer of the network: edges, particular angles, shapes, and so on. Each time the model learns from data (an existing image), the dimensions of the image are reduced before being brought back to their original size. In the end, the model reconstructs the image from the compressed data using a decoder, having learned all the relevant information along the way. The space therefore becomes smaller, so that only the most important features are extracted and kept. This is why a latent space is well suited to diffusion models: it gives us a way to extract the most important features from a training set of many highly detailed images, and those features can then be used to decide whether two arbitrary objects belong to the same category or to different ones.
Extracting the most important features using convolutional neural networks (Source: Hackernoon)
Extracting the most important features using convolutional neural networks (Source: PyTorch)
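To make the idea of a latent space more concrete, here is a minimal autoencoder sketch in PyTorch. The class name TinyAutoencoder and the exact layer sizes are my own illustrative choices, not part of Stable Diffusion itself; the point is only that the encoder squeezes an image into a small latent vector and the decoder reconstructs the image from that compressed representation.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Compresses a 3x64x64 image into a small latent vector and reconstructs it."""
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: downsample the image and squeeze it into `latent_dim` numbers
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),                    # the latent representation
        )
        # Decoder: expand the latent vector back into an image
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),  # 16x16 -> 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),   # 32x32 -> 64x64
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)              # the compressed, "latent" representation
        return self.decoder(z), z

# Usage: a batch of 8 random "images" squeezed into 32-dimensional latent vectors
model = TinyAutoencoder()
images = torch.rand(8, 3, 64, 64)
reconstruction, latents = model(images)
print(latents.shape)  # torch.Size([8, 32])
```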
Diffusion models
Diffusion models are generative models: they are used to generate data similar to the data on which they were trained. Fundamentally, diffusion models work by “destroying” the training data through the iterative addition of Gaussian noise, and then learning how to recover the data by removing that noise.
Gaussian noise (Source: Hasty.ai)
The forward diffusion process adds more and more noise to the image: the image is noised over t different time steps, until at step T the whole image is pure noise. Backward (reverse) diffusion is the opposite process, in which the noise present at time step t is iteratively removed to obtain the image at time step t-1. This is repeated until all the noise has been removed from the image, using a U-Net convolutional neural network which, among its many applications in machine and deep learning, is here trained to estimate the amount of noise in the image.
(Source: NVIDIA Developer)
From left to right, the picture shows the iterative addition of noise; from right to left, it shows the iterative removal of noise.
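As a rough illustration of the forward process, the sketch below adds Gaussian noise to an image over T steps. It follows the standard DDPM-style formulation with a simple linear noise schedule; the variable names are my own, not the actual Stable Diffusion code.

```python
import torch

T = 1000                                   # total number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product: how much of the signal survives

def add_noise(x0, t):
    """Jump straight from a clean image x0 to its noisy version at step t."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise

# Usage: a clean "image" noised at an early and at a late step
x0 = torch.rand(1, 3, 64, 64)
x_early, _ = add_noise(x0, t=50)    # still mostly image
x_late, _ = add_noise(x0, t=950)    # almost pure Gaussian noise
```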
For estimating and removing the noise, the U-Net is used most often. Interestingly, this neural network has an architecture resembling the letter U, which is how it got its name. The U-Net is a fully convolutional neural network, which makes it very well suited to processing images. What distinguishes the U-Net is its ability to take an image as input and find a low-dimensional representation of it through downsampling, which makes it easier to process and to extract the important features, and then to return the image to its original dimensions through upsampling.
(Source: https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/)
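Below is a deliberately tiny U-Net-style sketch in PyTorch: it downsamples the input, upsamples it back to the original resolution, and concatenates a skip connection, which is what gives the architecture its U shape. A real noise-prediction U-Net has many more levels, attention blocks, and a time-step embedding; this is only meant to show the shape of the idea, and the names are my own.

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """A toy U-Net: one downsampling level, one upsampling level, one skip connection."""
    def __init__(self, channels=3):
        super().__init__()
        self.down = nn.Sequential(                     # contracting path: 64x64 -> 32x32
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.bottleneck = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.up = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)  # 32x32 -> 64x64
        self.out = nn.Conv2d(32 + channels, channels, 3, padding=1)   # skip connection joins here

    def forward(self, x):
        skip = x                          # keep the high-resolution features
        h = self.down(x)                  # downsample (find compact features)
        h = self.bottleneck(h)
        h = self.up(h)                    # upsample back to the input resolution
        h = torch.cat([h, skip], dim=1)   # the U-shaped skip connection
        return self.out(h)                # e.g. a prediction of the noise in x

# Usage
net = MiniUNet()
noisy = torch.randn(1, 3, 64, 64)
predicted_noise = net(noisy)
print(predicted_noise.shape)  # torch.Size([1, 3, 64, 64])
```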
In more detail, removing the noise, that is, transitioning from an arbitrary time step t to time step t-1, where t is a number between T0 (the image without noise) and the final step TMAX (pure noise), happens as follows: the input is the image at time step t, which contains a certain amount of noise. Using the U-Net, the total amount of noise in the image can be predicted, and then a “part” of that total noise is removed from the image at time step t. The result is the image at time step t-1, which contains less noise.
(Source: NVIDIA Developer)
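A sketch of a single denoising step, using the hypothetical schedule variables from the forward-process snippet above and the DDPM-style update rule. A real noise-prediction network is also conditioned on the step t (and, in Stable Diffusion, on the text); that conditioning is omitted here for brevity.

```python
import torch

@torch.no_grad()
def denoise_step(unet, x_t, t, betas, alphas, alpha_bars):
    """Go from the image at step t to the (slightly cleaner) image at step t-1."""
    predicted_noise = unet(x_t)                               # U-Net estimates the total noise
    coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
    mean = (x_t - coef * predicted_noise) / alphas[t].sqrt()  # remove only a "part" of the noise
    if t > 0:
        mean = mean + betas[t].sqrt() * torch.randn_like(x_t) # keep a little randomness
    return mean                                               # this is the image at step t-1
```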
Mathematically, it makes much more sense to repeat this procedure T times than to try to remove all of the noise at once. By repeating it, the noise is removed gradually and we get a much “cleaner” image. Simplified, the process is as follows: we take an initial image, add complete noise to it, and then try to predict the noise-free image by removing the noise iteratively.
(Source: Prog.World)
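Putting the pieces together, a simplified sampling loop would start from pure noise and call the single-step function repeatedly. This again reuses the hypothetical helpers defined above (the toy network is untrained, so the output is only illustrative).

```python
import torch

@torch.no_grad()
def sample(unet, shape, T, betas, alphas, alpha_bars):
    """Start from pure Gaussian noise and iteratively denoise it T times."""
    x = torch.randn(shape)                 # step T: the image is just noise
    for t in reversed(range(T)):           # T-1, T-2, ..., 1, 0
        x = denoise_step(unet, x, t, betas, alphas, alpha_bars)
    return x                               # the generated image

# Usage, reusing the toy network and the schedule from the previous snippets
generated = sample(MiniUNet(), shape=(1, 3, 64, 64), T=1000,
                   betas=betas, alphas=alphas, alpha_bars=alpha_bars)
```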
Text
The text is fed into the model by “embedding” the words with a language transformer: the words are first converted into numbers (tokens), and this representation of the text is then added to the U-Net’s input (alongside the image), passing through every layer of the U-Net and transforming together with the image. This is done from the first iteration onward, and the same text is added at every following iteration after each noise estimate. We could say that the text “serves as a guideline” for generating the image, starting from the first iteration, where there is only noise, and continuing throughout the entire iterative process.
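Here is a short sketch of the embedding step using the Hugging Face transformers library, assuming the CLIP text encoder used by Stable Diffusion v1. In the real pipeline these embeddings are fed into the U-Net’s cross-attention layers at every denoising step.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "A pikachu fine dining with a view to the Eiffel Tower"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    # One 768-dimensional vector per token; this is what "guides" the U-Net
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(text_embeddings.shape)  # torch.Size([1, 77, 768])
```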
Stable Diffusion
Why stable diffusion? The biggest problem with diffusion models is that they are extremely “expensive” in terms of time and computation. Thanks to the way the U-Net works, stable diffusion overcomes these problems. If we wanted to generate an image with dimensions of 1024x1024, the U-Net would have to work with noise of size 1024x1024 and produce the image from it. That is an expensive approach for even a single diffusion step, especially since the procedure is repeated t times, where t can be in the hundreds. In the example below, the number of time steps is 45. The stable diffusion model generated two images for that number of steps in around 6 seconds, while some other diffusion models needed anywhere from 5 to 20 minutes, depending on the GPU specification and the size of the image. The most interesting part is that a stable diffusion model can run successfully on a local machine with as little as 8 GB of VRAM.
Until now, this problem has usually been solved by training on smaller images of size 256x256 and then using an additional neural network to produce the image at a higher resolution (super-resolution diffusion). The latent diffusion model takes a different approach: latent diffusion models do not operate directly on the image, but in the latent space! The initial image is encoded into a much smaller space, so that the noise is added and removed on a low-dimensional representation of our image using the U-Net.
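The sketch below shows this idea schematically: the denoising loop runs on a small latent tensor, and only at the very end is the result decoded back to a full-resolution image. Here vae and unet are placeholders standing in for Stable Diffusion’s actual autoencoder and text-conditioned U-Net; the schedule variables are the hypothetical ones from the earlier snippets.

```python
import torch

@torch.no_grad()
def generate_in_latent_space(vae, unet, text_embeddings, T, betas, alphas, alpha_bars):
    """Sketch of the latent diffusion idea: denoise a small latent, then decode it."""
    # Work on a 4x64x64 latent instead of a full-resolution image -- this factor-8
    # spatial compression is what makes the whole process cheap.
    latents = torch.randn(1, 4, 64, 64)
    for t in reversed(range(T)):
        # In the real model, the U-Net also receives the time step and the text
        # embeddings (via cross-attention); shown here only schematically.
        noise_pred = unet(latents, t, text_embeddings)
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        latents = (latents - coef * noise_pred) / alphas[t].sqrt()
        if t > 0:
            latents = latents + betas[t].sqrt() * torch.randn_like(latents)
    image = vae.decode(latents)   # only now do we go back to full image resolution
    return image
```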
The final architecture of the stable diffusion model
Possibilities
Thanks to this architecture, stable diffusion can generate an image “from scratch” from a text prompt describing which elements should appear in, or be left out of, the output. It is also possible to add new elements to an already existing image using a text prompt. Finally, I’d like to show you one example: on the website huggingface.co there is a Space where you can experiment with a stable diffusion model. Here are the results when the text “A pikachu fine dining with a view to the Eiffel Tower” is entered as the prompt:
A direct link to the space: https://huggingface.co/spaces/stabilityai/stable-diffusion
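If you would rather run it locally than in the browser, the Hugging Face diffusers library provides a ready-made pipeline. A minimal sketch is shown below; the model name and parameters are just one common choice, and it assumes a CUDA GPU with enough VRAM.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # needs a GPU; ~8 GB of VRAM is usually enough in half precision

prompt = "A pikachu fine dining with a view to the Eiffel Tower"
image = pipe(prompt, num_inference_steps=45).images[0]  # 45 steps, as in the example above
image.save("pikachu_eiffel.png")
```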
Stable diffusion has gained enormous publicity over the last couple of months. The reasons are that its source code is available to everyone, that stable diffusion claims no rights over the generated images, and that it gives people room to be creative, as can be seen from the large number of incredible published images this model has created. Generative models are being developed and improved, one could fairly say, on a weekly basis, and it will be very interesting to follow their future progress.