AI Art Generation Handbook/Diffusion Model

Note 1: There are many types of diffusion models in public use right now, and each text-to-image model has its own implementation.

(Some may even be better than, or completely different from, what is presented here.)

Note 2: This chapter is a layman's guide to the overall ideas behind how diffusion models work, written for a general audience.

Forward Diffusion Model

First of all, to train a diffusion model, the forward diffusion process adds noise to the training image layer by layer.

In this example, we have an image of a rhino by a rocky riverbed. Starting from T=0 (the original image), the forward diffusion process adds more and more noise at each consecutive step (T) until the image is virtually unrecognisable. Although it seems counter-intuitive, this whole process is what teaches the denoising diffusion model how to remove noise from an image.
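The step-by-step noising above can be sketched in a few lines. This is a minimal illustration assuming a DDPM-style linear noise schedule and the standard closed-form shortcut for jumping straight to step T; actual models differ in their schedules and details.

```python
import numpy as np

def forward_diffusion(x0, t, betas, rng):
    """Noise the clean image x0 up to step t in one jump.

    Uses the closed form q(x_t | x_0): scale the image down by
    sqrt(alpha_bar_t) and mix in Gaussian noise (a common DDPM-style
    formulation; an assumption here, not any specific model's exact code).
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]          # cumulative signal retained at step t
    noise = rng.standard_normal(x0.shape)      # fresh Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)          # linear noise schedule over 1000 steps
image = rng.standard_normal((8, 8))            # tiny stand-in for the rhino photo

slightly_noisy = forward_diffusion(image, 10, betas, rng)     # early step: rhino still visible
nearly_pure_noise = forward_diffusion(image, 999, betas, rng) # final step: unrecognisable
```

By the last step almost none of the original signal remains, which matches the "virtually unrecognisable" endpoint described above.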


Denoising Diffusion Model

In the next process, the denoising diffusion model learns from the prior forward diffusion process how to remove noise from an image.

The denoising diffusion model tries to remove one "step" of noise from the input image, and repeats this over and over until, hopefully, it is able to reconstruct the image at T=0 (the original image).

However, this process will almost never produce an exact replica of the original image, and it introduces small variations in the output.
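The repeated one-step denoising can be sketched as a loop. This is an illustrative DDPM-style reverse step under the same schedule assumptions as before; `predict_noise` is a placeholder for the trained network, not a real model.

```python
import numpy as np

def denoise_step(x_t, t, betas, predict_noise, rng):
    """One reverse step: estimate the noise in x_t and partially remove it."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    eps_hat = predict_noise(x_t, t)            # the network's guess at the noise
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        # fresh noise is injected at every step except the last; this is why
        # the output is almost never an exact replica of the original image
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

rng = np.random.default_rng(1)
betas = np.linspace(1e-4, 0.02, 50)
fake_net = lambda x, t: np.zeros_like(x)       # placeholder for a trained model

x = rng.standard_normal((8, 8))                # start from pure noise at step T
for t in reversed(range(50)):                  # walk backwards to T = 0
    x = denoise_step(x, t, betas, fake_net, rng)
```

The random noise added inside each step is the source of the small run-to-run variations mentioned above.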


Latent Diffusion Model

According to Wiktionary, latent in this context means "lying dormant or hidden until circumstances are suitable for development".

Hence, in this context, a latent diffusion model does not apply the diffusion process directly to the full image; instead, it projects the input into a compressed image representation (the latent space) and applies the diffusion process there to reconstruct the image.
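The compress-then-diffuse idea can be sketched with toy stand-ins. In a real latent diffusion model the encoder and decoder are a trained autoencoder (e.g. a VAE); here average pooling and nearest-neighbour upsampling are assumptions used purely to show where the diffusion process would run.

```python
import numpy as np

def encode(image):
    """Compress a 512x512 'image' to a 64x64 latent by 8x average pooling.

    Toy stand-in for a learned encoder (an assumption for illustration).
    """
    h, w = image.shape
    return image.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))

def decode(latent):
    """Map the latent back to pixel space (nearest-neighbour upsampling here)."""
    return np.repeat(np.repeat(latent, 8, axis=0), 8, axis=1)

image = np.random.default_rng(2).standard_normal((512, 512))
latent = encode(image)             # the diffusion process runs in this small space
print(image.size // latent.size)   # → 64: far fewer values to denoise per step
restored = decode(latent)          # the decoder maps the result back to pixels
```

Running diffusion on 64x fewer values is the main efficiency win of the latent approach.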


The latent space can be visualised as shown in the picture on the left (drawn as a 3D scatterplot for ease of understanding).

For example, suppose we need to classify animals such as the Javan rhinoceros.

A Javan rhinoceros can be described by many features, such as its horn, its grey skin, its near-extinct status, and so on.

To increase efficiency and save space, the latent space keeps only the most important and distinguishing features of the Javan rhino (e.g. lives in the Javan rainforest, has a single horn). The latent space also tries to keep these compressed image representations from overlapping with the features of other animals.

Hence, the distinguishing features of the Javan rhinoceros are clustered together in the latent space representation of the data.

As a result, the representations of different Javan rhinoceros images become less distinct from one another and more similar. If we imagine them in the latent space, the Javan rhinoceros data points would sit "closer" together.
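This clustering can be illustrated with a tiny numerical example. The 3-D vectors and their feature meanings below are entirely hypothetical, chosen only to show that points for the same animal end up near each other while unrelated animals sit far apart.

```python
import numpy as np

# Hypothetical 3-D latent vectors (assumed for illustration): each axis could
# encode a learned feature such as "has one horn" or "grey skin".
latents = {
    "javan_rhino_1": np.array([0.90, 0.80, 0.10]),
    "javan_rhino_2": np.array([0.85, 0.75, 0.15]),
    "zebra":         np.array([0.10, 0.20, 0.90]),
}

def dist(a, b):
    """Euclidean distance between two points in the latent space."""
    return float(np.linalg.norm(latents[a] - latents[b]))

# Two Javan rhino images sit close together; the zebra sits far away.
print(dist("javan_rhino_1", "javan_rhino_2") < dist("javan_rhino_1", "zebra"))  # → True
```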

