FYI, when most people say "diffusion" they're referring to "latent diffusion" (the architecture behind Stable Diffusion; the two terms get used interchangeably, though Stable Diffusion is really one instance of it). As for GANs' role, it's more like what I reference in the other comment. I wouldn't call them "part" of every latent diffusion model, but I would say they're a common part of the pipeline for producing quality images (so I won't deny "part").
As for audio, the above comment is true: the GAN typically sits (as referenced) at the end stage of the model. You'll also find Normalizing Flows commonly used in the middle of the model, since they give you invertible, interpretable control over your latent space. NFs are a commonly overlooked architecture, but if you get into Neural ODEs (NODEs), SDEs, Schrödinger Bridges, etc., you'll find they're all in the same family of models. If you like math, you'll likely fall in love with these types of models.
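If you want a feel for why NFs are special (exact likelihoods, closed-form invertibility), here's a minimal sketch of a RealNVP-style affine coupling layer in PyTorch. All names and sizes are mine, purely illustrative:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer. Real flows stack many of these,
    alternating which half of the vector gets transformed."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        # Small net predicting a scale and shift from the untouched half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        # Split; transform the second half conditioned on the first.
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)          # keep scales well-behaved
        y2 = x2 * torch.exp(log_s) + t
        # log|det J| is just the sum of log-scales: cheap and exact,
        # which is what makes exact likelihood training possible.
        log_det = log_s.sum(dim=1)
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        # Invertible in closed form: rerun the net on the untouched half.
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=1)

x = torch.randn(8, 4)
layer = AffineCoupling(dim=4)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)  # exact round trip
```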
Latent diffusion models operate in a latent space. That space is produced by an encoder and decoded back into pixel space by a decoder. The encoder-decoder pair forms a generator that's trained for good visual quality via an adversarial loss (alongside reconstruction and perceptual losses, VQGAN-style).
So the encoder produces a latent space that's much more efficient to train a diffusion model on, since diffusion models use a U-Net-like architecture that must be run many times for a single inference. The latent space is constrained by a KL penalty toward a Gaussian shape, such that samples from that shape map through the decoder to high-quality images. This makes the diffusion model's generative job much easier, because it can focus on content and semantics rather than pixel-level detail.
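To make that objective concrete, here's a hedged, toy-scale sketch of a KL-regularized adversarial autoencoder loss. The stand-in linear modules and the loss weights are mine (the real LDM autoencoder uses conv nets, a perceptual term, and an adaptive discriminator weight; its KL weight is tiny, on the order of 1e-6):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, zdim = 16, 4                   # toy sizes; the real nets are convolutional
enc = nn.Linear(d, 2 * zdim)      # stand-in encoder -> (mu, logvar)
dec = nn.Linear(zdim, d)          # stand-in decoder
disc = nn.Linear(d, 1)            # stand-in (patch) discriminator

def autoencoder_loss(x, kl_weight=1e-6, adv_weight=0.5):
    mu, logvar = enc(x).chunk(2, dim=1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
    x_hat = dec(z)
    rec = F.l1_loss(x_hat, x)     # pixel-level reconstruction
    # KL penalty toward N(0, I): this is what keeps the latent space
    # roughly Gaussian so the diffusion model has a well-behaved target.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()
    # Adversarial term: the discriminator pushes decoded images to look
    # realistic. This is the GAN part of the pipeline.
    adv = -disc(x_hat).mean()
    return rec + kl_weight * kl + adv_weight * adv

loss = autoencoder_loss(torch.randn(8, d))
```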
They're still in use by essentially every latent diffusion image generator: the diffusion model typically targets the latents of a GAN-trained decoder.
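You can poke at exactly this component yourself; e.g. Stable Diffusion's autoencoder ships standalone in Hugging Face's diffusers library (the checkpoint below is one public VAE fine-tune, swap in whichever you like):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
x = torch.randn(1, 3, 512, 512)   # stand-in for a real image scaled to [-1, 1]

with torch.no_grad():
    # 1x4x64x64: this compressed space is where the diffusion U-Net works.
    # (SD also scales these latents by a constant, ~0.18215, before diffusion.)
    z = vae.encode(x).latent_dist.sample()
    x_hat = vae.decode(z).sample  # back to 1x3x512x512 pixels

print(z.shape, x_hat.shape)
```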
Same for audio: most audio models generate some intermediate representation (e.g. a mel spectrogram or discrete codes) that's converted to a waveform by a GAN-trained codec or vocoder.
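If "GAN-trained codec" sounds abstract, here's a toy sketch of the HiFi-GAN-style vocoder objective: least-squares GAN losses plus a mel reconstruction term. The 45x weight is from the HiFi-GAN paper; the linear stand-ins replace real transposed-conv generators and multi-period/multi-scale discriminators, and I've dropped the feature-matching loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_mel, T, hop = 80, 32, 256
generator = nn.Sequential(nn.Flatten(), nn.Linear(n_mel * T, T * hop))  # mel -> waveform
discriminator = nn.Linear(T * hop, 1)   # waveform -> realism score
to_mel = nn.Linear(T * hop, n_mel * T)  # stand-in for a mel-spectrogram transform

mel = torch.randn(4, n_mel, T)          # "ground-truth" mel frames
wav = torch.randn(4, T * hop)           # matching waveform

fake = generator(mel)                   # mel -> waveform

# Least-squares GAN losses, as in HiFi-GAN.
d_loss = ((discriminator(wav) - 1) ** 2).mean() + (discriminator(fake.detach()) ** 2).mean()
g_adv = ((discriminator(fake) - 1) ** 2).mean()

# Mel reconstruction keeps the generator faithful to content, while the
# adversarial term supplies the realistic fine detail a plain L1 loss smears out.
g_rec = F.l1_loss(to_mel(fake), to_mel(wav))
g_loss = g_adv + 45.0 * g_rec
```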