
> I didn’t know that GANs were still in use

They're still in use by literally every latent diffusion image generator: the diffusion model typically targets the latents of a GAN-trained decoder.

Same for audio, most audio models generate some representation that is converted to audio by a GAN-trained codec.



I haven't looked into latent diffusion yet. But are you saying the output is converted to images/audio using GANs?


FYI, when most people say "diffusion" they are referring to "latent diffusion" (Stable Diffusion is the best-known latent diffusion model). As for GANs' role, it's more like what I reference in the other comment: I wouldn't call them "part" of every (latent) diffusion model, but they are a common part of the pipeline for producing quality images (so I won't deny "part").

As for audio, the above comment is true. This is typically at (as referenced) the end stage of the model. You'll also find Normalizing Flows commonly used in the middle of the model, since they give you interpretable control over your latent space. NFs are a commonly overlooked architecture, but if you get to learning about Neural ODEs (NODEs), SDEs, Schrödinger Bridges, etc., you'll find these are in the same family of models. If you like math you'll likely fall in love with these types of models.
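To make the flow idea concrete, here's a minimal sketch of the change-of-variables rule that all normalizing flows share. This is a single elementwise affine transform for illustration only (real flows like RealNVP or Glow stack many coupling layers); the function names are mine, not from any library:

```python
import numpy as np

# Normalizing-flow sketch: an invertible map f with a tractable Jacobian.
# The exact likelihood comes from the change-of-variables formula:
#   log p_x(x) = log p_z(f(x)) + log |det df/dx|

def forward(x, log_scale, shift):
    """Map data x to latent z; return z and the log-det-Jacobian."""
    z = x * np.exp(log_scale) + shift
    log_det = np.sum(log_scale)  # Jacobian of an elementwise affine map is diagonal
    return z, log_det

def inverse(z, log_scale, shift):
    """Exact inverse: flows are bijective by construction."""
    return (z - shift) * np.exp(-log_scale)

def log_prob(x, log_scale, shift):
    """Exact log-likelihood under a standard-normal base distribution."""
    z, log_det = forward(x, log_scale, shift)
    log_pz = -0.5 * np.sum(z**2 + np.log(2 * np.pi))
    return log_pz + log_det

x = np.array([0.5, -1.2])
log_scale = np.array([0.1, -0.3])
shift = np.array([0.2, 0.0])
z, _ = forward(x, log_scale, shift)
assert np.allclose(inverse(z, log_scale, shift), x)  # invertibility holds exactly
```

The exact invertibility and exact log-likelihood are what make flows useful mid-pipeline: you can move between data and latent space in both directions without approximation.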


> you'll likely fall in love with these types of models

Sounds like a fun rabbit hole to fall into.

Thanks for the insights. I wasn't aware that GANs are still so prevalent, and I haven't heard of a lot of these methods; I'll check them out for sure.


Latent diffusion models operate in latent space. This space is generated by an encoder and decoded back into pixel space by a decoder. The encoder and decoder together form a generator, which is trained to have good visual quality through the use of an adversarial loss.

So the encoder produces a latent space that is more efficient to train a diffusion model on, since diffusion models use a UNet-like architecture that must be run many times for a single inference. The latent space is constrained by a KL penalty toward a Gaussian shape, so that any sample from that distribution maps through the decoder to a high-quality image. This makes the generative job of the diffusion model much easier, because it can focus on content and semantics rather than pixel-level details.
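The training objective described above can be sketched as follows. This is an illustrative NumPy sketch of the loss terms, not the actual Stable Diffusion code; all function names and the weight values are assumptions, though the KL formula is the standard closed form for a diagonal Gaussian against N(0, I):

```python
import numpy as np

# Sketch of an AutoencoderKL-style objective: the encoder predicts a
# diagonal Gaussian (mu, log_var) per latent; a small KL penalty pulls
# it toward N(0, I), and an adversarial (GAN) term on the decoder
# output pushes reconstructions toward realistic images.

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

def hinge_g_loss(d_fake):
    """Generator side of a hinge GAN loss: raise the discriminator score."""
    return -np.mean(d_fake)

def total_loss(x, x_rec, mu, log_var, d_fake, kl_weight=1e-6, adv_weight=0.5):
    """Reconstruction + lightly weighted KL penalty + adversarial term.

    The tiny kl_weight is why the latent space is only 'restricted toward'
    a Gaussian rather than fully collapsed to one, as in a plain VAE.
    """
    rec = np.mean(np.abs(x - x_rec))  # L1 reconstruction error
    kl = kl_to_standard_normal(mu, log_var)
    return rec + kl_weight * kl + adv_weight * hinge_g_loss(d_fake)

# A standard normal has zero KL against itself:
assert kl_to_standard_normal(np.zeros(4), np.zeros(4)) == 0.0
```

In the real code a second optimiser trains the discriminator in alternation with the generator, which is the two-optimiser setup visible in the AutoencoderKL class linked below.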

You can see the two optimisers at work in the AutoencoderKL class in the Stable Diffusion source code here: https://github.com/CompVis/stable-diffusion/blob/main/ldm/mo...



