The Path to StyleGan2 – Implementing the Progressive Growing GAN

godelski · on Aug 5, 2024

I know GANs aren't all the rage now, but if you're interested in ML, they should not be overlooked.

We still use GANs a lot. They're way faster than diffusion models. Good luck getting a diffusion model to perform upscaling and denoising on a real time video call. I'm sure we'll get there, but right now you can do this with a GAN on cheap consumer hardware. You don't need a 4080, DLSS was released with the 20 series cards. They are just naturally computationally cheaper, but yeah, they do have trade-offs (though arguable since ML goes through hype phases and everyone jumps ship from one thing to another and few revisit. But when revisits happen, they tend to be competitive. See ResNets Strike Back for even CNNs vs ViTs. But there's more nuance here).

There is a reason your upscaling model is a GAN. Sure, diffusion can do this too. But why is everyone using ESRGAN? There's a reason for this.

Also, I think it is important to remember that GAN is really about a technique, not about generating images. You have a model generating things, and another model telling you something is a good output or not. LLM people... does this sound familiar?

To the author: I think it is worth pointing to Tero Karras's Nivida page. This group defined the status quo of GANs. You'll find that the vast majority of GAN research built off of their research. As quite a large portion of are literal forks. Though a fair amount of this is due to the great optimization they did, with custom cuda kernels (this is not the limiting compute factor in diffusion). https://research.nvidia.com/person/tero-karras

Two_hands · on Aug 6, 2024

I didn’t know that GANs were still in use, that’s pretty cool.

As a technique I think it’s quite stunning, from an ML perspective. Hence why I’ve decided to write these blog posts. The GAN just has something about which makes it riveting to work with.

I’ve realised that Tero Karras made major contributions, I can across the PGGAN from the StyleGAN2. What did you mean by your last sentence, what is the limiting compute factor for GANs?

radarsat1 · on Aug 6, 2024

> I didn’t know that GANs were still in use

Their still in use literally by every latent diffusion image generator, they typically target the latents of a GAN-trained decoder.

Same for audio, most audio models generate some representation that is converted to audio by a GAN-trained codec.

Two_hands · on Aug 6, 2024

I haven't looked into latent diffusion yet. But what are you saying the output is converted to images/audio using GANs?

godelski · on Aug 6, 2024

FYI, when most people say "diffusion" they are referring to "latent diffusion" (which is identical to "stable diffusion"). As for GAN's role, it's more like what I reference in the other comment. I wouldn't call them "part" of every (latent) diffusion model, but I would say they're a common part of the pipeline to the production of quality images (so I'll not deny "part").

As for audio, the above comment is true. This is typically at (as referenced) the end stage of the model. You'll also find Normalizing Flows commonly used in the middle of the model and used so you can have interpretable control over your latent space. NFs are a commonly overlooked architecture, but if you get to learning about Neural ODEs (NODEs), SDEs, Schrodinger Bridges, etc, then you'll find these are in the same family of models. If you like math you'll likely fall in love with these types of models.

Two_hands · on Aug 6, 2024

> you'll likely fall in love with these types of models

Sounds like a fun rabbit hole to fall into.

Thanks for the insights I wasn't aware that GANs are still so prevalent. And I haven't heard of a lot of these methods, I'll check them our for sure.

radarsat1 · on Aug 6, 2024

Latent diffusion models operate in latent space. This space is generated by an encoder and decoded back into pixel space by a decoder. The encoder-decoder form a generator which is trained to have good visual quality through the use of an adversarial loss.

So the encoder produces a latent space that is more efficient to train a diffusion model on, since diffusion models use Unet-like architecture that must be run many times for a single inference. The latent space is restricted by a KL penalty to a Gaussian shape such that any sample from that shape will map through the decoder to a high-quality image. This makes the generative job of the diffusion model much easier because it can focus on content and semantics rather than pixel-level details

You can see the two optimisers at work in the AutoencoderKL class in the Stable Diffusion source code here: https://github.com/CompVis/stable-diffusion/blob/main/ldm/mo...

godelski · on Aug 6, 2024

Oh yeah, they're still alive but you won't see them getting published as often due to both most people switching to diffusion and the self fulfilling prophesy of considering things dead. But yeah, if you look at any diffusion platform like Automatic1111 you'll find that GANs are a popular choice of upscaler[0,1]. So you use them together to try to benefit from each of their advantages.

Also, if you look at the top scores for FFHQ at the 256[2] or 1024[3] resolutions, you see GANs winning, and by a good amount. The best diffusion model is #4, and LDM (Stable Diffusion) is #25. Most diffusion research has avoided this dataset due to scale, but this is changing. Probably worth noting that StyleSAN is about a method, not an architecture. Also the #2 on [2] looks to be a smaller lab and they complain about limited compute and spend time arguing about why they think if they scale they'd perform better. They do have some compelling evidence given their FFHQ success is beating much bigger models. But they don't seem to have as much success on LSUN. They also are less successful on 1024, but they again claim limited compute so hard to say. They don't appear to be published in a conference, so I guess they are in fact stuck.

  > What did you mean by your last sentence, what is the limiting compute factor for GANs?

Sorry, I meant the limiting compute factor for diffusion. Why GANs are faster. I felt it was worth mentioning since I mentioned that Karras wrote custom cuda kernels for StyleGAN, and this does have a significant impact on speed. In the appendix of StyleGAN2 at the end of B under "Performance optimizations" they mention that their kernels result in an improvement of 30% in training time and 20% improvements on memory footprint.

But the limiting factor between diffusion models and (typical) GANs is that GANs are typically formulated with just a decoder. On the other hand, Diffusion has a full encoder decoder network. This is even true for Latent Diffusion models (i.e. Stable Diffusion), which specifically was designed to tackle the compute challenges of a standard diffusion model. The backbone is a UNet (almost a VAE + residual connections), which is an autoencoder and decoder (there are ViT based backbones, but these are still in the same parameter ballpark). So it is just a challenging architecture to reduce in size. There are clear benefits for doing so, but when it comes to practical applications you have to consider a wide variety of factors. I mean think about the computational costs of generating a 256x256 image with SD, that's a few gigs on your GPU. You need procumer hardware to get 1024 and I can tell you that on a 4080S that images are not instantaneously generated lol. So you're not going to use that in a compute constrained environment like gaming. But on the other hand, I can generate 60 imgs/s on a 2080Ti with StyleGAN2 (haven't checked on my 4080S). There are things like ArtSpew, that start getting closer but the image quality is crap (this is being improved FWIW). But also PGAN is crazy fast...

For more specifics I'm not sure how to accurately explain without getting into the math and a conversation about density estimation. But I don't think that is well suited for a HN conversation. This should be enough to point you in the right direction for that though.

[0] https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki...

[1] https://medium.com/rendernet/using-hires-fix-to-upscale-your...

[2] https://paperswithcode.com/sota/image-generation-on-ffhq-256...

[3] https://paperswithcode.com/sota/image-generation-on-ffhq-102...

Two_hands · on Aug 6, 2024

> the self fulfilling prophesy of considering things dead

This is quite sad, GANs are an amazing piece of tech and it doesn't seem like they are finished yet. The rule in ML is that it's never over for a method, so maybe someone somewhere will get GANs fashionable again. There's many things like this in ML though...

On the FFHQ point, are you saying currently GANs are better at benchmarks like FFHQ where the target is realistic looking images? Or better at representing the training data?

> Karras wrote custom cuda kernels for StyleGAN

I didnt know they wrote custom kernels, perhaps for my StyleGAN post I can try triton and write a custom kernel for the operations. However, I've never looked into this.

What does it mean to have a backbone? Does it just mean the underlying architecture used in the method? Also, on the decoder only vs encoder-decoder point, taken that way it's very difficult (almost impossible) to have diffusion models have a better efficiency than GANs?

Thanks for the detailed comment, you've given me a lot to think about.

godelski · on Aug 6, 2024

  > The rule in ML

There's definitely attempts to revive things (in the general sense, not just GANs), but most successes appear to come from large labs dedicating equal computational resources to the older models and often by changing names. This can make things more confusing and make things appear to be changing faster, but once you can see this, you'll have an easier time keeping up (so being new, watch out for this). I'll give some examples that are easier to read[0-2] (i.e. don't need expert knowledge to understand the nuances).

As an insider (ML researcher), my complaint isn't so much about that we have a large proportion of people chasing one specific avenue, it is that we gate keep newer methods. I think you can see a similar effect on HN when new models are proposed. They are trashed due to lack of beating existing models (this is true even beyond ML!). There will always be reasons to critique works and I don't want discourage criticism, but I do want to discourage dismissal. It hinders progress, because progress is made in small steps, not leaps and bounds. I think this can get confusing for someone entering the field (I'm sorry if I've misjudged, I'm inferring from the comments).

  > On the FFHQ point

This is an excellent question that unfortunately I don't know the answer to. I think you'll find this work helpful[3], it has the largest human study. But despite its name, StyleNAT performed best in specifically FFHQ. What I would say is that there are good arguments to make the diffusion models are better at representing a diversity of images (making them well suited for things like art generation) but theoretically GANs are approximating your full density distribution. There's some talks by Goodfellow discussing this but I don't recall which ones.

  > I didnt know they wrote custom kernels

As you've probably found, the StyleGAN code is not the easiest to read lol. Since you're using pytorch, you can find them here[4]. I encourage you to look at these, especially if you've never seen CUDA code before. Because the biggest takeaway will be that you'll see how easy it is to add a custom kernel, and given the earlier comment you'll see the utility ^__^

I'm not going to discourage you from trying triton, but I'll note that pytorch's compile goes a long way. Definitely _start_ there (see TensorRT).

  > What does it mean to have a backbone? Does it just mean the underlying architecture used in the method?

Exactly! So in the case of a diffusion model it is the UNet (the neural network part, and specifically this part estimates the parameters for the probability distribution. If this doesn't make sense now, it will later. If you are struggling to understand diffusion models after spending some time reading the papers, come back to this comment). You'll also find the term "backbone" used in application based models such as in Semantic Segmentation, Object Detection, Pose Estimation, and much more. In those cases, these are typically pretrained, so recognize this as a hyper-parameter.

  > Also, on the decoder only vs encoder-decoder point

I'm going to say something frustrating. In short: yes. If we get a but more nuanced: no. If we get really nuanced: yes. I know this isn't a great answer, but it can be really difficult to understand. On the surface, yes because you need to encode the variable and your model needs to transform a dimension starting at R^N and ending in R^N. While a (I have to stress, colloquial[5]) GAN transforms a R^M space to R^N where M << N. With more nuance you can argue it is the backbone. But to be detailed, you'll find that there are fundamental factors placing computational bounds on the theoretical performance of these architectures. To get there you'll need to carefully study Goodfellow's original paper (some follow-ups expanding on the analysis) as well as the "original" diffusion paper by Sohl-Dickstein[6] (quotes because this is debatable, but the claim has reasonable merit), and you should become familiar with Aapo Hyvärinen[7]. The last is by far the hardest part and the confusion is normal. I know quite a few well renowned and intelligent people who struggled (personally I went through the "this is hard", "this is easy", "ah fuck, I actually don't understand anything" cycle for a bit. But that's just a signal that you're progressing :).

  > Thanks for the detailed comment, you've given me a lot to think about.

Great! I hope this can help provide direction (I read your other post). The best and worth thing about machine learning is that there's so much depth. It can both be intimidating and easy to miss. But if you're passionate about learning (as it appears) you'll find the knowledge gained is highly rewarding, if unfortunately hard to gain.

And I apologize for being so verbose. It is a bad habit.

[0] Diffusion Models Beat GANs on Image Synthesis (https://arxiv.org/abs/2105.05233) The two authors are rockstars. Given your blog post, I think you'll enjoy a lot of their work. Which includes diffusion.

[1] ResNet strikes back: An improved training procedure in timm (https://arxiv.org/abs/2110.00476) Again, all three are great people to follow. You won't see many papers by Wightman, but you'll see his work with (now) Hugging Face. Notably he's one of the most important players in ViTs.

[2] A ConvNet for the 2020s (https://arxiv.org/abs/2201.03545) and ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders (https://arxiv.org/abs/2301.00808)

[3] Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models (https://arxiv.org/abs/2306.04675)

[4] https://github.com/NVlabs/stylegan2-ada-pytorch/tree/main/to...

[5] I'm sorry, I still have difficulties explaining this, especially simply. There's a lot of points here. But one is easy to understand and what I mentioned before: GAN is a training method, not an architecture. A bit more nuance will be found by reading this far underappreciated work: https://arxiv.org/abs/1912.03263. The last point I want to mention is to never forget that "generative" is a general term and these models are good for generating __data__. Images are data, but to think this is the only type of data a GAN (or any model. I literally mean any[3]) can generate is naive. All of this gets harder to explain and I don't have the skill to do so in a simple manner and am afraid it will just come off as a rant.

[6] Deep Unsupervised Learning using Nonequilibrium Thermodynamics (https://arxiv.org/abs/1503.03585)

[7] "Score matching" and "Noise-contrastive estimation" will be the most beneficial https://www.cs.helsinki.fi/u/ahyvarin/papers/

Two_hands · on Aug 6, 2024

> because progress is made in small steps,

This seems easily forgotten by a large number of people. I try to remind myself to step back from the hype and explore the lesser travelled paths.

> I'll give some examples that are easier to read[0-2]

I need to reach ResNet strikes back, it was one the first networks I implemented and it is cool to see it still being worked on.

I'll check out [3]. I've wondered recently how you could get a GAN to generate things out of distribution but that still look like the training data, if that even makes sense.

> the StyleGAN code is not the easiest to read lol

Yup, even the official PGGAN code was quite hard to understand. I'll try out the PyTorch compile I've heard a lot about it recently. I had thought TensorRT was for LLMs I suppose it's applicable in other areas too?

> so recognize this as a hyper-parameter

Okay that makes sense. I'll reread this after exploring Diffusion models too in the future.

> carefully study Goodfellow's original paper

This is something I have not done, my current workflow is just to understand how best to implement what is written. I think deep exploration is the next step, no matter how many "I know nothings" I will experience. This side of GANs I had not considered (the theoretical, it looked interesting but very complex).

> I hope this can help provide direction

It certainly will, I imagine I'll come back to this comment many times. Thanks for taking the time to read my posts and provide so much material for further study.

> if unfortunately hard to gain

I agree it is rewarding and I hope I can purvey some of this knowledge in my blog for others too! That was why I started it, so much knowledge is locked away and hard to access or understand without some guidance.

godelski · on Aug 6, 2024

  > This seems easily forgotten

I think an undervalued exercise is learning the history of a field. The value helps with this but also helps in teaching you how to tackle problems. Because you need to understand the motivations and things they had available at the time.

  > ResNet strikes back

Don't forget ConvNext!

  > how you could get a GAN to generate things out of distribution

OOD is a fuzzy term, used fast and loose. You're not really generating anything out of distribution. And remember that generative models are often improperly tested for generalization. Even most models are. If you tune your parameters on a hold out set, well then it isn't a hold out set, it is a validation set. You've provided additional information to the model: information leakage. There are also major limitations to all the metrics. You'll find the exercise with FID fairly enlightening. One major assumption is the belief that the normalization layer results in a normal distribution. Do you take this at face value? I also suggest looking into CleanFID. You'll find some surprising results if you dig deeper. Never fool yourself into thinking that metrics are objective, they are models. You can never directly measure the thing you intend to. Sometimes this proxying doesn't matter, sometimes it does. In either case, we shouldn't forget. To make this clearer, when you measure with a ruler you don't measure the length of an object in meters, you measure the length of the object in relation to your ruler. Go get a few and see how exact they are. Or go to your physics department, find the experimentalists and trade them a beer for rants on meta physics (Ian Hacking can be a good place to start).

  > the official PGGAN code

It is a fork of StyleGAN

  > TensorRT

It is general. In fact, you don't even require machine learning code. Though that is what it is targeted at. And I want to point this out, because it is an easy trap to fall into. One I fell into when starting and one many never escape from. Stop thinking about models and architectures as applications. See the forest, but don't forget the trees, the shrubs, moss, mushrooms, and all the other things in the forest. Look closely at the LLM and don't just find the differences between other architectures, but also find the similarities. It's kinda like people: easier to see the differences between us, especially because we're so similar.

  > the theoretical, it looked interesting but very complex

I tell my students: you don't need math to make good models, but you do need math to know why your models are wrong. If you don't know what's referenced in the second part, seek out a mentor who will tell you. The barrier to entry is low but don't forget the fundamentals. It's like with any programming. Your success can cause you to stop progressing because "why do I need more when this works?" It's hard work, but highly fruitful.

Keep it up. It's easy to get discouraged, but don't let that stop you. You're not as far behind as you might think.

Two_hands · on Aug 7, 2024

> You're not as far behind as you might think.

Thanks!

> you don't even require machine learning code

Interesting, I'll check it out for sure

> ConvNext

I certainly won't its on my papers to read list.

> seek out a mentor

Finding one seems to be the hard part.