FYI, when most people say "diffusion" they're referring to "latent diffusion" (the architecture behind Stable Diffusion; the two terms get used interchangeably, though Stable Diffusion is really one instance of it). As for GANs' role, it's more like what I reference in the other comment. I wouldn't call them "part" of every latent diffusion model, but I would say they're a common part of the pipeline for producing quality images (so I won't deny "part").
As for audio, the above comment is true: the GAN typically sits (as referenced) at the end stage of the model. You'll also find Normalizing Flows commonly used in the middle of the model, since they give you invertible, interpretable control over your latent space. NFs are a commonly overlooked architecture, but if you get into Neural ODEs (NODEs), SDEs, Schrödinger Bridges, etc., you'll find they're all in the same family of models. If you like math, you'll likely fall in love with these types of models.
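If you want a feel for why NFs are special (exact likelihoods, closed-form invertibility), here's a minimal sketch of a RealNVP-style affine coupling layer in PyTorch. All names and sizes are mine, purely illustrative:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer. Real flows stack many of these,
    alternating which half of the vector gets transformed."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        # Small net predicting a scale and shift from the untouched half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        # Split; transform the second half conditioned on the first.
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)          # keep scales well-behaved
        y2 = x2 * torch.exp(log_s) + t
        # log|det J| is just the sum of log-scales: cheap and exact,
        # which is what makes exact likelihood training possible.
        log_det = log_s.sum(dim=1)
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        # Invertible in closed form: rerun the net on the untouched half.
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)
        return torch.cat([y1, x2], dim=1)

x = torch.randn(8, 4)
layer = AffineCoupling(dim=4)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)  # exact round trip
```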
Latent diffusion models operate in a latent space. That space is produced by an encoder and decoded back into pixel space by a decoder. The encoder-decoder pair forms a generator that's trained for good visual quality via an adversarial loss (alongside reconstruction and perceptual losses, VQGAN-style).
So the encoder produces a latent space that's much more efficient to train a diffusion model on, since diffusion models use a U-Net-like architecture that must be run many times for a single inference. The latent space is constrained by a KL penalty toward a Gaussian shape, such that samples from that shape map through the decoder to high-quality images. This makes the diffusion model's generative job much easier, because it can focus on content and semantics rather than pixel-level detail.
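To make that objective concrete, here's a hedged, toy-scale sketch of a KL-regularized adversarial autoencoder loss. The stand-in linear modules and the loss weights are mine (the real LDM autoencoder uses conv nets, a perceptual term, and an adaptive discriminator weight; its KL weight is tiny, on the order of 1e-6):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, zdim = 16, 4                   # toy sizes; the real nets are convolutional
enc = nn.Linear(d, 2 * zdim)      # stand-in encoder -> (mu, logvar)
dec = nn.Linear(zdim, d)          # stand-in decoder
disc = nn.Linear(d, 1)            # stand-in (patch) discriminator

def autoencoder_loss(x, kl_weight=1e-6, adv_weight=0.5):
    mu, logvar = enc(x).chunk(2, dim=1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
    x_hat = dec(z)
    rec = F.l1_loss(x_hat, x)     # pixel-level reconstruction
    # KL penalty toward N(0, I): this is what keeps the latent space
    # roughly Gaussian so the diffusion model has a well-behaved target.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()
    # Adversarial term: the discriminator pushes decoded images to look
    # realistic. This is the GAN part of the pipeline.
    adv = -disc(x_hat).mean()
    return rec + kl_weight * kl + adv_weight * adv

loss = autoencoder_loss(torch.randn(8, d))
```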
They're still in use by essentially every latent diffusion image generator: the diffusion model typically targets the latents of a GAN-trained decoder.
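You can poke at exactly this component yourself; e.g. Stable Diffusion's autoencoder ships standalone in Hugging Face's diffusers library (the checkpoint below is one public VAE fine-tune, swap in whichever you like):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
x = torch.randn(1, 3, 512, 512)   # stand-in for a real image scaled to [-1, 1]

with torch.no_grad():
    # 1x4x64x64: this compressed space is where the diffusion U-Net works.
    # (SD also scales these latents by a constant, ~0.18215, before diffusion.)
    z = vae.encode(x).latent_dist.sample()
    x_hat = vae.decode(z).sample  # back to 1x3x512x512 pixels

print(z.shape, x_hat.shape)
```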
Same for audio: most audio models generate some intermediate representation (e.g. a mel spectrogram or discrete codes) that's converted to a waveform by a GAN-trained codec or vocoder.
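If "GAN-trained codec" sounds abstract, here's a toy sketch of the HiFi-GAN-style vocoder objective: least-squares GAN losses plus a mel reconstruction term. The 45x weight is from the HiFi-GAN paper; the linear stand-ins replace real transposed-conv generators and multi-period/multi-scale discriminators, and I've dropped the feature-matching loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_mel, T, hop = 80, 32, 256
generator = nn.Sequential(nn.Flatten(), nn.Linear(n_mel * T, T * hop))  # mel -> waveform
discriminator = nn.Linear(T * hop, 1)   # waveform -> realism score
to_mel = nn.Linear(T * hop, n_mel * T)  # stand-in for a mel-spectrogram transform

mel = torch.randn(4, n_mel, T)          # "ground-truth" mel frames
wav = torch.randn(4, T * hop)           # matching waveform

fake = generator(mel)                   # mel -> waveform

# Least-squares GAN losses, as in HiFi-GAN.
d_loss = ((discriminator(wav) - 1) ** 2).mean() + (discriminator(fake.detach()) ** 2).mean()
g_adv = ((discriminator(fake) - 1) ** 2).mean()

# Mel reconstruction keeps the generator faithful to content, while the
# adversarial term supplies the realistic fine detail a plain L1 loss smears out.
g_rec = F.l1_loss(to_mel(fake), to_mel(wav))
g_loss = g_adv + 45.0 * g_rec
```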