The idea of connecting CV to audio via spectrograms predates Jeremy Howard's course by quite a bit. That's not really the interesting part here, though. What's interesting is that a simple extension of an image generation pipeline produces such impressive results for generative audio. It really emphasizes how useful the idea of Stable Diffusion is.
edit: added a bit more to the thought