Stable Audio and MusicGen sound better than Jukebox.
But the best so far is Suno.ai (https://app.suno.ai), especially with their V3 model. The results are very impressive; the fidelity isn't studio quality, but they're getting very close.
It's very likely based on the TTS model they released before (Bark), but trained on more data and at higher resolution.
I tried a dozen different prompts in Suno AI to generate some music - it completely ignored them. It just generated some simple pop-sounding tunes every time. I'm not impressed.
The lyrics on those songs are basically a pastiche of everything popular. But just based on sound it's pretty convincing. I bet you could train a generative AI to push out trance bangers as long as they don't have any lyrics.
I mean, the user wrote the lyrics (or maybe ChatGPT, lol). People of course write more interesting ones (just found a Ukrainian one about Kyiv: https://sonauto.ai/songs/xRDqe57ZgT6QrFXiIzF0). There was another one about Frank Sinatra being attacked with a baseball bat, but I can't find it right now.
I just tried Suno and the results to me are terrible.
It seems designed for making pop music no one will listen to.
I have spent many hours with MusicLM making wild experimental music no one will listen to.
MusicLM has no problem making really weird sound combinations.
I just gave Suno AI some of my saved MusicLM prompts and the results are garbage. The problem with the AI Test Kitchen model, though, is that the results sound like they're in mono.
The ultimate for me will be when we can make rap/hiphop no one will listen to.
The two piano examples, where the second one has its phase randomized, are also an excellent illustration of why allpass filters (which change the phase but not the amplitude of each frequency) are a building block for digital reverbs. The second piano example with the randomized phases sounds more blurred out and almost reverb-y.
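If you want to hear what that phase trick does yourself, here's a minimal sketch, assuming a mono float numpy array `x`; the Schroeder allpass at the end is the classic reverb building block, and the parameter values are just illustrative:

```python
# Randomize the phase of a signal while keeping its magnitude spectrum
# intact (what the "second piano" example does). Illustrative sketch only.
import numpy as np

def randomize_phase(x, seed=0):
    rng = np.random.default_rng(seed)
    X = np.fft.rfft(x)                            # complex spectrum
    mag = np.abs(X)                               # keep amplitudes
    phase = rng.uniform(-np.pi, np.pi, X.shape)   # throw away the original phases
    phase[0] = 0.0                                # keep the DC bin real
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(x))

# A Schroeder allpass does something related but gentler: unit gain at every
# frequency, frequency-dependent phase/delay. Chains of these blur transients
# into reverb-like tails.
def schroeder_allpass(x, delay=441, g=0.7):
    y = np.zeros_like(x)
    buf = np.zeros(delay)                         # circular delay line
    for n in range(len(x)):
        v = x[n] + g * buf[n % delay]             # feedback tap
        y[n] = -g * v + buf[n % delay]            # feedforward tap
        buf[n % delay] = v
    return y
```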
The author makes a distinction between two different modeling approaches:
1. Representing music as a series of notes (with additional information about dynamics, etc.), and then building a model that transforms this musical score into sound.
2. Modeling the waveform itself directly, without reference to a score. This waveform could presumably be in either the frequency or the time domain, but the author chooses to use the time domain.
The author's terminology is a bit confusing, but I think that they mean option 2 when they say "the waveform domain."
The frequency domain is a misleading name here. I assume you're referring to the STFT, or spectrogram, which is a series of windowed time segments transformed into the frequency domain.
But both you and the OP are right that the waveform domain is usually called the time domain.
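For anyone unfamiliar, that's really all the STFT is. A quick sketch with scipy, with random noise standing in for real audio and arbitrary window sizes:

```python
# The STFT / spectrogram: windowed time segments, each transformed to the
# frequency domain. Keep magnitude AND phase and the transform is invertible.
import numpy as np
from scipy.signal import stft, istft

fs = 48_000
x = np.random.randn(fs)                 # stand-in for 1 s of mono audio

f, t, Z = stft(x, fs=fs, nperseg=1024, noverlap=768)
# Z has shape (freq_bins, time_frames): frequency content per windowed segment
print(Z.shape)

_, x_rec = istft(Z, fs=fs, nperseg=1024, noverlap=768)  # back to the time domain
```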
In a nutshell, diffusion models break up the difficult task of generating natural signals (such as images or sound) into many smaller partial denoising tasks. This is done by defining a corruption process that gradually adds noise to an input until all of the signal is drowned out (this is the "diffusion"), and then learning how to invert that process step-by-step.
This is not dissimilar to how modern language models work: they break up the task of generating text into a series of easier next-word-prediction tasks. In both cases, the model only solves a small part of the problem at a time, and you apply it repeatedly to generate a signal.
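A toy sketch of that loop, in case it helps. This follows the common DDPM-style recipe rather than anything specific to this post, and `denoise_model` is a hypothetical stand-in for the trained network:

```python
# Toy diffusion sketch: a corruption process that gradually drowns the signal
# in noise, and a sampler that inverts it one small denoising step at a time.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # noise schedule (illustrative values)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def forward_diffuse(x0, t, rng):
    """Corrupt a clean signal x0 up to step t in one shot."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise, noise

def sample(denoise_model, shape, rng):
    """Start from pure noise and apply the model repeatedly to denoise."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = denoise_model(x, t)   # the model solves one small denoising task
        x = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```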
For any other DSP people tearing their hair out at the author's liberal terminology, whenever you see "wave" or "waveform" the author is talking about non-quadrature time domain samples.
I feel this work would be a lot better if it were built on a more foundational understanding of DSP before unleashing the neural nets on data that is arguably the worst possible representation of signal state. But then again, there are people training on the raw bitstream from a CCD camera instead of image sequences, so maybe hoping for an intermediate format that can be understood by humans has no place in the future!
The direct counter-argument to "worst representation" is usually "representation with the fewest assumptions", and the waveform as shown here gets close. Recording environment, equipment, how the sound actually gets digitized, etc. also come into play, but there are relatively few assumptions in the "waveform" setup described here.
I would say that in the neural network literature at large, and in audio modeling in particular, there is a continual back and forth between pushing DSP-based knowledge into neural nets (on the architecture side or the data side) and going "raw-er" to force models to learn their own versions of DSP-style transforms. It has been and will continue to be a see-saw as we try to find what works best, driven by performance on benchmarks with certain goals in mind.
These types of push-pull movements also dominate computer vision (where many of the "correct" DSP approaches fell away to less rigid, learned proxies) and language modeling (tokenization is hardly "raw", and byte-based approaches to date lag behind smart tokenization strategies), and I think every field that approaches learning from data will see similar swings over time.
CCD bitstreams are also not "raw", so people will continue to push down in representation while making bigger datasets and models, and the rollercoaster will continue.
I very much enjoy the observation that LLMs appear to function optimally when trained on "tokens" rather than the pure, unfiltered stream of characters. I think I am ultimately trying to express an analogous belief: the individual audio samples here are as meaningless as individual letters are to an LLM.
Instead of "representation with the fewest assumptions", I would suggest that the optimal input for a model may be the representation where the data is broken apart as far as it can be while still remaining meaningful. I have suggested in other replies that this is perhaps achieved with quadrature samples, or even with something like a granular decomposition: something akin to a "token" of audio instead of language.
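To make the "audio token" idea concrete, here's a purely illustrative way to slice a waveform into overlapping, windowed grains; the grain length and hop are arbitrary choices of mine, not anything from the post:

```python
# Chop a waveform into short, overlapping, windowed "grains" instead of
# feeding individual samples. Illustrative only; not a claim about any model.
import numpy as np

def to_grains(x, grain_len=2048, hop=512):
    window = np.hanning(grain_len)
    starts = range(0, len(x) - grain_len + 1, hop)
    return np.stack([x[s:s + grain_len] * window for s in starts])

x = np.random.randn(48_000)             # stand-in for 1 s of audio at 48 kHz
grains = to_grains(x)                   # shape: (num_grains, grain_len)
print(grains.shape)
```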
Aside from using "waveform" to mean "time domain", the terminology the author uses in the blog post is consistent with what I've seen in audio ML research. In what ways would you suggest improving the representation of the signal?
Are you a musician? Have you ever used a DAW like Cubase or Pro Tools? If not, have you ever tried the FOSS (GPLv3) Audacity audio editor [1]? "Waves" and "waveforms" are colloquial terminology, so the terms are familiar to anyone in the industry as well as your average hobbyist.
Additionally, PCM [2] is at the heart of many of these tools, and is what is converted between digital and analog for real-world use cases.
This is literally how the ear works [3], so before arguing that this is the "worst possible representation of signal state," try listening to the sounds around you and think about how it is that you can perceive them.
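To make the PCM point concrete, here's a tiny example that quantizes a sine tone to 16-bit PCM and writes it to a WAV; the tone and parameters are arbitrary:

```python
# PCM in a nutshell: a floating-point waveform quantized to integer samples,
# the representation most DAWs and editors operate on.
import numpy as np
import wave

fs, dur, freq = 44_100, 1.0, 440.0
t = np.arange(int(fs * dur)) / fs
x = 0.5 * np.sin(2 * np.pi * freq * t)  # floating-point waveform in [-1, 1]
pcm = (x * 32767).astype(np.int16)      # 16-bit PCM quantization

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                   # 2 bytes = 16 bits per sample
    w.setframerate(fs)
    w.writeframes(pcm.tobytes())
```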
According to your link, the ear mostly works in the frequency domain:
> Once the vibrations cause the fluid inside the cochlea to ripple, a traveling wave forms along the basilar membrane. Hair cells—sensory cells sitting on top of the basilar membrane—ride the wave. Hair cells near the wide end of the snail-shaped cochlea detect higher-pitched sounds, such as an infant crying. Those closer to the center detect lower-pitched sounds, such as a large dog barking.
Yeah, except no; the ear works by having cilia that each resonate at different frequencies, differentiated into a log-periodic type response. It is mostly a "frequency domain" mechanism, though in the real world the time component is obviously necessary to manifest frequency. If we want to debate what to call it, the closest term I might reach for from the often mislabeled vernacular of the music/production/audio world would be "grains" / granular synthesis.
WRT the waveform tool in DAWs, you should be aware that it doesn't normally work the way you may assume it does. If you start dragging points around in there, you typically are not doing raw edits to the time-domain samples; your edits are applied through a filter that tries to minimize ringing and noise. That is to say, the DAW will typically not just let you move a sample to any value you wish. In this case the tool is bending to its use as an audio editor rather than defaulting to behavior that would introduce clicks and pops every time it was used.
I stand by my argument that the author's terminology appears ignorant in an area where it ought to be very deliberately specific. I question the applicability and relevance of the work beginning at that point, even though the approach may have yielded a useful result.
I'd at least think that quadrature samples would be preferred, as they offer instantaneous phase information. I don't think there is anything to be gained by forcing the model to derive this information from the time-series data when the computation is so straightforward. Instead of a 48 kHz stream of samples you feed it a 24 kHz stream of I&Q samples; nothing to it (a rough sketch of the conversion is below).
I would draw an analogy here between NeRF and Gaussian Splatting: it's great that we can get there with a NN, but there's no reason to do that after you have figured out how to optimally compute the result you were after.
I also believe that granular synthesis is a deep well to draw from in this area of research.
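Here's that rough sketch of the real-to-I/Q conversion, assuming a mono 48 kHz numpy array; this is just the textbook analytic-signal route, not anyone's actual training pipeline:

```python
# Real 48 kHz stream -> 24 kHz complex (I & Q) stream:
# analytic signal, shift the band down to baseband, keep every other sample.
import numpy as np
from scipy.signal import hilbert

fs = 48_000
x = np.random.randn(fs)                 # stand-in for 1 s of real audio

z = hilbert(x)                          # analytic signal: positive frequencies only
n = np.arange(len(z))
z_bb = z * np.exp(-2j * np.pi * (fs / 4) * n / fs)  # shift 0..24 kHz down to -12..+12 kHz
iq = z_bb[::2]                          # decimate by 2 -> 24 kHz complex stream

i, q = iq.real, iq.imag                 # the I and Q channels to feed a model
print(i.shape, q.shape)                 # each half the length of x
```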
Why do people keep doing this? Musicians who want an accompanist/virtual producer still want control over the orchestration, tonality, and shaping of sounds. Even karaoke machines use a signal pipeline to blend the singer's voice with the backing track. Generating finished waveforms is only good for elevator music.
From avant-garde and experimental to soundtracks and commercial electronica, artists in all kinds of genres have used methods, libraries, and tools for direct generation of waveforms, whether other processing happens afterwards (manipulation, effects, and so on) or the waveform is the final result (there's also a big "generative music" scene, both academic and artistic). And that's been going on for decades. Of course, recently many have also started using AIs to produce generative music, with the API spitting out a final "waveform".
>Even karaoke machines use a signal pipeline to blend the singer's voice with the backing track. Generating finished waveforms is only good for elevator music.
Perhaps you have the kind of music played at the Grand Ole Opry or something in mind.
Here are some trivial ways to use generated finished waveforms, sticking with the AI case alone:
- take the AI final result, sample it, and use it as you would loops from records or something like Splice.
- train the AI yourself, set parameters, tweak it, and the result is generative music you've produced (a genre that has existed since at least the 60s, and is quite the opposite of "elevator music")
- use the generated music as a soundtrack for your film or video or video game
On the loops/sampling front: I always thought RAVE [0][1][2] was a very interesting approach that really embraces latent spaces and sample/stretch-type approaches in the waveform space.
Research into "pure" unconditional generation can often lead to gains in the conditional setting. See literally any GAN research, VQ-VAE, VAE, diffusion, etc - all started from "unconditional/low information" pretty much. Both directly (in terms of modeling) and indirectly (by forcing you to really reason about what conditioning is telling you about the modeling, and what's in the data), these approaches really force you to think about what it means to just "make music".
Also, I think artistic uses (such as Dadabots, who heavily used SampleRNN) show clearly that "musicians" like interesting tools, even if uncontrolled in some cases. Tools to exactly execute an idea are important (DAW-like), but so are novelty generating machines like (many) unconditional generators end up being. Jukebox is another nice example of this.
On the "good for elevator music" comment - the stuff I've heard from these models is rarely relaxing enough to be in any elevator I would ride. But there are snippets of inspiration in there for sure.
Generally, I do favor controllable models with lots of input knobs and conditioning for direct use, but there's space for many different approaches in pushing the research forward.
Different creators will work all kind of odd models into their workflows, even things that are objectively less "high quality", and not really controllable. To me, that's a great thing and reason enough to keep pushing unsupervised learning forward.
- Stable Audio: https://stability-ai.github.io/stable-audio-demo/ https://www.stableaudio.com/
- MusicGen: https://ai.honu.io/papers/musicgen/
- MusicLM: https://google-research.github.io/seanet/musiclm/examples/