Stable Audio and MusicGen sound better than Jukebox.
But the best so far is Suno.ai (https://app.suno.ai), especially with their V3 model. The results are very impressive; the fidelity isn't studio quality, but they're getting very close.
It's very likely based on the TTS model they released before (Bark), but trained on more data and at higher resolution.
I tried a dozen different prompts in Suno AI to generate some music - it completely ignored them. It just generated some simple pop-sounding tunes every time. I'm not impressed.
The lyrics on those songs are basically a pastiche of everything popular. But just based on sound it's pretty convincing. I bet you could train a generative AI to push out trance bangers as long as they don't have any lyrics.
I mean, the user wrote the lyrics (or maybe ChatGPT, lol). People of course write more interesting ones (just found a Ukrainian one about Kyiv: https://sonauto.ai/songs/xRDqe57ZgT6QrFXiIzF0). There was another one about Frank Sinatra being attacked with a baseball bat, but I can't find it right now.
I just tried Suno and the results to me are terrible.
It seems designed for making pop music no one will listen to.
I have spent many hours with MusicLM making wild experimental music no one will listen to.
MusicLM has no problem making really weird sound combinations.
I just gave Suno AI some of my saved MusicLM prompts and the results are garbage. The problem with the AI Test Kitchen model, though, is that the results sound like they're in mono.
The ultimate for me will be when we can make rap/hiphop no one will listen to.
The two piano examples, where the second one has its phase randomized, are also an excellent illustration of why allpass filters (which change the phase but not the amplitude of each frequency) are a building block for digital reverbs. The second piano example with the randomized phases sounds more blurred out and almost reverb-y.
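If you want to hear what that phase trick does yourself, here's a minimal sketch, assuming a mono float numpy array `x`; the Schroeder allpass at the end is the classic reverb building block, and the parameter values are just illustrative:

```python
# Randomize the phase of a signal while keeping its magnitude spectrum
# intact (what the "second piano" example does). Illustrative sketch only.
import numpy as np

def randomize_phase(x, seed=0):
    rng = np.random.default_rng(seed)
    X = np.fft.rfft(x)                            # complex spectrum
    mag = np.abs(X)                               # keep amplitudes
    phase = rng.uniform(-np.pi, np.pi, X.shape)   # throw away the original phases
    phase[0] = 0.0                                # keep the DC bin real
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(x))

# A Schroeder allpass does something related but gentler: unit gain at every
# frequency, frequency-dependent phase/delay. Chains of these blur transients
# into reverb-like tails.
def schroeder_allpass(x, delay=441, g=0.7):
    y = np.zeros_like(x)
    buf = np.zeros(delay)                         # circular delay line
    for n in range(len(x)):
        v = x[n] + g * buf[n % delay]             # feedback tap
        y[n] = -g * v + buf[n % delay]            # feedforward tap
        buf[n % delay] = v
    return y
```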
The author makes a distinction between two different modeling approaches:
1. Representing music as a series of notes (with additional information about dynamics, etc.), and then building a model that transforms this musical score into sound.
2. Modeling the waveform itself directly, without reference to a score. This waveform could presumably be in either the frequency or the time domain, but the author chooses to use the time domain.
The author's terminology is a bit confusing, but I think that they mean option 2 when they say "the waveform domain."
The frequency domain is a misleading name here. I assume you're referring to the STFT, or spectrogram, which is a series of windowed time segments transformed into the frequency domain.
But both you and the OP are right that the waveform domain is usually called the time domain.
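For anyone unfamiliar, that's really all the STFT is. A quick sketch with scipy, with random noise standing in for real audio and arbitrary window sizes:

```python
# The STFT / spectrogram: windowed time segments, each transformed to the
# frequency domain. Keep magnitude AND phase and the transform is invertible.
import numpy as np
from scipy.signal import stft, istft

fs = 48_000
x = np.random.randn(fs)                 # stand-in for 1 s of mono audio

f, t, Z = stft(x, fs=fs, nperseg=1024, noverlap=768)
# Z has shape (freq_bins, time_frames): frequency content per windowed segment
print(Z.shape)

_, x_rec = istft(Z, fs=fs, nperseg=1024, noverlap=768)  # back to the time domain
```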
In a nutshell, diffusion models break up the difficult task of generating natural signals (such as images or sound) into many smaller partial denoising tasks. This is done by defining a corruption process that gradually adds noise to an input until all of the signal is drowned out (this is the "diffusion"), and then learning how to invert that process step-by-step.
This is not dissimilar to how modern language models work: they break up the task of generating text into a series of easier next-word-prediction tasks. In both cases, the model only solves a small part of the problem at a time, and you apply it repeatedly to generate a signal.
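A toy sketch of that loop, in case it helps. This follows the common DDPM-style recipe rather than anything specific to this post, and `denoise_model` is a hypothetical stand-in for the trained network:

```python
# Toy diffusion sketch: a corruption process that gradually drowns the signal
# in noise, and a sampler that inverts it one small denoising step at a time.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # noise schedule (illustrative values)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def forward_diffuse(x0, t, rng):
    """Corrupt a clean signal x0 up to step t in one shot."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise, noise

def sample(denoise_model, shape, rng):
    """Start from pure noise and apply the model repeatedly to denoise."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = denoise_model(x, t)   # the model solves one small denoising task
        x = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```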
For any other DSP people tearing their hair out at the author's liberal terminology, whenever you see "wave" or "waveform" the author is talking about non-quadrature time domain samples.
I feel this work would be a lot better if it were built on a more foundational understanding of DSP before unleashing the neural nets on data that is arguably the worst possible representation of signal state. But then again, there are people training on the raw bitstream from a CCD camera instead of image sequences, so maybe hoping for an intermediate format that can be understood by humans has no place in the future!
The direct counter-argument to "worst representation" is usually "representation with the fewest assumptions", and the waveform as shown here gets close. Recording environment, equipment, how the sound actually gets digitized, etc. also come into play, but there are relatively few assumptions in the "waveform" setup described here.
I would say that in the neural network literature at large, and in audio modeling in particular, there is a continual back and forth between pushing DSP-based knowledge into neural nets (on the architecture side or the data side) and going "raw-er" to force models to learn their own versions of DSP-style transforms. It has been and will continue to be a see-saw as we try to find what works best, driven by performance on benchmarks with certain goals in mind.
These types of push-pull movements also dominate computer vision (where many of the "correct" DSP approaches fell away to less rigid, learned proxies) and language modeling (tokenization is hardly "raw", and byte-based approaches to date lag behind smart tokenization strategies), and I think every field that approaches learning from data will see similar swings over time.
CCD bitstreams are also not "raw", so people will continue to push down in representation while making bigger datasets and models, and the rollercoaster will continue.
I very much enjoy the observation that LLMs appear to function optimally when trained on "tokens" rather than the pure, unfiltered stream of characters. I think I am ultimately trying to express an analogous belief: the individual audio samples here are as meaningless as individual letters are to an LLM.
Instead of "representation with the fewest assumptions", I would suggest that the optimal input for a model may be the representation where the data is broken apart as far as it can be while still remaining meaningful. I have suggested in other replies that this is perhaps achieved with quadrature samples, or even with something like a granular decomposition: something akin to a "token" of audio instead of language.
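To make the "audio token" idea concrete, here's a purely illustrative way to slice a waveform into overlapping, windowed grains; the grain length and hop are arbitrary choices of mine, not anything from the post:

```python
# Chop a waveform into short, overlapping, windowed "grains" instead of
# feeding individual samples. Illustrative only; not a claim about any model.
import numpy as np

def to_grains(x, grain_len=2048, hop=512):
    window = np.hanning(grain_len)
    starts = range(0, len(x) - grain_len + 1, hop)
    return np.stack([x[s:s + grain_len] * window for s in starts])

x = np.random.randn(48_000)             # stand-in for 1 s of audio at 48 kHz
grains = to_grains(x)                   # shape: (num_grains, grain_len)
print(grains.shape)
```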
Aside from using "waveform" to mean "time domain", the terminology the author uses in the blog post is consistent with what I've seen in audio ML research. In what ways would you suggest improving the representation of the signal?
Are you a musician? Have you ever used a DAW like Cubase or Pro Tools? If not, have you ever tried the FOSS (GPLv3) Audacity audio editor [1]? "Waves" and "waveforms" are colloquial terminology, so the terms are familiar to anyone in the industry as well as your average hobbyist.
Additionally, PCM [2] is at the heart of many of these tools, and is what is converted between digital and analog for real-world use cases.
This is literally how the ear works [3], so before arguing that this is the "worst possible representation of signal state," try listening to the sounds around you and think about how it is that you can perceive them.
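To make the PCM point concrete, here's a tiny example that quantizes a sine tone to 16-bit PCM and writes it to a WAV; the tone and parameters are arbitrary:

```python
# PCM in a nutshell: a floating-point waveform quantized to integer samples,
# the representation most DAWs and editors operate on.
import numpy as np
import wave

fs, dur, freq = 44_100, 1.0, 440.0
t = np.arange(int(fs * dur)) / fs
x = 0.5 * np.sin(2 * np.pi * freq * t)  # floating-point waveform in [-1, 1]
pcm = (x * 32767).astype(np.int16)      # 16-bit PCM quantization

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                   # 2 bytes = 16 bits per sample
    w.setframerate(fs)
    w.writeframes(pcm.tobytes())
```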
According to your link, the ear mostly works in the frequency domain:
> Once the vibrations cause the fluid inside the cochlea to ripple, a traveling wave forms along the basilar membrane. Hair cells—sensory cells sitting on top of the basilar membrane—ride the wave. Hair cells near the wide end of the snail-shaped cochlea detect higher-pitched sounds, such as an infant crying. Those closer to the center detect lower-pitched sounds, such as a large dog barking.
Yeah, except no; the ear works by having cilia that each resonate at different frequencies, differentiated into a log-periodic type response. It is mostly a "frequency domain" mechanism, though in the real world the time component is obviously necessary to manifest frequency. If we want to debate what to call it, the closest term I might reach for from the often mislabeled vernacular of the music/production/audio world would be "grains" / granular synthesis.
WRT the waveform tool in DAWs, you should be aware that it doesn't normally work the way you may assume it does. If you start dragging points around in there, you typically are not doing raw edits to the time-domain samples; your edits are applied through a filter that tries to minimize ringing and noise. That is to say, the DAW will typically not just let you move a sample to any value you wish. In this case the tool is bending to its use as an audio editor rather than defaulting to behavior that would introduce clicks and pops every time it was used.
I stand by my argument that the author's terminology appears ignorant in an area where it ought to be very deliberately specific. I question the applicability and relevance of the work beginning at that point, even though the approach may have yielded a useful result.
I'd at least think that quadrature samples would be preferred, as they offer instantaneous phase information. I don't think there is anything to be gained by forcing the model to derive this information from the time-series data when the computation is so straightforward. Instead of a 48 kHz stream of samples you feed it a 24 kHz stream of I&Q samples; nothing to it (a rough sketch of the conversion is below).
I would draw an analogy here between NeRF and Gaussian Splatting: it's great that we can get there with a NN, but there's no reason to do that after you have figured out how to optimally compute the result you were after.
I also believe that granular synthesis is a deep well to draw from in this area of research.
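Here's that rough sketch of the real-to-I/Q conversion, assuming a mono 48 kHz numpy array; this is just the textbook analytic-signal route, not anyone's actual training pipeline:

```python
# Real 48 kHz stream -> 24 kHz complex (I & Q) stream:
# analytic signal, shift the band down to baseband, keep every other sample.
import numpy as np
from scipy.signal import hilbert

fs = 48_000
x = np.random.randn(fs)                 # stand-in for 1 s of real audio

z = hilbert(x)                          # analytic signal: positive frequencies only
n = np.arange(len(z))
z_bb = z * np.exp(-2j * np.pi * (fs / 4) * n / fs)  # shift 0..24 kHz down to -12..+12 kHz
iq = z_bb[::2]                          # decimate by 2 -> 24 kHz complex stream

i, q = iq.real, iq.imag                 # the I and Q channels to feed a model
print(i.shape, q.shape)                 # each half the length of x
```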
Why do people keep doing this? Musicians who want an accompanist/virtual producer still want control over the orchestration, tonality, and shaping of sounds. Even karaoke machines use a signal pipeline to blend the singer's voice with the backing track. Generating finished waveforms is only good for elevator music.
From avant-garde and experimental to soundtracks and commercial electronica, artists in all kinds of genres have used methods, libraries, and tools for direct generation of waveforms, whether other processing happens afterwards (manipulation, effects, and so on) or the waveform is the final result (there's also a big "generative music" scene, both academic and artistic). And that's been going on for decades. Of course, recently many have also started using AIs to produce generative music, with the API spitting out a final "waveform".
>Even karaoke machines use a signal pipeline to blend the singer's voice with the backing track. Generating finished waveforms is only good for elevator music.
Perhaps you have the kind of music played at the Grand Ole Opry or something in mind.
Here are some trivial ways to use generated finished waveforms, sticking with the AI case alone:
- take the AI final result, sample it, and use it as you would loops from records or something like Splice.
- train the AI yourself, set parameters, tweak it, and the result is generative music you've produced (a genre that has existed since at least the 60s, and is quite the opposite of "elevator music")
- use the generated music as a soundtrack for your film or video or video game
On the loops/sampling front: I always thought RAVE [0][1][2] was a very interesting approach that really embraces latent spaces and sample/stretch-type approaches in the waveform space.
Research into "pure" unconditional generation can often lead to gains in the conditional setting. See literally any GAN research, VQ-VAE, VAE, diffusion, etc - all started from "unconditional/low information" pretty much. Both directly (in terms of modeling) and indirectly (by forcing you to really reason about what conditioning is telling you about the modeling, and what's in the data), these approaches really force you to think about what it means to just "make music".
Also, I think artistic uses (such as Dadabots, who heavily used SampleRNN) show clearly that "musicians" like interesting tools, even if uncontrolled in some cases. Tools to exactly execute an idea are important (DAW-like), but so are novelty generating machines like (many) unconditional generators end up being. Jukebox is another nice example of this.
On the "good for elevator music" comment - the stuff I've heard from these models is rarely relaxing enough to be in any elevator I would ride. But there are snippets of inspiration in there for sure.
Generally, I do favor controllable models with lots of input knobs and conditioning for direct use, but there's space for many different approaches in pushing the research forward.
Different creators will work all kind of odd models into their workflows, even things that are objectively less "high quality", and not really controllable. To me, that's a great thing and reason enough to keep pushing unsupervised learning forward.
- Stable Audio: https://stability-ai.github.io/stable-audio-demo/ https://www.stableaudio.com/
- MusicGen: https://ai.honu.io/papers/musicgen/
- MusicLM: https://google-research.github.io/seanet/musiclm/examples/