For any other DSP people tearing their hair out at the author's liberal terminology, whenever you see "wave" or "waveform" the author is talking about non-quadrature time domain samples.

I feel this work would be a lot better if it were built around a more foundational understanding of DSP before unleashing the neural nets on data that is arguably the worst possible representation of signal state. But then again, there are people training on the raw bitstream from a CCD camera instead of image sequences, so maybe hoping for an intermediate format that can be understood by humans has no place in the future!



The direct counter-argument to "worst representation" is usually "representation with the fewest assumptions", and the waveform setup described here gets close to that. Recording environment, equipment, how the sound actually gets digitized, etc. still come into play, but relatively few assumptions are baked in.

I would say that in the neural network literature at large, and in audio modeling in particular, there is a continual back and forth between pushing DSP-based knowledge into neural nets (on the architecture side or the data side) and going "raw-er" to force models to learn their own versions of DSP-style transforms. It has been, and will continue to be, a see-saw as we try to find what works best, driven by performance on benchmarks with particular goals in mind.

These push-pull movements also dominate computer vision (where many of the "correct" DSP approaches fell away to less rigid, learned proxies) and language modeling (tokenization is hardly "raw", and byte-based approaches to date lag behind smart tokenization strategies), and I think every field that approaches learning from data will see similar swings over time.

CCD bitstreams are also not "raw", so people will continue to push down in representation while making bigger datasets and models, and the rollercoaster will continue.


Yours is the best response to my comment so far.

I very much enjoy the observation that LLMs appear to function optimally when trained on "tokens" and not the pure unfiltered stream of characters. I think I am ultimately attempting to express an analogous belief: the individual audio samples here are as meaningless as individual letters are to an LLM.

Instead of "representation with the fewest assumptions" I would maybe suggest that the optimal input for a model may be the representation where the data is broken apart as far as it can be while still remaining meaningful. I have suggested in other replies that this is perhaps achieved with quadrature samples or even perhaps with something such as a granular decomposition -- something akin to a "token" of audio instead of language.


Aside from using "waveform" to mean "time domain", the terminology the author used in the blog post is consistent with what I've seen in audio ML research. In what ways would you suggest improving the representation of the signal?


Are you a musician? Have you ever used a DAW like Cubase or Pro Tools? If not, have you ever tried the FOSS (GPLv3) Audacity audio editor [1]? "Wave" and "waveform" are colloquial terminology, so the terms are familiar to anyone in the industry as well as to the average hobbyist.

Additionally, PCM [2] is at the heart of many of these tools, and is what is converted between digital and analog for real-world use cases.
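
For anyone unfamiliar, PCM is just uniformly quantized time domain samples. A minimal sketch, assuming 16-bit depth purely as an example:

  import numpy as np

  def pcm16(x):
      # Uniform 16-bit PCM: clip a [-1, 1] float signal and scale to integers,
      # roughly what a DAW stores and a DAC converts back to a voltage.
      return (np.clip(x, -1.0, 1.0) * 32767).round().astype(np.int16)

  t = np.arange(48_000) / 48_000
  samples = pcm16(np.sin(2 * np.pi * 440 * t))   # one second of a 440 Hz tone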

This is literally how the ear works [3], so before arguing that this is the "worst possible representation of signal state," try listening to the sounds around you and think about how it is that you can perceive them.

[1] https://manual.audacityteam.org/man/audacity_waveform.html

[2] https://en.wikipedia.org/wiki/Pulse-code_modulation

[3] https://www.nidcd.nih.gov/health/how-do-we-hear


According to your link, the ear mostly works in the frequency domain:

  Once the vibrations cause the fluid inside the cochlea to ripple, a traveling wave forms along the basilar membrane. Hair cells—sensory cells sitting on top of the basilar membrane—ride the wave. Hair cells near the wide end of the snail-shaped cochlea detect higher-pitched sounds, such as an infant crying. Those closer to the center detect lower-pitched sounds, such as a large dog barking.
It's really far from PCM.


Yeah, except no; the ear works by having cilia that each resonate at different frequencies, differentiated into a log-periodic type of response. It is mostly a "frequency domain" mechanism, though in the real world the time component is obviously necessary to manifest frequency. If we want to debate what best to call it, the closest term I might reach for from the often mislabeled vernacular of the music/production/audio world would be "grains" / granular synthesis.
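
If you want a computational analogy, something like a log-spaced bank of resonators is much closer to what the cochlea does than a raw sample stream. A sketch (my own toy example, not from the article; the band count and half-octave widths are arbitrary):

  import numpy as np
  from scipy.signal import butter, sosfilt

  def log_spaced_bank(x, fs=48_000, n_bands=32, f_lo=50.0, f_hi=16_000.0):
      # Each band stands in for hair cells tuned to one region of the
      # basilar membrane; returns log-spaced center frequencies and
      # per-band envelopes.
      centers = np.geomspace(f_lo, f_hi, n_bands)
      bands = []
      for fc in centers:
          sos = butter(2, (fc / 2**0.25, fc * 2**0.25),
                       btype="bandpass", fs=fs, output="sos")
          bands.append(np.abs(sosfilt(sos, x)))
      return centers, np.stack(bands)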

WRT the waveform tool in DAWs, you should be aware that it doesn't normally work the way you may assume it does. If you start dragging points around in there, you typically are not doing raw edits to the time domain samples; your edits are applied through a filter that tries to minimize ringing and noise. That is to say, the DAW will typically not just let you move a sample to any value you wish. In this case the tool is bending to its use as an audio editor rather than defaulting to behavior that would otherwise introduce clicks and pops every time it was used.
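
As a rough illustration of the kind of smoothing I mean (a made-up example, not any particular DAW's algorithm), an edit gets blended over a small neighbourhood instead of stepping a single sample:

  import numpy as np

  def soft_edit(x, idx, new_value, radius=32):
      # Instead of jumping one sample to a new value (an audible click),
      # blend a raised-cosine bump into the neighbourhood around the edit.
      # The radius is a placeholder; real editors use their own filters.
      y = np.array(x, dtype=float)
      delta = new_value - y[idx]
      lo, hi = max(0, idx - radius), min(len(y), idx + radius + 1)
      n = np.arange(lo, hi)
      bump = 0.5 * (1 + np.cos(np.pi * (n - idx) / radius))
      y[lo:hi] += delta * bump
      return y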

I stand by my argument that the author's terminology appears ignorant in an area where it ought to be very deliberately specific. I question the applicability and relevance of the work beginning at that point, even though the approach may have yielded a useful result.


What signal representation would you prefer they use? Waveform-based models became popular generally _after_ STFT-based models.


I'd at least think that quadrature samples would be preferred, as they offer instantaneous phase information. I don't think there is anything to be gained by forcing the model to derive this information from the time series data when the computation is so straightforward. Instead of a 48 kHz stream of samples you feed it a 24 kHz stream of I&Q samples; nothing to it.
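
The conversion really is that straightforward. A sketch using the analytic signal via the Hilbert transform (scipy is just one way to do it):

  import numpy as np
  from scipy.signal import hilbert

  fs = 48_000
  t = np.arange(fs) / fs
  x = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone

  analytic = hilbert(x)             # x + j * H{x}, one-sided spectrum
  iq = analytic[::2]                # 48 kHz real -> 24 kHz complex I&Q
  i, q = iq.real, iq.imag
  phase = np.angle(iq)              # instantaneous phase comes for free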

I would draw an analogy here between NeRF and Gaussian Splatting -- it's great that we can get there with a NN, but there's no reason to do that once you have figured out how to optimally compute the result you were after.

I also believe that granular synthesis is a deep well to draw from in this area of research.



