Sorry, I just don't believe this generalizes in any meaningful sense for arbitrary data.
You cannot determine frequencies from raw audio PCM data. If you want to build a vector database of audio, with frequency content as one of the features, you will at the very least have to arrange for a transform to the frequency domain. Unless you claim that a system is somehow capable of discovering Fourier's theorem and implementing the transform for itself, this is a necessary precursor to any system being able to embed using a vector that includes frequency considerations.
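To make that concrete, here is a minimal sketch of what that human-arranged precursor looks like, assuming mono PCM samples in a numpy array and using scipy's STFT; the downstream embedder is hypothetical:

```python
# Human-engineered precursor: transform PCM samples to the frequency
# domain before any embedding model ever sees them.
import numpy as np
from scipy.signal import stft

def pcm_to_spectrogram(pcm: np.ndarray, sample_rate: int = 44100) -> np.ndarray:
    """Short-time Fourier transform of raw mono PCM samples in [-1, 1].

    The window size, hop length, and the transform itself are all human
    decisions made *before* anything is trained.
    """
    _, _, Z = stft(pcm, fs=sample_rate, nperseg=2048, noverlap=1536)
    return np.abs(Z)  # magnitude spectrogram: (freq_bins, time_frames)

# The (hypothetical) embedder only ever sees frequency-domain features:
# embedding = some_audio_embedder(pcm_to_spectrogram(samples))
```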
But ... that's a human decision, because humans think that frequencies are important to their experience of music. A person who is totally deaf, however, and thus has extremely limited frequency perception, can often still detect rhythmic structure through bone conduction. Such a person, if interested in similarity analysis of audio, would have no reason to perform a domain transform, and would be more interested in timing correlations. That analysis could probably be fully automated into various models, as long as someone remembers to make the system time-aware, which is, again, just another particular human judgement about which aspects of the audio have significance.
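A sketch of the purely time-domain analysis such a person might automate instead, again assuming mono PCM in a numpy array and a clip at least a few seconds long; the smoothing window, decimation factor, and tempo bounds are arbitrary choices for illustration:

```python
# Pure time-domain rhythm analysis: no Fourier transform anywhere.
import numpy as np

def estimate_beat_period(pcm: np.ndarray, sample_rate: int = 44100) -> float:
    """Estimate the dominant beat period (seconds) from timing alone."""
    # Amplitude envelope: rectify, then smooth with a short moving average.
    env = np.convolve(np.abs(pcm), np.ones(1024) / 1024, mode="same")
    # Decimate the envelope so the autocorrelation stays cheap.
    hop = 256
    env = env[::hop] - env[::hop].mean()
    env_rate = sample_rate / hop
    # Autocorrelate and pick the strongest lag in a plausible tempo range.
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    lo = int(env_rate * 60 / 200)   # fastest beat considered (200 BPM)
    hi = int(env_rate * 60 / 40)    # slowest beat considered (40 BPM)
    best_lag = lo + int(np.argmax(ac[lo:hi]))
    return best_lag / env_rate
```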
I just read the E5 Mistral paper. I don't see anything that contradicts my point, which wasn't about the need for human labelling, but about the need for human identification of significant features. In the case of text, because of the way languages evolve, we know that a semantic-free, character-based analysis will likely bump into lots of interesting syntactic and semantic features. For arbitrary data (images, sound, air pressure, temperature), there is no such pre-existing reason to treat the data in any particular way.
Consider, for example, if the "true meaning" of text were encoded in a somewhat Kabbalah-esque scheme, in which far-distant words and even phonemes create tangled loops of reference and meaning. Even a system like E5 Mistral isn't going to find that, because that's absolutely not how we consider language to work, and so it's not part of the fundamentals of how even E5 Mistral operates.
Feeding a model inputs in the frequency domain isn’t required for it to understand frequencies in audio.
A large enough system with sufficient training data would definitely be able to come up with a Fourier transform (or something resembling one), if encoding it helped the loss go down.
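One reason to think so: the DFT is just a fixed linear map over the samples, so a single learned weight matrix is already enough to represent it exactly. A quick numpy check:

```python
# The DFT is a linear map: one weight matrix applied to raw samples.
# Anything that can learn an NxN matrix can, in principle, learn this.
import numpy as np

N = 64
n = np.arange(N)
# DFT matrix: W[k, t] = exp(-2j * pi * k * t / N)
W = np.exp(-2j * np.pi * np.outer(n, n) / N)

x = np.random.randn(N)                     # raw "PCM" samples
assert np.allclose(W @ x, np.fft.fft(x))   # identical to the FFT
```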
> In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
Today’s diffusion models learn representations from raw pixels, without even the concept of convolutions.
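For instance, transformer-based diffusion models (e.g. DiT) ingest images as flat patch tokens. A rough numpy sketch of that patchify step, with the patch size and projection width picked arbitrarily:

```python
# Patchify: raw pixels -> token sequence using only a reshape and a
# linear projection, no convolution anywhere.
import numpy as np

def patchify(img: np.ndarray, patch: int = 8) -> np.ndarray:
    """img: (H, W, C) raw pixels -> (num_patches, patch*patch*C) tokens."""
    H, W, C = img.shape
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)           # (H/p, W/p, p, p, C)
    return img.reshape(-1, patch * patch * C)    # flatten each patch

img = np.random.rand(64, 64, 3)                  # raw pixels
tokens = patchify(img)                           # (64, 192)
proj = np.random.randn(8 * 8 * 3, 256) * 0.02    # learned linear embed
x = tokens @ proj                                # transformer input
```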
Ditto for language - as long as the architecture is 1) capable of modeling long-range dependencies and 2) can be scaled reasonably, whether you pass in tokens, individual characters, or raw ASCII bytes is irrelevant. Character-based models perform just as well as (or better than) token/word-level models at a given parameter count and training corpus size - the main reason they aren’t common (yet) is memory limitations, not anything fundamental.
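For illustration, byte-level input removes vocabulary engineering entirely; the alphabet is fixed at 256 symbols and the mapping is lossless:

```python
# Byte-level "tokenization": no vocabulary, no human feature choices.
# Every UTF-8 string maps to integers 0-255, and that's the whole scheme.
text = "Byte-level models read this as-is: no BPE merges, no word list."
byte_ids = list(text.encode("utf-8"))            # model input: raw byte IDs
print(byte_ids[:8])                              # e.g. [66, 121, 116, 101, ...]
assert bytes(byte_ids).decode("utf-8") == text   # lossless round trip
```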