It was really hard to do the first time. :) I'm honored to have been part of the first team to do any viable acoustic music recognition, in 2001 (much earlier than Shazam, a point of pride of course[0]).
You're dead on that it's pretty difficult if you don't benefit from others' work; we did a ton of work that in retrospect wasn't necessary. I liked the advanced psychoacoustic model, faithfully implemented in highly performant C straight from Zwicker's Psychoacoustics. To a first approximation: run the model ~10 times per second -> PCA -> keep the top 16 dimensions -> VQ, and the resulting bytes contain more than 50% of the entropy (!!). Shove all of those into a home-grown what-you'd-now-call-a vector DB, do dozens of range queries, and look for any song common to multiple results (rough sketch below). Boom, music recognition. Understandable in retrospect, but things like that aren't Everest; they're more like... multiple unclimbed mountains.
0. And far too early to have any applications. Company existed 2000-2001 \o/
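To make the pipeline concrete, here's a rough Python sketch of its shape, not the original system: the real front end was Zwicker's loudness model in hand-tuned C, which I'm standing in for with crude Bark-band energies, and the home-grown vector DB is stood in for by a scipy KD-tree. Names like `bark_features`, `build_index`, `recognize`, the sample rate, and the query radius are all illustrative assumptions, as are the sklearn PCA/KMeans stages.

```python
# Rough sketch of the pipeline described above; NOT the original system.
# The psychoacoustic front end (Zwicker's model in C) is stood in for by
# crude Bark-band energies, and the home-grown vector DB by a scipy KD-tree.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

SR = 44100           # sample rate (assumption)
FRAME = SR // 10     # ~10 feature vectors per second, as described above
N_BARK = 24          # Bark bands in Zwicker's critical-band scale
BARK_EDGES_HZ = np.array([20, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                          1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700,
                          4400, 5300, 6400, 7700, 9500, 12000, 15500])

def bark_features(signal):
    """Crude stand-in for the psychoacoustic model: per-frame log energy
    in each Bark band, roughly 10 frames per second."""
    n_frames = len(signal) // FRAME
    freqs = np.fft.rfftfreq(FRAME, d=1.0 / SR)
    feats = np.empty((n_frames, N_BARK))
    for i in range(n_frames):
        spec = np.abs(np.fft.rfft(signal[i * FRAME:(i + 1) * FRAME])) ** 2
        for b in range(N_BARK):
            band = (freqs >= BARK_EDGES_HZ[b]) & (freqs < BARK_EDGES_HZ[b + 1])
            feats[i, b] = np.log1p(spec[band].sum())
    return feats

def build_index(catalog):
    """catalog: {song_id: 1-D numpy array of samples}."""
    all_feats, owners = [], []
    for song_id, samples in catalog.items():
        f = bark_features(samples)
        all_feats.append(f)
        owners.extend([song_id] * len(f))
    X = np.vstack(all_feats)
    pca = PCA(n_components=16).fit(X)              # keep the top 16 dimensions
    Z = pca.transform(X)
    vq = KMeans(n_clusters=256, n_init=10).fit(Z)  # VQ to 1-byte codes
    # The "vector DB": quantized 16-dim points in a KD-tree for range queries.
    tree = cKDTree(vq.cluster_centers_[vq.labels_])
    return pca, vq, tree, np.array(owners)

def recognize(clip, pca, vq, tree, owners, radius=2.0):
    """Dozens of range queries (one per clip frame); the song that keeps
    showing up near the query points wins."""
    Zq = pca.transform(bark_features(clip))
    Zq = vq.cluster_centers_[vq.predict(Zq)]
    votes = {}
    for idxs in tree.query_ball_point(Zq, r=radius):
        for song_id in set(owners[idxs]):
            votes[song_id] = votes.get(song_id, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

The part the sketch tries to capture is the voting step: any single 16-dim point is ambiguous on its own, but a song that keeps turning up across dozens of range queries is very likely the match.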