You would think getting the notes and converting them to chords and tablature would be one of those "exactly suited for neural nets" types of problems.
If you want the chords to "Here comes the sun," you can find dozens of hits, but try something slightly obscure and they are hard to come by. (People with great ears have no idea what I am talking about.)
I've worked on this problem for some time on a personal project, and I'm pretty convinced you can basically solve this problem without deep learning or AI techniques, and instead use non-negative matrix factorization[0] as a bank of note templates (from their spectrograms). I have a fairly well working proof of concept and the approach is supported by the literature.
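For anyone who wants to play with the idea, here's a minimal sketch of that kind of template decomposition (my own illustration, not the parent's code; the note_templates matrix is an assumed input you'd build from recordings of isolated notes):

```python
import numpy as np
from scipy.optimize import nnls

def decompose_frame(frame, note_templates):
    """Estimate how strongly each note template is present in one magnitude-
    spectrum frame, via non-negative least squares.

    frame:          (n_freq_bins,) magnitude spectrum of one audio frame
    note_templates: (n_freq_bins, n_notes) one column per known note, e.g.
                    averaged spectrogram frames of isolated guitar notes
    """
    activations, _residual = nnls(note_templates, frame)
    return activations  # (n_notes,), all >= 0

def active_notes(frame, note_templates, rel_threshold=0.1):
    """Indices of notes whose activation exceeds a fraction of the strongest one."""
    acts = decompose_frame(frame, note_templates)
    if acts.max() == 0:
        return np.array([], dtype=int)
    return np.flatnonzero(acts > rel_threshold * acts.max())
```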
As an amateur musician I see this as a missing holy grail killer app. I'd love to have it just to pull the chords out of some of my own old recordings that I can't figure out how to repeat.
If anyone knows of any apps (even prototypes) that can do this, please provide links.
There is a reference describing the Chordino method in the page linked above, but roughly it's not too far from the description of the parent poster - a non-negative least-squares method produces a frame-by-frame semitone-scaled decomposition which is then matched against templates and turned into a chord sequence using a hidden Markov model. Some sort of intelligent smoothing like this is certainly needed, the raw template matching is not especially useful on its own.
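For a rough feel of what that smoothing stage does, here's a toy Viterbi decode over per-frame chord-template scores (an illustrative sketch only; Chordino's actual transition model is more sophisticated than a flat switching penalty):

```python
import numpy as np

def viterbi_smooth(frame_scores, switch_penalty=5.0):
    """Pick one chord per frame from per-frame template scores, discouraging
    rapid chord changes.

    frame_scores:   (n_frames, n_chords) log-domain match scores of each chord
                    template against each frame
    switch_penalty: cost of changing chord between consecutive frames (a crude
                    stand-in for real HMM transition probabilities)
    """
    n_frames, n_chords = frame_scores.shape
    best = frame_scores[0].copy()
    backptr = np.zeros((n_frames, n_chords), dtype=int)
    for t in range(1, n_frames):
        # trans[j, i]: score of being on chord i at t-1 and moving to chord j
        trans = best[None, :] - switch_penalty * (1.0 - np.eye(n_chords))
        backptr[t] = trans.argmax(axis=1)
        best = frame_scores[t] + trans.max(axis=1)
    path = np.empty(n_frames, dtype=int)
    path[-1] = best.argmax()
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path  # chord index per frame
```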
This type of method is now routinely outperformed by neural networks (see e.g. these MIREX evaluation results from 2018 which compare Chordino with a few other academic methods https://www.music-ir.org/mirex/wiki/2018:Audio_Chord_Estimat...) but I would suggest that it's still good enough to be useful - and I encourage the parent to continue their work, as it's an interesting area to explore.
Success! Per your suggestion, I used the Chordino plugin with Sonic Visualiser. I pumped in one of my old recordings and it showed me "B6" and "Dmaj7".
However, this was only so helpful. It would have been better if it showed me a fretboard and lit up the active notes. Instead it just showed me e.g. "B6", which unfortunately has multiple implementations, so I had to try many of them across many capo positions, and it ended up being a "B6" that I don't even see documented in any chord guides, probably because I was using a capo. I was eventually able to find it by guess and check: randomly moving my fingers and the capo around the fretboard, then if it sounded close, looking it up in a reverse chord finder (e.g. oolimo.de) to make sure it was actually "B6". Still pretty painful for amateur me.
Perhaps the fact that a given chord has multiple implementations makes it impossible for the analyzer to know which one I'm using, but in my case I'm strumming all 6 strings so I suspect it's doable. Do you know if any tools can show the results as dots-on-fretboard? Or maybe I need a more thorough reverse chord finder?
Ableton Live has an Audio to MIDI [1] feature that works pretty well to extract notes from audio. Melodyne [2] is a very powerful tool for messing around with notes in general.
There's also some open source projects out there, such as aubio [3] and Omnizart [4].
Automated music transcription - the process of generating a musical score from an audio recording - is a pretty active topic in signal processing (and has been for a couple of decades at least).
The monophonic case (one note at a time) is fairly well solved at this point: there are decades-old solutions in both the frequency domain (like the FFT) and the time domain (like auto-correlation and the dozens of refinements of that basic concept) that work quite well under less-than-ideal conditions and in near real-time. Even a naive solution like just counting the number of zero-crossings in the audio signal to estimate the fundamental frequency works pretty well.
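As an illustration of how simple the time-domain version can be, a bare-bones autocorrelation estimator looks roughly like this (a sketch; real implementations add windowing, peak interpolation and the refinements mentioned above):

```python
import numpy as np

def estimate_pitch(signal, sample_rate, fmin=50.0, fmax=1000.0):
    """Naive autocorrelation pitch estimator for a monophonic frame.
    Returns the estimated fundamental frequency in Hz."""
    signal = signal - signal.mean()
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]          # keep non-negative lags only
    lag_min = int(sample_rate / fmax)     # shortest plausible period
    lag_max = int(sample_rate / fmin)     # longest plausible period
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / lag
```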
The polyphonic case (like chord detection) is trickier, especially depending on what you're looking for exactly. I.e., is it sufficient to say "that's a C Major chord" or are you looking for a specific inversion or even fingering? Does it need to happen in real-time based off a microphone or could you batch-process an audio file instead?
But there are both academic solutions and consumer-oriented tools that can do a reasonable job of it (again, depending on what you're looking for).
If you're looking for guitar-chord detection in particular, I'd recommend you take a look at Chordify (https://chordify.net/). I'm even the developer of a product that competes with (or at least overlaps with) Chordify, but frankly it pretty much does what it says on the tin (extracts chords from audio recordings with more than acceptable fidelity, especially if you're willing and able to refine that by ear using the automated transcription as a starting point).
I'm pretty sure Chordify's solution is based on "deep learning" (ANN) techniques, but others have noted in this thread that's not the only viable way to do it. I suspect some combination of increasing computational power and algorithmic refinements will eventually lead to a "direct analysis" approach that becomes as common/conventional for polyphonic pitch detection as the FFT and auto-correlation are for the monophonic case. There are already a number of fairly effective techniques, depending on the constraints you want to put on the problem.
> The polyphonic case (like chord detection) is trickier
Very true, but for practical purposes chord detection is easier than polyphonic note transcription - it isn't necessary to transcribe all the notes with perfect fidelity to identify a likely chord, and there are many issues around note timing that become simpler when you assume one chord at a time.
> I'm pretty sure Chordify's solution is based on "deep learning" (ANN) techniques
At least at launch, I believe they were using a method more like that of Chordino - in fact using the same chromagram decomposition - but with a more sophisticated language model for chord transitions than Chordino's HMM.
This is the right approach for song recommendation too — try your approach there and see what happens. If you want help on the business side, reach out.
Can you expand on what you mean by "song recommendations" in this context? Do you mean recommendations like "if you like X you might also like Y"?
Assuming the answer is "yes", I'm not sure if I follow the leap from "non-negative matrix factorization as a bank of note templates (from their spectrograms)" to "song recommendations".
Very loosely speaking my (not wholly uninformed) interpretation of the "note templates" bit is sorta analogous to DFFT analysis with frequency bins centered around the "regular" notes of the chromatic scale - i.e., the cells of the matrix represent signal strength for the frequencies that correspond to conventional western notes (e.g. integer-valued midi note numbers) or (in aggregated form) octave-independent pitch-classes rather than arbitrary frequencies (that might fall between two conventional notes). It's a very useful representation of the component frequencies that appear in conventional music but at the end of the day it's more or less "here are the notes (or pitch classes) active for this given beat".
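To make the "bins centered on conventional notes" idea concrete, here's a rough sketch of folding an FFT magnitude spectrum into octave-independent pitch classes (my own illustration of the general idea, not any particular library's implementation):

```python
import numpy as np

def chroma_from_spectrum(magnitudes, sample_rate, n_fft):
    """Fold an FFT magnitude spectrum (e.g. np.abs(np.fft.rfft(frame))) into
    12 pitch classes by assigning each bin to the nearest equal-tempered note
    (A4 = 440 Hz)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], magnitudes[1:]):   # skip the DC bin
        midi = int(round(69 + 12 * np.log2(f / 440.0)))
        if 0 <= midi <= 127:
            chroma[midi % 12] += m
    return chroma
```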
That is, I can imagine the role this information - essentially the musical score itself, or at least the pitch-specific dimension of that score - could play in a song-recommendation engine, but I'm curious how/why that specific spectrogram-template-based representation is significant.
Are you suggesting that that specific representation could be applied to song recommendations in a way that similar polyphonic-pitch-per-beat information derived from some other algorithm (FFT analysis for a hypothetical example) could not?
Or maybe I've misinterpreted your comment entirely?
More likely, they mean using non-negative matrix factorization but with a bank of feature vectors instead of note templates. NNMF can be used in a wide variety of domains because it essentially encodes the problem of "this thing is a bit like this thing, a bit like this thing, and a bit like this other thing".
If instead of numbers representing intensity at different frequencies (as in the spectrograms), the numbers in each vector of the template bank represent other features (such as listener overlap with other artists/songs, or genre representation across multiple continuous "color" axes), then you can recommend music to a listener based on the similarity of the songs in their library to those in the template bank.
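In code, the recommendation step might be as simple as something like this (a toy sketch under the assumptions above; the feature vectors and what they encode are placeholders):

```python
import numpy as np

def recommend(library_vectors, catalog_vectors, top_k=5):
    """library_vectors: feature vectors for songs the listener already likes.
    catalog_vectors: feature vectors for candidate songs in the same feature
    space (e.g. NNMF activations against a shared template bank).
    Returns indices of the top_k catalog songs closest to the listener's taste."""
    taste = library_vectors.mean(axis=0)
    sims = catalog_vectors @ taste / (
        np.linalg.norm(catalog_vectors, axis=1) * np.linalg.norm(taste) + 1e-9
    )
    return np.argsort(sims)[::-1][:top_k]
```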
You're already leaps and bounds into a good partnership - realistic expectations. Consider it again when you get downtime, no rush over here, more fun.
I've also spent a fair bit of time on this topic and for what it's worth I agree with you. It is a harder problem than the monophonic case (and more sensitive to problems like noise under real-world conditions) but you don't strictly need deep learning or AI techniques to solve it.
I mean, computational complexity aside it seems like at least hypothetically you could even just apply basic auto-correlation-style logic to detect the period of the combined wave much like you do in the monophonic case (assuming the chord is sustained for long enough to actually capture that full period, which of course it won't be in the general case). There's nothing magical about a neural-net or other deep-learning-style solution to this problem - at the end of the day that's just an approximation of a formula that could in theory be derived through more direct means anyway. And (as far as I know) there's no reason to believe the polyphonic case is fundamentally resistant to more traditional techniques.
And as implied by your comment, the problem is easier (or at least less resource-intensive) in practice than it is in the abstract: we're mostly interested in audio that's composed of actual notes from the chromatic scale (rather than a combination of arbitrary frequencies). There are only ~140 component frequencies we really need to consider in practice. (Not to mention the semi-predictable repetition/progression patterns you're likely to encounter in most conventional songs. That's inadequate by itself but certainly a good way to error-correct, fill in gaps, resolve ambiguous cases, etc.)
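(For concreteness, that candidate set is just the equal-tempered frequencies, e.g. something like the snippet below; the exact MIDI range you care about depends on the instruments involved.)

```python
import numpy as np

# Equal-tempered fundamentals for MIDI notes 21..108 (A0..C8) -- on the order
# of a hundred-odd candidate frequencies rather than a continuum.
midi_notes = np.arange(21, 109)
candidate_freqs = 440.0 * 2.0 ** ((midi_notes - 69) / 12.0)
```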
But that said, it does seem like polyphonic pitch detection is a problem that responds really well to machine-learning techniques. In my experience, even a fairly simplistic ANN (e.g., no hidden layers, ~1k to ~10k weights depending upon how the inputs/outputs are modeled) - when seeded with a little bit of domain-specific knowledge - can very quickly learn to perform reliable polyphonic pitch detection under real-world conditions.
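For reference, a "no hidden layers" model like that is essentially multi-label logistic regression over spectral features. A minimal sketch of that shape of model (my own illustration, not anyone's production code; the feature dimension and pitch count are arbitrary):

```python
import numpy as np

class PitchDetector:
    """Single-layer (no hidden layer) multi-label model: one logistic unit per
    pitch, fed with a spectral feature vector (e.g. a log-magnitude spectrum).
    With ~128 input bins and 88 output pitches this is on the order of 10k weights."""
    def __init__(self, n_features, n_pitches=88):
        self.W = np.zeros((n_pitches, n_features))
        self.b = np.zeros(n_pitches)

    def predict(self, features, threshold=0.5):
        probs = 1.0 / (1.0 + np.exp(-(self.W @ features + self.b)))
        return probs > threshold   # boolean mask of active pitches

    def train_step(self, features, targets, lr=0.01):
        """One gradient-descent step on binary cross-entropy for one example."""
        probs = 1.0 / (1.0 + np.exp(-(self.W @ features + self.b)))
        err = probs - targets                  # (n_pitches,)
        self.W -= lr * np.outer(err, features)
        self.b -= lr * err
```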
To be fair, I haven't quite put my money where my mouth is on this topic (yet): I develop software that includes this sort of functionality and the current production version uses more conventional (or at least direct) analysis rather than so-called "deep learning" techniques for polyphonic pitch detection. There are pros and cons to either approach, but I can definitely see why some find the deep learning solution attractive. There's probably some degree of magical thinking involved (i.e., "AI will solve this pattern recognition problem that's too hard for me to work out from first principles"), but it also seems to work really well in this case.
For what it's worth I think you've got the right general idea, or at least (based on your brief description) I think I arrived at a solution that's based on some similar concepts and found it fairly effective (beyond the proof-of-concept phase). And as you noted there are related concepts discussed in some of the published academic research. I'd love to hear a little more about your approach if you're willing and able to share any more details. (Noting that at least part of my interest in that topic is selfish, of course.)
I don't think finding individual notes / the inherent notes within a chord is likely that difficult. The problem with this sort of thing is the nuance that's really involved - a player with a decent ear will be able to tell if the same A note is played on an open string, on the lower, fatter strings, or further up the neck. With chords, you also need to start considering voice leading (where you're specifically picking chord inversions for melody, which on guitar will affect how / where you play a chord) and how the mechanics fit together overall (I probably don't want to play an open C chord followed by something on the 11th fret, for example).
Aren't there a fair number of songs where the guitar is in some non-standard tuning or played with a capo, etc.? I imagine a naive ML approach would spit out some tabs that would be difficult to play without that context.
In addition to that, the same set of notes can represent different chords depending on the context and the chord's function, so instead of just picking up notes and recognizing them, the software has to actually understand the chord progression and the harmony.
Yeah exactly what I was going to say, you have to consider the chord voicing, the strum pattern, and whether the sequence of chords makes sense for a human to comfortably play.
Someone above mentioned: "Here comes the sun", which is a great example.
First gotcha: it sort of sounds fingerpicked, but it's not; it's flatpicked. Another unusual thing about that song is that it's played with a capo on the 7th fret.
Now I'm sure some software could figure out the individual notes, but I wouldn't be surprised if it transcribed it as a nonsense sequence of notes all over the G, B and E strings, instead of the arpeggiated chords as it's really played.
Thus propagating the NN "if all you have is a hammer" trend. I don't quite know the nuances of music theory, but could you not get away with traditional Fourier analysis? You just need to decompose the song into its constituent frequency "bins", right?
The trouble with analysing "notes" in a composition (or even just with a polyphonic instrument) is the pesky harmonics. The timbre of different instruments produce harmonics that happen to be the fundamental frequencies of other notes.
Would it be possible to tease out the harmonics by saying, for 'x' type of guitar tuned in 'y' way with 'z' effect, the harmonics look like this? Like if the fundamental frequency is f, then the whole note looks like 0 dB at f, -5 dB at 2f and -10 dB at 3f, or something. Then, when you're looking at the frequency domain, you start from the lower notes and say "hmm, looks like there's a fundamental at f, are the expected harmonics there?" If yes, that's the note; if no, it's something else.
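Something like this, roughly (a sketch of that heuristic; the dB offsets are made-up placeholders, like in your example, and in practice would come from measuring the instrument):

```python
import numpy as np

# Hypothetical harmonic profile for "x type of guitar": relative levels (dB)
# expected at 1f, 2f, 3f, ... These numbers are placeholders, not measurements.
EXPECTED_PROFILE_DB = [0.0, -5.0, -10.0, -14.0]

def matches_profile(spectrum_db, freqs, f0, tolerance_db=6.0):
    """Check whether the spectrum shows the expected harmonic pattern above a
    candidate fundamental f0. spectrum_db/freqs come from an FFT of the frame."""
    ref = None
    for k, expected in enumerate(EXPECTED_PROFILE_DB, start=1):
        bin_idx = np.argmin(np.abs(freqs - k * f0))   # nearest FFT bin
        level = spectrum_db[bin_idx]
        if ref is None:
            ref = level                               # level at the fundamental
        elif abs((level - ref) - expected) > tolerance_db:
            return False
    return True
```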
It is absolutely possible (and effective, depending on what you consider to be an adequate ROI) to apply this sort of heuristic to this domain.
The "holy grail" of a universal AMT that works with any number of instruments of any type played concurrently isn't exactly an intractable problem to begin with, but if you constrain the problem in various ways (to specific instruments, to known tunings, etc.) you can definitely take advantage of a priori knowledge about the "timbre" of the instrument and the way in which the sound wave evolves over the duration of the note/notes to work-around what would otherwise be more ambiguous data. The octave/harmonics problem is one example of the kind of problem that is much easier to eliminate (relative to the abstract case) if you can make assumptions about the type of instrument that is generating the sound.
The overtones generated by the vibrations of a guitar string (for example) follow a fairly specific and distinctive pattern. If you dig a little bit into the physics/mechanics by which a given instrument generates sound there is a lot of tell-tale information to take advantage of.
This is basically how you'd do it with non-negative matrix factorization. You take the spectrogram of a bunch of known notes for that type of guitar, and store them in a template library (just a vector of vectors, where each inner vector is a spectrogram of that note). Then NNMF determines how much of each template contributes to some given signal. The templates are the "harmonics look like this" thing you're talking about. It works pretty well.
By the way, the reason it is easier to use NNMF than to implement your suggestion as a heuristic (a rough sketch follows the list below) is that there's much more overlap between the different notes than you might think, and (worst of all) the timbre of the note:
1. evolves over time!
2. depends on the velocity of the note (how hard the string was strummed)
3. and the notes actually interact with each other: if you play an E, the A string will resonate slightly too because of the shared harmonics.
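Here's the rough sketch promised above: one common way to cope with 1 and 2 in the template approach is to store several templates per note (different points in the note's evolution, different velocities) and pool the activations per pitch. Illustrative only; the template bank and pitch mapping are assumed inputs.

```python
import numpy as np
from scipy.optimize import nnls

def note_activations(frame, templates, template_pitch):
    """Pool NNLS activations from several templates per note into one value per pitch.

    templates:      (n_freq_bins, n_templates), several columns per pitch,
                    e.g. attack/decay frames recorded at different velocities
    template_pitch: (n_templates,) integer pitch index for each template column
    """
    acts, _ = nnls(templates, frame)
    pooled = np.zeros(template_pitch.max() + 1)
    np.add.at(pooled, template_pitch, acts)  # sum activations belonging to the same pitch
    return pooled
```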
No, at the very least you also need to do instrument separation: e.g. the bass might match the guitar chord's root note at one moment and do something entirely different half a second later, and of course in a coherent song all the different instruments will merge into the same frequency bins, either directly or through their overtones. Also, for certain instruments and effects (e.g. guitar amps), the strongest peaks of the Fourier transform of the instrument's output may not necessarily match the notes that are actually being played.
As other commenters have expounded on, the short answer is "no". While I definitely agree that one must be careful about falling into the trap of thinking everything is a nail to be hammered with NNs, it's also pretty common to fall into the "this is easy, why don't you just..." trap for things that humans do with (relative) ease.
Well the NN has to have something to operate on, and I think a Fourier analysis may help as an input.
There is just too much going on in a track. It's really strange for us to say this mash-up of sounds is "really" A-D-E at the heart of it, and that when I play these chords it will suggest the wall of sound you heard on the record. The net is just capturing our biases.
Ah yes, Rocksmith+, which requires a subscription to use and will not work if you are not connected to the internet (i.e. if your subscription lapses, your purchase is worthless). It's now been delayed until 2022; I wonder if the feedback made them realize that while the board may love the idea of mandatory subscriptions, customers hate it.
They have two modes:
1. Transcribed tabs. They own the transcription as they have done with their catalogue in existing Rocksmith games. I believe they claim they use some “AI assistance” here, but I could be wrong. At the end of the day, they pay a large number of musicians to chew through songs and write tabs.
2. Chord following. They use automated transcription to get the chords for a very large number of songs. This is a new mode in the newest game and is entirely automated, but limited.
With OG Rocksmith (and 2014), the charts were supposedly all done by the folks at Ubisoft. They'd watch live videos and have multiple go-overs with each other to make it as accurate as possible. Besides, "official" tabs (at least historically) were usually not done by the actual artists, but some company that was just given the recordings and paid to figure it out.