We can think of an ML model that takes 1 second of sound as input and produces a fixed-length vector that describes this sound:
S[0..n] = the raw input, 48000 bytes per second of sound
F[1][k..k+48000] -> [0..255], maps 1 second of sound to a "sound vector".
F[2][k..k+96000] -> ..., same, but takes 2 seconds of sound as input
Now, instead of the raw input S, we can use the sequences F[1], F[2], etc. Presumably, F[10] would detect patterns that change every 10 seconds. It's common in soundtracks to have some background "mood" melody that changes a bit every 10-15 seconds, then a louder and faster melody that changes every 5 seconds, and so on, down to very frequent patterns like F[0.2] that are used in drum'n'bass or electronic music in general.
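To make that concrete, here's a rough sketch of how one of those F[i] extractors could be wired up (the 64-dimensional output and the FFT-based toy encoder are my own placeholders, not something the setup above prescribes; a real F[i] would be a trained model):

```python
import numpy as np

SAMPLE_RATE = 48_000  # samples (bytes) per second of sound, as above

def sound_vectors(s: np.ndarray, window_seconds: float, encode) -> np.ndarray:
    """Slice the raw signal S into consecutive windows of the given length
    and map each window to a fixed-length "sound vector" with encode().
    This builds the sequence F[window_seconds] from S."""
    window = int(window_seconds * SAMPLE_RATE)
    n_windows = len(s) // window
    return np.stack([encode(s[k * window:(k + 1) * window])
                     for k in range(n_windows)])

# Placeholder encoder: anything that maps a window to a fixed-length vector
# works here; in the real setup F[i] would be a trained network.
def toy_encode(window: np.ndarray) -> np.ndarray:
    spectrum = np.abs(np.fft.rfft(window))
    return spectrum[:64]  # 64-dim "sound vector" (arbitrary choice)

s = np.random.randn(10 * SAMPLE_RATE)      # 10 seconds of fake audio
f1 = sound_vectors(s, 1.0, toy_encode)     # F[1]: one vector per second
f2 = sound_vectors(s, 2.0, toy_encode)     # F[2]: one vector per 2 seconds
```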
This is how music is composed by people, I guess. Most electronic music can be decomposed into 5-6 patterns that repeat with almost mathematical precision. The artist only occasionally changes the parameters of each layer during the track, e.g. layer #3, with a period of 7 seconds, slightly shifts its frequency for the next 20 seconds, etc.
Masterpieces have the same multilayered structure, except that those subpatterns are more complex.
I'm not an ML guy, so I can't say whether this is an autoencoder.
We can combine multiple sequences in any way we want. Obviously, we can come up with some nice-looking "tower of LSTMs" where each level of the tower processes the corresponding F[i] sequence: sequence F1 goes to level T1, which is a bunch of LSTMs; then F2 and the output of T1 go to T2, and so on. The only things that I think matter are (1) feeding all these sequences to the model and (2) having enough weights in it. And, obviously, a big GPU farm to run experiments.
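Here's roughly what I mean by that tower, in PyTorch (the layer sizes, the way T1's output is downsampled to line up with F2, and the concatenation are just my guesses at one reasonable wiring, not a prescription):

```python
import torch
import torch.nn as nn

class LstmTower(nn.Module):
    """T1 consumes F1; T2 consumes F2 concatenated with T1's (coarsened) output."""
    def __init__(self, f1_dim: int, f2_dim: int, hidden: int = 128):
        super().__init__()
        self.t1 = nn.LSTM(f1_dim, hidden, batch_first=True)
        self.t2 = nn.LSTM(f2_dim + hidden, hidden, batch_first=True)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # f1: (batch, 2*T, f1_dim) -- one vector per second
        # f2: (batch, T, f2_dim)   -- one vector per 2 seconds
        h1, _ = self.t1(f1)
        # Align time scales: average every two 1-second steps of T1's output
        # so it lines up with the 2-second steps of F2 (one possible choice).
        h1_coarse = h1.reshape(h1.shape[0], -1, 2, h1.shape[-1]).mean(dim=2)
        h2, _ = self.t2(torch.cat([f2, h1_coarse], dim=-1))
        return h2

tower = LstmTower(f1_dim=64, f2_dim=64)
f1 = torch.randn(4, 20, 64)   # 20 one-second sound vectors
f2 = torch.randn(4, 10, 64)   # 10 two-second sound vectors
out = tower(f1, f2)           # shape: (4, 10, 128)
```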
Ok, but if we are using a hierarchical model like a multilayer LSTM, shouldn't we expect it to learn to extract the relevant info at multiple time scales? I mean, shouldn't the output of T1 already contain all the important info in F2? If not, what extra information do you hope to supply there via F2?
T1 indeed contains all the info needed, but T1 also has limited capacity and can't capture long patterns. T1 would need hundreds of billions of weights to capture minute-long patterns. I think this idea is similar to the commonly used skip connections.
But the job of T1 is not to capture long-term patterns, it's to extract useful short-scale features for T2 so that T2 can extract longer-term patterns. T3 would hopefully extract even longer-scale patterns from T2's output, and so on. That's the point of having the LSTM hierarchy, right?
Why would you try to manually duplicate this process by creating F1, F2, etc.?
The idea of skip connections would be like feeding T1's output to T3, in addition to T2. Again, I'm not sure what useful info the F sequences would supply in this scenario.
This sounds reasonable, but I think in practice the capacity of T1 won't be enough to capture long patterns, and the F2 sequence is supposed to help T2 restore the lost info about the longer patterns. The idea is to make T1 really good at capturing short patterns, like speech in pop music, while T2 would be responsible for the background music with longer patterns.
Don't we already do this with text translation? Why not let one model read printed text pixel by pixel and another model produce a translation, also pixel by pixel? Instead, we choose to split the printed text into small chunks (which we call words), give every chunk a "word vector" (those word2vec models), and produce the output one word at a time.
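For reference, the text-side analogy looks roughly like this (the toy vocabulary and the 8-dimensional embeddings stand in for a real trained word2vec table):

```python
import torch
import torch.nn as nn

# Split printed text into chunks ("words") instead of feeding raw pixels.
text = "the cat sat on the mat"
words = text.split()

# Map each chunk to a fixed-length "word vector"; in word2vec this table is
# learned from data, here it's just a randomly initialised lookup.
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

ids = torch.tensor([vocab[w] for w in words])
word_vectors = embed(ids)   # shape: (6, 8), one vector per word
```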