Instead of feeding it raw sound samples, we could split the input into slices of 16 sec, 8 sec, 4 sec and so on, assign each slice a "sound vector" serving as a short description of that slice, and let the generator take those sound vectors as input.

I didn't quite get it. How would you feed this variable-sized input?



To illustrate this idea further, let's take the soundtrack v=negh-3hi1vE on YouTube. Such soundtracks consist of multiple more or less repeating patterns, and the period of each pattern is different: some background pattern that sets the mood of the music may have a long period of tens of seconds, while the primary pattern that's playing right now has a short period of 0.25 seconds, plays for a few seconds and then fades out. The idea is to split the soundtrack into 10 second chunks and map each chunk to a vector of a fixed size, say 128, just as we do with words. Now we have a sequence of shape (?, 128) that can in theory be fed into a music generator, and as long as we can map such vectors back to 10 second sound chunks, we can generate music. Then we introduce a similar sequence that splits the soundtrack into 5 second chunks, then another sequence for 2.5 second chunks and so on. Now we have multiple sequences that we can feed to the generator. Currently we take 1/48000th of a second slices and map them to vectors, but that's about as good as trying to generate meaningful text by drawing it pixel by pixel (which we can surely do, and the model will have 250 billion weights and take 2 million years to train on commodity hardware).
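A rough numpy sketch of the chunking half of this idea. The "encode" step here is only a stand-in (per-frame RMS energy); the real chunk-to-vector mapping would be a trained model, which is the hard part.

  # Chunk 48 kHz audio at several time scales and map each chunk to a
  # fixed-size vector. The encoder below is a placeholder, not a real model.
  import numpy as np

  SAMPLE_RATE = 48_000
  VECTOR_DIM = 128

  def chunk(audio: np.ndarray, seconds: float) -> np.ndarray:
      """Split audio into non-overlapping chunks of the given length."""
      size = int(seconds * SAMPLE_RATE)
      n = len(audio) // size
      return audio[: n * size].reshape(n, size)

  def encode(chunks: np.ndarray) -> np.ndarray:
      """Stand-in for a learned encoder: 128 numbers per chunk (frame RMS)."""
      n, size = chunks.shape
      usable = (size // VECTOR_DIM) * VECTOR_DIM
      frames = chunks[:, :usable].reshape(n, VECTOR_DIM, -1)
      return np.sqrt((frames ** 2).mean(axis=-1))   # shape (n, 128)

  audio = np.zeros(60 * SAMPLE_RATE, dtype=np.float32)   # one minute of "music"
  for seconds in (10, 5, 2.5):
      sequence = encode(chunk(audio, seconds))
      print(f"{seconds}s chunks -> sequence of shape {sequence.shape}")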


How would you map these chunks to vectors?


The same way we map words to vectors or entire pictures to vectors. We'd have another ML model that takes 1 second of sound as input (48000 one-byte numbers) and produces, say, a vector of 128 float32 numbers that "describes" this 1 second of sound.
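For example, a strawman version of such an encoder in PyTorch might look like the following; the architecture, layer sizes and training objective (autoencoder, contrastive, whatever) are all placeholders of mine, not something established.

  # A toy encoder: 48000 raw samples in, a 128-float "sound vector" out.
  import torch
  import torch.nn as nn

  class ChunkEncoder(nn.Module):
      def __init__(self, vector_dim: int = 128):
          super().__init__()
          self.net = nn.Sequential(
              nn.Conv1d(1, 32, kernel_size=400, stride=160),  # ~25 ms frames
              nn.ReLU(),
              nn.Conv1d(32, 64, kernel_size=8, stride=4),
              nn.ReLU(),
              nn.AdaptiveAvgPool1d(1),                        # pool over time
              nn.Flatten(),
              nn.Linear(64, vector_dim),
          )

      def forward(self, samples: torch.Tensor) -> torch.Tensor:
          # samples: (batch, 48000) raw audio, scaled to [-1, 1]
          return self.net(samples.unsqueeze(1))               # (batch, 128)

  encoder = ChunkEncoder()
  one_second = torch.randn(4, 48000)       # a batch of 4 one-second chunks
  print(encoder(one_second).shape)         # torch.Size([4, 128])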


What would be an equivalent of a word for music?


1 second of sound. Or a few seconds of sound.


This would rule out such common mapping methods as word2vec, because, unlike words, the vast majority of 1 sec chunks of audio would be unique (or would only repeat within a single recording).


That's fine. The goal is to map "similar" 1 second chunks to similar vectors. I'm sure this can be done, and the uniqueness of the sounds won't be a problem.


Sure, we can probably find a way to map two similar chunks to two similar vectors. However, with a 1:1 mapping the resulting vectors will be just as unique. That's a problem because, if you recall, we want to predict the next unit of music based on the units the model has seen so far. Training a model for this task requires showing it sequences of encoded units of music (vectors), where we must have many examples of how a particular vector follows a combination of other particular vectors. If most of our vectors are unique, we won't have enough examples to train the model. For example, after seeing multiple examples of the phrase "I'm going to [some verb]", the model will eventually learn that "to" after "I'm going" is quite likely, that a verb is more likely after "to" than an adjective, etc. This wouldn't happen if the model saw "going" or "to" only once during training.
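To make the training setup concrete, it would be something along these lines (a sketch with random tensors standing in for real sound vectors; whether a simple regression loss on continuous vectors even works here is part of the problem being discussed).

  # Predict the next sound vector from the previous ones with an LSTM.
  import torch
  import torch.nn as nn

  VECTOR_DIM = 128
  model = nn.LSTM(input_size=VECTOR_DIM, hidden_size=256, batch_first=True)
  head = nn.Linear(256, VECTOR_DIM)
  optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()))

  sequence = torch.randn(1, 100, VECTOR_DIM)   # 100 encoded chunks of one track
  inputs, targets = sequence[:, :-1], sequence[:, 1:]

  for step in range(10):
      hidden_states, _ = model(inputs)         # (1, 99, 256)
      prediction = head(hidden_states)         # predicted next vectors
      loss = nn.functional.mse_loss(prediction, targets)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()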


Can we diff spectrograms to define the "distance" between two chunks of sound and use this measure to guide the ML learning process?

Would it help to decompose sound into subpatterns with Fourier transform?

Afaik, there is a similar technique for recognizing faces: a face picture is mapped to a "face vector". Yet this technique doesn't need the notion of "sequence of faces" to train the model. Can we use it to get "sound vectors"?
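To make the first question concrete, here's one way the "spectrogram diff" could be computed (just an illustration using scipy's STFT; whether this particular distance is the one that would actually guide training is the open question).

  # L2 distance between log-magnitude spectrograms of two audio chunks.
  import numpy as np
  from scipy.signal import stft

  SAMPLE_RATE = 48_000

  def spectrogram_distance(chunk_a: np.ndarray, chunk_b: np.ndarray) -> float:
      _, _, spec_a = stft(chunk_a, fs=SAMPLE_RATE, nperseg=1024)
      _, _, spec_b = stft(chunk_b, fs=SAMPLE_RATE, nperseg=1024)
      log_a = np.log1p(np.abs(spec_a))
      log_b = np.log1p(np.abs(spec_b))
      return float(np.linalg.norm(log_a - log_b))

  # Two one-second chunks of noise; identical chunks give distance 0.
  a = np.random.randn(SAMPLE_RATE)
  b = np.random.randn(SAMPLE_RATE)
  print(spectrogram_distance(a, a), spectrogram_distance(a, b))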


How would you use spectrogram diffs for training?

I'm not sure what useful "subpatterns" of sound would be. In language modeling, there are word-based and character-based models. Given enough text, an RNN can be trained on either, and I'm not sure which approach is better. For music, the closest equivalent of a word is (probably) a chord, and the closest equivalent of a character is (probably) a single note, but perhaps it should be something like a harmonic, I don't know.

Unlike faces, music is a sequence (of sounds). It's closer to video than to an image. So we need to chop it up and to encode each chunk.

Ultimately, I believe that we just need a lot of data. Given enough data, we can train a model that is large enough to learn everything it needs in an end-to-end fashion. The primary achievement of the GPT-2 paper is training a big model on lots of data. In this work, it appears they only used a couple of available MIDI datasets for training, which is probably not enough. Training on all available audio recordings (either raw, or converted to a symbolic format) would probably be a game changer.


The same way we feed a variable-size sequence of characters or sound samples into an RNN. Instead of raw samples at the 16 kHz rate, we'll have one sequence with 1 sample per second, another sequence with 1 sample per 0.5 seconds and so on. We can go as far as 1 sample per 1/48000 sec, but I don't think that's practical (though this is what these music generators do).


What do you mean by “sample” when you say “sequence of 1 sample per second”?


We can think of a ML model that takes 1 second of sound as input and produces a vector of fixed length that describes this sound:

S[0..n] = the raw input, 48000 bytes per second of sound
F[1][k..k+48000] -> [0..255], maps 1 second of sound to a "sound vector"
F[2][k..k+96000] -> ..., same, but takes 2 seconds of sound as input

Now instead of the raw input S, we can use the sequences F[1], F[2], etc. Supposedly, F[10] would detect patterns that change every 10 seconds. It's common in soundtracks to have some background "mood" melody that changes a bit every 10-15 seconds, then a louder and faster melody that changes every 5 seconds, and so on, up to some very frequent patterns like F[0.2] that are used in drum'n'bass or electronic music in general.

This is how music is composed by people, I guess. Most electronic music can be decomposed into 5-6 patterns that repeat with almost mathematical precision. The artist only randomly changes the parameters of each layer over the course of the track, e.g. layer #3 with a period of 7 seconds slightly changes frequency for the next 20 seconds, etc.

Masterpieces have the same multilayered structure, except that those subpatterns are more complex.


We can think of a ML model that takes 1 second of sound as input and produces a vector of fixed length that describes this sound

You mean like an autoencoder?

Ok, assuming we have those sequences (F1, F2, F10, etc), how would you combine them to train the model?


I'm not an ML guy, so can't say if this is an autoencoder.

We can combine multiple sequences in any way we want. Obviously, we can come up with some nice-looking "tower of LSTMs" where each level of the tower processes the corresponding F[i] sequence: sequence F1 goes to level T1, which is a bunch of LSTMs; then F2 and the output of T1 go to T2, and so on (see the sketch below). The only things that I think matter are (1) feeding all these sequences to the model and (2) having enough weights in the model. And obviously a big GPU farm to run experiments.
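Something like this strawman, which assumes each F[i] doubles the chunk length and simply average-pools the lower level's output so the sequence lengths line up; lots of other wirings are possible.

  # A "tower of LSTMs": level i consumes F[i] concatenated with the
  # (downsampled) hidden states of the level below it.
  import torch
  import torch.nn as nn

  VECTOR_DIM, HIDDEN = 128, 256

  class LstmTower(nn.Module):
      def __init__(self, levels: int = 3):
          super().__init__()
          self.levels = nn.ModuleList()
          for i in range(levels):
              in_dim = VECTOR_DIM if i == 0 else VECTOR_DIM + HIDDEN
              self.levels.append(nn.LSTM(in_dim, HIDDEN, batch_first=True))

      def forward(self, sequences: list) -> torch.Tensor:
          # sequences[i] has shape (batch, length / 2**i, VECTOR_DIM)
          below = None
          for lstm, seq in zip(self.levels, sequences):
              if below is not None:
                  # halve the time resolution of the lower level's output
                  below = below.transpose(1, 2)
                  below = nn.functional.avg_pool1d(below, kernel_size=2)
                  below = below.transpose(1, 2)
                  seq = torch.cat([seq, below], dim=-1)
              below, _ = lstm(seq)
          return below  # hidden states of the top level

  tower = LstmTower()
  f1 = torch.randn(1, 64, VECTOR_DIM)   # e.g. 1-second chunk vectors
  f2 = torch.randn(1, 32, VECTOR_DIM)   # 2-second chunk vectors
  f4 = torch.randn(1, 16, VECTOR_DIM)   # 4-second chunk vectors
  print(tower([f1, f2, f4]).shape)      # torch.Size([1, 16, 256])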


Ok, but if we are using a hierarchical model like multilayer lstm, shouldn’t we expect it to learn to extract the relevant info at multiple time scales? I mean, shouldn’t the output of T1 already contain all the important info in F2? If not, what extra information do you hope to supply there via F2?


T1 indeed contains all the info needed, but T1 also has limited capacity and can't capture long patterns. T1 would need hundreds of billions of weights to capture minute-long patterns. I think this idea is similar to the often-used skip connections.


But the job of T1 is not to capture long term patterns, it’s to extract useful short scale features for T2 so that T2 could extract longer term patterns. T3 would hopefully extract even longer scale patterns from T2 output, and so on. That’s the point of having the lstm hierarchy, right?

Why would you try to manually duplicate this process by creating F1, F2, etc?

The idea of skip connections would be like feeding T1 output to T3, in addition to T2. Again, I’m not sure what useful info F sequences would supply in this scenario.


This sounds reasonable, but I think in practice the capacity of T1 won't be enough to capture long patterns, and the F2 sequence is supposed to help T2 restore the lost info about the longer patterns. The idea is to make T1 really good at capturing small patterns, like speech in pop music, while T2 would be responsible for background music with longer patterns.

Don't we already do this with text translation? Why not let one model read printed text pixel by pixel and another model produce a translation, also pixel by pixel? Instead, we choose to split printed text into small chunks (that we call words), give every chunk a "word vector" (those word2vec models), and produce text also one word at a time.



