
It's probably only a matter of time before we have a GauGAN-like interface for synthetic music creation... so you could say 'I want a sad song with a soft intro, a buildup of tension here, and lyrics covering these emotions, and it should last 7 minutes'.

ML/DL is coming for a lot of the grunt work. It's coming for us as programmers as well. It's probably a few years away, but ML/DL will get there.



Given how easy it is to train a Transformer on any sequence data, and given how plentiful open source code is, I'd say "CodeNet" is probably less than a year away. OpenAI will probably do it first given they already have the setup.
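
For scale, the core of such an experiment is surprisingly small. A toy sketch of the idea, training a character-level Transformer on a pile of source code treated as plain sequence data (assumes PyTorch; the filename and hyperparameters are made up, and this is nowhere near OpenAI's actual setup):

    import torch
    import torch.nn as nn

    corpus = open("corpus_of_code.txt").read()   # hypothetical dump of source files
    chars = sorted(set(corpus))
    stoi = {c: i for i, c in enumerate(chars)}
    data = torch.tensor([stoi[c] for c in corpus])

    d_model, seq_len, batch = 128, 64, 32
    embed = nn.Embedding(len(chars), d_model)    # token embeddings
    pos = nn.Embedding(seq_len, d_model)         # learned positional embeddings
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)
    head = nn.Linear(d_model, len(chars))
    params = [p for m in (embed, pos, encoder, head) for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=3e-4)
    # Causal mask so each position only attends to earlier characters
    mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

    for step in range(1000):
        idx = torch.randint(0, len(data) - seq_len - 1, (batch,)).tolist()
        x = torch.stack([data[j:j + seq_len] for j in idx])          # inputs
        y = torch.stack([data[j + 1:j + seq_len + 1] for j in idx])  # next-char targets
        h = embed(x) + pos(torch.arange(seq_len))
        logits = head(encoder(h, mask=mask))
        loss = nn.functional.cross_entropy(logits.reshape(-1, len(chars)), y.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

Everything hard about "CodeNet" is in the data and the scale, not in the model code.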


I'm working on this.

I've been training on Stack Overflow and the model has already learned the syntaxes and common coding conventions of a bunch of different languages all on its own. Excited to see what else it's able to do as I keep experimenting.

Some sample outputs (you'll probably want to browse some of the "Random" questions, because by default it's showing "answers" right now and I haven't trained that model as long as some of the older question-generation ones): https://stackroboflow.com


I've tried it as well and got good syntactic results. For programs that make more sense, I think we'll need more layers and attention heads. Perhaps someone will fork GPT-2 and add the Sparse Transformer to it.


These are actually a lot of fun to read. Kudos!


That CodeNet would essentially be Skynet. What's shown here looks impressive, but it's the same good old text generator that produces output very similar to the dataset it was trained on. It can't go beyond the dataset and generate something genuinely new. From a mathematical point of view, the generator interpolates between samples from the dataset to produce a new sample.

To give an idea of how big the gap is between MuseNet and CodeNet, consider the simple problem of reversing a sequence: [1,2,3,4,5] should become [5,4,3,2,1], and so on. How many samples do you need to look at to understand how to reverse an arbitrary sequence of numbers? Do you need to retrain your brain to reverse a sequence of pictures? No, because instead of memorizing the given samples, you looked at a few and built a mental model of "reversing a sequence of things". Now, state-of-the-art ML models can reverse sequences as long as they use the same numbers as in the dataset, i.e. we can train them to reverse any sequence of the numbers 1..5 or 1..50, but once we add a 6 to the input, the model instantly fails, no matter how complex and fancy it is. I don't even dare add a letter to the input. The reason? 6 isn't among the samples it has learned to interpolate. And CodeNet is supposed to generate a C++ program that reverses any sequence, btw.
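
To make the failure concrete, here's a toy version of that experiment (assumes PyTorch; the architecture is deliberately dumb, because the architecture isn't the point):

    import torch
    import torch.nn as nn

    L, vocab, d = 5, 8, 32   # fixed length 5; ids 6 and 7 exist but are never trained on
    emb = nn.Embedding(vocab, d)
    mlp = nn.Sequential(nn.Flatten(), nn.Linear(L * d, 256), nn.ReLU(),
                        nn.Linear(256, L * vocab))
    opt = torch.optim.Adam(list(emb.parameters()) + list(mlp.parameters()), lr=1e-3)

    for step in range(3000):
        x = torch.randint(1, 6, (64, L))   # train only on sequences of the tokens 1..5
        y = x.flip(1)                      # target: the reversed sequence
        logits = mlp(emb(x)).view(64, L, vocab)
        loss = nn.functional.cross_entropy(logits.transpose(1, 2), y)
        opt.zero_grad(); loss.backward(); opt.step()

    def reverse(seq):
        return mlp(emb(torch.tensor([seq]))).view(1, L, vocab).argmax(-1)

    print(reverse([1, 2, 3, 4, 5]))  # in-distribution: typically [[5, 4, 3, 2, 1]]
    print(reverse([2, 3, 6, 4, 5]))  # the embedding for 6 was never trained;
                                     # its slot comes out as noise

The model learned "reverse these particular tokens", not "reverse a sequence of things".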

At the moment, ML is kinda stuck at this sample-interpolation stage. For AI, we don't need to interpolate samples; we need to build a "mental model" of what those samples represent, and as far as I know, we have no clue how to even approach this problem.


Yeah, I know what you are saying... But let's just let somebody try this experiment (and somebody eventually will), and we can judge what can or cannot be learned by the results.

We will definitely get a great code autocompleter at the very least.


Can you explain? I'm not an expert on ML by any stretch of the imagination, but you'd think that, given the stringent logical coherence required to construct useful programs, it'd be a pretty subpar use case. Or do you mean smaller-scope tools to aid programming, like linters and autocompleters?


I wonder if you could find a representation for computer programs that eliminated all the degrees of freedom that correspond to syntax errors, leaving only valid programs. In a sense that's what an AST is, but you can still have invalid ASTs. I bet it would be a lot easier to generate interesting programs in a representation like that.
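
You can get a feel for this by generating from a grammar instead of generating tokens: every random choice is syntactically valid by construction. A minimal Python sketch with a made-up toy expression grammar:

    import random

    def gen_expr(depth=0):
        # Leaves: variables or small constants
        if depth > 2 or random.random() < 0.4:
            return random.choice(["x", "y", str(random.randint(0, 9))])
        op = random.choice(["+", "-", "*"])
        return f"({gen_expr(depth + 1)} {op} {gen_expr(depth + 1)})"

    for _ in range(3):
        src = gen_expr()
        compile(src, "<gen>", "eval")   # never raises SyntaxError, by construction
        print(src)

A generative model would then pick among the grammar's productions rather than among raw characters, so "syntax error" simply isn't in its output space.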


There's Cartesian genetic programming, and there are Lisp-like models that encode a program as a tree where all combinations are valid. Combined with recent work on graph convolutional networks, this might be a good approach.
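
For anyone unfamiliar, the CGP trick is that a fixed-length integer genome always decodes to a valid feed-forward graph, so mutation and crossover can never produce a syntax error. A rough sketch (heavily simplified):

    import random
    import operator

    FUNCS = [operator.add, operator.sub, operator.mul]

    def random_genome(n_inputs=2, n_nodes=4):
        # Each node: (function id, source a, source b); sources may only point
        # at program inputs or earlier nodes, so the graph is always acyclic.
        return [(random.randrange(len(FUNCS)),
                 random.randrange(n_inputs + i),
                 random.randrange(n_inputs + i)) for i in range(n_nodes)]

    def evaluate(genome, inputs):
        values = list(inputs)
        for f, a, b in genome:
            values.append(FUNCS[f](values[a], values[b]))
        return values[-1]   # by convention here, the last node is the output

    g = random_genome()
    print(g, "->", evaluate(g, (3, 4)))   # any genome whatsoever evaluates cleanly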


There already is TabNine, a machine-learning autocompleter that works for all programming languages.

https://tabnine.com/



