
With all my recent ML PhD knowledge, I cannot even explain how the model is able to do this. Surely the training distribution doesn't contain responses to every possible Linux command, let alone a full Python REPL? I'm stumped!


I'm just a layman, but I don't think anyone really expected, or knows _why_, just stacking a bunch of attention layers works so well. It's not immediately obvious that doing well at predicting the next token will somehow "generalize" into giving coherent answers to prompts. You can sort of squint and hand-wave an explanation, but if it were obvious this would work, you'd think people would have experimented with it before 2018.
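To make the "next-token prediction" objective concrete: a language model only ever answers the question "given the text so far, what character/token comes next?", and everything else falls out of applying that one step repeatedly. Here is a deliberately tiny sketch of that autoregressive loop using a bigram character model (my own toy corpus and function names, nothing like a real transformer) to show how repeated next-token prediction can reproduce a terminal-looking transcript seen in training:

```python
# Toy illustration: the training objective is just "predict the next
# character"; longer continuations come from applying that prediction
# repeatedly (autoregressive generation). Hypothetical example, not how
# GPT is implemented -- a transformer conditions on the whole context,
# not just the last character.
from collections import Counter, defaultdict

# A tiny "shell session" corpus, repeated a few times.
corpus = "$ pwd\n/home\n" * 3

# Count which character tends to follow which.
follows = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follows[a][b] += 1

def generate(prompt, n):
    """Greedily extend the prompt by n characters, one prediction at a time."""
    out = prompt
    for _ in range(n):
        candidates = follows[out[-1]].most_common(1)
        if not candidates:
            break
        out += candidates[0][0]  # most likely next character
    return out

print(repr(generate("$ pw", 8)))
```

Starting from the prompt `$ pw`, greedy decoding completes the command and its output (`$ pwd\n/home\n`), purely from next-character statistics. The surprising part the parent comment is pointing at is that scaling this idea up, with attention instead of bigram counts, yields behavior that looks like a working shell rather than memorized snippets.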


