
Is it fair to compare RL on 8k examples with SFT on 8k examples? RL on the same number of examples would take far more compute than SFT, with the exact gap depending on how many rollouts are sampled per example.

RL is more data-efficient, but that may not matter now that we can just use DeepSeek-R1's responses as the training data.
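
To put a rough number on the compute gap, here is a back-of-the-envelope sketch. It is not how any particular paper measured things. It assumes the common approximations of about 6 FLOPs per parameter per token for a training step and about 2 FLOPs per parameter per token for generation, and it assumes every rollout is both sampled and trained on once. The function names and the 7B / 1,000-token / 8-rollout figures are made up for illustration, and real RL setups add costs this ignores (reference-model passes, reward computation, slow autoregressive decoding).

```python
def sft_flops(n_params, n_examples, tokens_per_example):
    """Approximate FLOPs for one SFT pass: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_examples * tokens_per_example


def rl_flops(n_params, n_examples, tokens_per_rollout, rollouts_per_example):
    """Approximate FLOPs for one RL pass under the assumptions stated above.

    Each example is expanded into several sampled rollouts. Every sampled
    token costs ~2 FLOPs per parameter to generate and ~6 more to train on.
    """
    sampled_tokens = n_examples * rollouts_per_example * tokens_per_rollout
    generation = 2 * n_params * sampled_tokens
    update = 6 * n_params * sampled_tokens
    return generation + update


def rl_to_sft_ratio(rollouts_per_example, n_params=7_000_000_000,
                    n_examples=8_000, tokens=1_000):
    """Ratio of RL to SFT compute when responses have the same length."""
    rl = rl_flops(n_params, n_examples, tokens, rollouts_per_example)
    sft = sft_flops(n_params, n_examples, tokens)
    return rl / sft


if __name__ == "__main__":
    # Hypothetical rollout counts, chosen only to show how the ratio scales.
    for k in (1, 4, 8, 16):
        print(f"{k} rollouts per example: RL/SFT compute ratio = {rl_to_sft_ratio(k):.2f}")
```

Under these assumptions the ratio works out to 4k/3 for k rollouts per example, regardless of model size or dataset size, so the "same 8k examples" framing hides roughly an order of magnitude of compute once k reaches 8 or so.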



Emergent properties are nice. The models show CoT now, but who knows whether there is a better planning strategy? The second thing is that it kind of implies any base model can be made more capable cheaply, just with some RL tuning. So in theory you could plug in any observable, quantifiable outcome beyond math and coding (stock returns, scientific experiment results?) and let the model learn how to plan toward it. Train on the observed effects of various drugs on people, and it creates a customized treatment plan for you? The SFT version would be limited by doctors' opinions on why certain drugs affected the outcome, whereas the RL version could discover unknown relationships.



