Is it accurate to compare 8k example RL with 8k example SFT? RL with the same am...

johnthewise · 2025-01-26T13:34:47 1737898487

Emergent properties are nice. They show CoT now, but who knows if there is a better planning strategy? The second thing is that it kind of implies any base model's capability can be increased cheaply, just with some RL tuning. So in theory you can plug in any observable, quantifiable outcome beyond math and coding (stock returns, scientific experiment results?) and let it learn how to plan to solve the problem better. Train it on the observed effects of various drugs on people, and it then creates a customized treatment plan for you? The SFT version would be limited by doctors' opinions on why certain drugs affected the outcome, whereas the RL version can discover unknown relationships.