Hacker News

> We also introduce TTRL (Test-Time Reinforcement Learning), where we perform reinforcement learning on variants of test problems at inference time. TTRL enables Qwen2.5 7B Deepseek-R1 Distilled to achieve a state-of-the-art score of 90% on the MIT Integration Bee qualifying examination, surpassing OpenAI o1's performance.

That's incredible!




Is there code for this paper, or something that does something similar?


I mean, they have a verifier, so couldn't they get to 90% just by having the net generate randomly and testing against the verifier until something is numerically correct? I think the final solve rate is less important than the generality of the approach.


No, they specifically test for this (the "RL" case). In particular, they cannot reach it with random generation, which is very interesting.


But it depends on how many attempts you let it generate. The right comparison is to spend the same test-time RL compute on plain generation and compare success rates (if you generate for long enough, you'll eventually hit the answer by chance).
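For concreteness, that rejection-sampling baseline might look something like this. Everything here is a toy sketch, not from the paper: the "model" just guesses random coefficients, and the "verifier" checks a candidate numerically at a few sample points, the way an integration answer can be checked without symbolic math.

```python
import random

def rejection_sample(generate, verify, budget):
    """Best-of-N baseline: draw candidates until one passes the
    verifier or the attempt budget is exhausted."""
    for attempt in range(1, budget + 1):
        cand = generate()
        if verify(cand):
            return cand, attempt
    return None, budget

# Toy problem (hypothetical): recover the antiderivative of 2x,
# i.e. x^2. The "model" guesses an integer coefficient c for c*x^2.
target = lambda x: x * x

def generate():
    c = random.randint(-5, 5)  # a random guess standing in for the net
    return lambda x, c=c: c * x * x

def verify(cand):
    # Numeric check at a few sample points, no symbolic reasoning.
    return all(abs(cand(x) - target(x)) < 1e-9 for x in (0.5, 1.0, 2.0))

random.seed(0)
sol, n = rejection_sample(generate, verify, budget=1000)
print(sol is not None, n)
```

With an 11-way uniform guess, ~1000 attempts all but guarantee a hit, which is the commenter's point: the success rate of pure generation depends entirely on the attempt budget, so it only becomes a fair baseline when that budget matches the test-time RL compute.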



