The way you taught chain-of-thought before was with supervised fine-tuning (SFT). During training, you have to rate every sentence of reasoning the model writes, many times, to nudge it to reason correctly.
But this approach to teaching chain-of-thought doesn’t do that. In this post, they take a small model (7B) that already knows math. Then they give it a relatively small number of problems to solve (8k). They use a simple reinforcement learning loop where the only goal is to get the problem right. They don’t care how the model got the right answer, just that it’s correct.
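Roughly, a loop like that might look like the sketch below. This is just a minimal illustration under assumptions, not the post’s actual code: `generate` and `update_policy` are hypothetical callables (the update could be PPO/GRPO-style), and `extract_final_answer` is a toy heuristic.

```python
import random
import re
from typing import Callable, List, Tuple

def extract_final_answer(completion: str) -> str:
    """Crude heuristic: take the last number in the completion as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def outcome_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0 -- the reasoning itself is never scored."""
    return 1.0 if extract_final_answer(completion) == gold_answer else 0.0

def rl_loop(
    problems: List[Tuple[str, str]],                       # (question, gold_answer) pairs, ~8k of them
    generate: Callable[[str], List[str]],                  # hypothetical: sample several completions from the current policy
    update_policy: Callable[[str, List[str], List[float]], None],  # hypothetical policy-gradient step
    num_steps: int = 1_000,
) -> None:
    for _ in range(num_steps):
        question, gold = random.choice(problems)
        completions = generate(question)
        rewards = [outcome_reward(c, gold) for c in completions]
        # Only correctness flows into the update; any chain-of-thought that
        # helps the model land on the right answer gets reinforced as a side effect.
        update_policy(question, completions, rewards)
```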
This is part of the “recipe” that DeepSeek used to create R1.
After many iterations, just like DeepSeek, they found that the model has an “aha” moment. It starts emitting chains of thought where it didn’t before. And then it starts getting the math answers right.
This is the gist of it. I don’t fully understand the recipe involved.
Can you teach small models to “think” just using RL? How small can they be? What tasks does this work for? Is just RL best for this? RL+SFT? Everyone’s trying to figure it out.
That's a nice explanation. Are there any insights so far in the field about why chain of thought improves the capability of a model? Does it provide the model with more working memory or something in the context itself?
I don’t think there’s consensus. Some papers have shown that just giving the model more tokens improves the results, i.e. chain of thought allows more computation to happen and that alone is enough. Others have argued that the smaller steps themselves are easier to solve, and thus it’s easier to reach the right answer.
I think CoT is important because it’s _free_. You adjust the prompt (model input) after training is done; no further training is needed. Magically, this seems to work. That makes it hard to beat even if you did have a clear understanding of the mechanism at a more fundamental level.
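To make the “free” part concrete, here’s a toy illustration (my own, with `ask_model` as a hypothetical stand-in for whatever inference call you use): the only difference between plain prompting and CoT prompting is a line of text appended to the input.

```python
# Chain-of-thought prompting changes only the input text, never the weights.
# `ask_model` is a hypothetical stand-in for an inference API.

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

direct_prompt = f"{question}\nAnswer:"
cot_prompt = f"{question}\nLet's think step by step."

# answer_direct = ask_model(direct_prompt)  # model must jump straight to a number
# answer_cot = ask_model(cot_prompt)        # reasoning tokens buy extra computation before the answer
```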
Chain of thought breaks a problem down into smaller chunks, which are easier for a model to solve than trying to find a solution directly for the larger problem.
Do you think this feature, i.e. “finding smaller chunks easier to solve”, comes from the dataset these are trained on, or is it more related to architecture components?
I feel it’s not related to the data or the architecture but to the process of reasoning in general. For these models, every predicted token conditions or drives the output in a certain direction. The semantic meaning of these tokens has a magnitude in solution space: let’s say ‘answer is 5’ is a very large step and an ‘and’ token is a very small one. If you are looking for a very specific answer, these smaller nudges from each token generation provide corrections to the direction. Imagine trying to click on a narrow button with high-sensitivity mouse settings: obviously you need to make many smaller moves, whereas with a big button maybe you can one-shot it.

The harder or more specific a task is, where the solution space is so narrow that it can’t possibly be one-shotted, the more you need to learn to take smaller steps and possibly revert if you feel the overall direction is bad. This is what RL is teaching the model here: response length increases (the model learns to take smaller steps, revert, etc.) along with performance. You reward the model if the solution is correct, and the model discovers that being cautious and evaluating many steps is the better approach. Personally I feel this is how we reason in general: taking smaller steps and being able to evaluate whether you are in a wrong position so you can backtrack. Einstein didn’t one-shot relativity, after all, and had to backtrack from who knows how many things.
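If you want to watch that length/performance trend yourself, a tiny logging helper is enough. This is just my own sketch, fed with whatever completions and rewards a training loop produces:

```python
# Sketch: log mean completion length and mean reward per training step to see
# whether longer, more cautious reasoning and accuracy rise together.
from statistics import mean
from typing import List

def log_step(step: int, completions: List[str], rewards: List[float]) -> None:
    avg_len = mean(len(c.split()) for c in completions)  # whitespace tokens as a crude length proxy
    avg_reward = mean(rewards)
    print(f"step={step:5d}  avg_len={avg_len:7.1f}  avg_reward={avg_reward:.3f}")
```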
> Then they give it a relatively small number of problems to solve (8k). They use a simple reinforcement learning loop where the only goal is to get the problem right. They don’t care how the model got the right answer, just that it’s correct.
I guess it only works if you select problems that are within reach of the model in the first place (but not too easy), so that there can actually be a positive feedback loop, right?
Yes, that’s kind of a given. The model has to have all the knowledge components to solve a task, so a capable base model is needed, and the only thing that’s being learned here is how to stitch that base knowledge together to plan an attack.
No amount of RL with a dumb base model would have worked, for example.
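One way to operationalize “within reach but not too easy” (just a sketch of how I’d curate the problems, not something from the post): estimate the base model’s pass rate on each problem with a few samples and keep the middle band. `sample_answers` is a hypothetical sampling function and the thresholds are made up.

```python
# Hypothetical curation step: keep problems the base model solves sometimes, but not always.
from typing import Callable, List, Tuple

def filter_by_pass_rate(
    problems: List[Tuple[str, str]],                   # (question, gold_answer) pairs
    sample_answers: Callable[[str, int], List[str]],   # hypothetical: k sampled answers from the base model
    k: int = 8,
    low: float = 0.0,
    high: float = 0.75,
) -> List[Tuple[str, str]]:
    kept = []
    for question, gold in problems:
        attempts = sample_answers(question, k)
        pass_rate = sum(a.strip() == gold for a in attempts) / k
        # > low: correct completions exist, so RL has something to reinforce (positive feedback is possible)
        # < high: the problem is not already trivial, so there is still signal to learn from
        if low < pass_rate < high:
            kept.append((question, gold))
    return kept
```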