
Can someone summarize the upshot for people here?


I’ll give a “wtf does this mean” view.

We have observed that LLMs can perform better on hard tasks like math if we teach them to “think about” the problem first. The technique is called “chain-of-thought”. The language model is taught to emit a series of sentences that break a problem down before answering it. OpenAI’s o1 works this way, and performs well on benchmarks because of it.
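For a made-up example: asked “What is 17 × 24?”, a chain-of-thought model will write out something like “17 × 20 = 340, 17 × 4 = 68, 340 + 68 = 408” before giving the final answer, instead of answering in one shot.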

To train a model to do this, you need to show it many examples of correct chains of thought. These are expensive to produce and it’s expensive to train models on them.

DeepSeek discovered something surprising. It turns out, you don’t need to explicitly train a model to produce a chain of thought. Instead, under the right conditions, models will learn this behavior emergently. They found a way for a language model to learn chain of thought very cheaply, and then released that model as open source.

Thought chains turn out to be extremely useful. And now that they’re cheap and easy to produce, we are learning all the different ways they can be put to use.

Some of the open questions right now are:

- Can we teach small models to learn chain-of-thought? (yes) How cheaply? On which tasks?

- Can we generate thought chains and just copy/paste them into the prompts of other models? (yes) Which domains does this work for? How well does it generalize?

That’s what this post is going after.


But how exactly does it emerge? What did they do to make that happen vs. previous training runs?


Why this behavior emerges is an active area of research. What they did was use reinforcement learning; this blog post replicates those findings. The “recipe” is detailed in the R1 paper.


Can you explain the RL part?


The way you taught chain-of-thought before was with supervised fine-tuning (SFT). During training, you show the model many example chains of reasoning, written or verified by humans, and nudge it, sentence by sentence, to imitate them.

But this approach to teaching chain-of-thought doesn’t do that. In this post, they take a small model (7B) that already knows math. Then they give it a relatively small number of problems to solve (8k). They use a simple reinforcement learning loop where the only goal is to get the problem right. They don’t care how the model got the right answer, just that it’s correct.
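
Roughly, the reward is nothing more than “did the final answer match?”. A toy sketch in Python (illustrative, not the post’s actual code; `generate` and `policy_update` in the commented-out loop are hypothetical stand-ins for whatever RL library you use):

  import re

  # Toy sketch of an outcome-only reward, R1-Zero style (illustrative, not the post's code).
  def extract_final_answer(completion):
      # Assumes the model is prompted to end with "Answer: <number>".
      m = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
      return m.group(1) if m else None

  def reward(completion, gold_answer):
      # 1.0 if the final answer matches the ground truth, 0.0 otherwise.
      # The intermediate reasoning is never graded.
      return 1.0 if extract_final_answer(completion) == gold_answer else 0.0

  # Hypothetical training loop (generate() and policy_update() stand in for the
  # sampling and GRPO/PPO-style update an RL library would provide):
  # for problem, gold in math_dataset:
  #     samples = [generate(model, problem) for _ in range(8)]
  #     rewards = [reward(s, gold) for s in samples]
  #     policy_update(model, problem, samples, rewards)

The only signal the model ever gets is that 0/1 at the end; the longer and longer “thinking” before the answer emerges on its own.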

This is part of the “recipe” that DeepSeek used to create R1.

After many iterations, just like DeepSeek, they found that the model has an “aha” moment. It starts emitting chains of thought where it didn’t before, and then it starts getting the math answers right.

This is the gist of it. I don’t fully understand the recipe involved.

Can you teach small models to “think” just using RL? How small can they be? What tasks does this work for? Is just RL best for this? RL+SFT? Everyone’s trying to figure it out.


That's a nice explanation. Are there any insights so far in the field about why chain of thought improves the capability of a model? Does it provide the model with more working memory or something in the context itself?


I don’t think there’s consensus. Some papers have shown that just giving the model more tokens improves results, i.e. chain of thought allows more computation to happen, and that alone is enough. Others have argued that the smaller steps themselves are easier to solve, and thus it’s easier to reach the right answer.

I think CoT is important because it’s _free_. You adjust the prompt (the model’s input) after training is already done, and magically this seems to work. That makes it hard to beat, even if you did have a clear understanding of the mechanism at a more fundamental level.
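
The classic zero-shot version is literally one extra sentence in the prompt. Something like this (illustrative; `complete` is a placeholder for whatever completion API you call):

  # Zero-shot chain-of-thought prompting: no retraining, just a different input.
  question = "A train leaves at 3:40pm and the trip takes 85 minutes. When does it arrive?"

  direct_prompt = f"{question}\nAnswer:"
  cot_prompt = f"{question}\nLet's think step by step."

  # answer_direct = complete(direct_prompt)  # often wrong on multi-step arithmetic
  # answer_cot = complete(cot_prompt)        # extra tokens = extra computation before the answer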


Chain of thought breaks a problem down into smaller chunks, which are easier for a model to solve than trying to find a solution to the larger problem directly.


Do you think this feature, i.e. ‘finding smaller chunks easier to solve’, comes from the dataset these models are trained on, or is it more related to architectural components?


I feel it’s not related to the data or the architecture but to the process of reasoning in general. For these models, every predicted token conditions, or drives, the output in a certain direction. The semantic meaning of each token has a magnitude in solution space: let’s say ‘the answer is 5’ is a very large step and the token ‘and’ is a very small one. If you are looking for a very specific answer, these smaller nudges from each token generation provide corrections to the direction. Imagine trying to click on a narrow button with high-sensitivity mouse settings: obviously you need to make many smaller moves, whereas with a big button maybe you can one-shot it. The harder or more specific a task is, where the solution space is so narrow that it can’t possibly be one-shotted, the more you need to learn to take smaller steps and possibly revert if you feel the overall direction is bad.

This is what RL is teaching the model here: response length increases (the model learns to take smaller steps, revert, etc.) along with performance. You reward the model if the solution is correct, and the model discovers that being cautious and evaluating many steps is the better approach. Personally, I feel this is how we reason in general: taking smaller steps and being able to evaluate whether you are in a wrong position so you can backtrack. Einstein didn’t one-shot relativity, after all, and had to backtrack from who knows how many things.


> Then they give it a relatively small number of problems to solve (8k). They use a simple reinforcement learning loop where the only goal is to get the problem right. They don’t care how the model got the right answer, just that it’s correct.

I guess it only works if you select problems that are within reach of the model in the first place (but not too easy), so that there can actually be a positive feedback loop, right?


Yes, that’s kind of a given. The model has to have all the knowledge components to solve a task, so a capable base model is needed, and the only thing being learned here is how to stitch that base knowledge together to plan an attack. No amount of RL with a dumb base model would have worked, for example.


Hmm... doesn't not caring how it got the answer right create the exact same problem as fine-tuning?


The DeepSeek R1 paper explains how they trained their model in enough detail that people can replicate the process. Many people around the world are doing so, using various sizes of models and training data. Expect to see many posts like this over the next three months. The attempts that use small models will get done first. The larger models take much longer.

Small R1-style models are pretty limited, so this is interesting primarily from an “I reproduced the results” point of view, not a “here is a new model that’s useful” one.


From the DeepSeek R1 paper:

  For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community.

The impression I got from the paper, although I don't think it was explicitly stated, is that they think distillation will work better than training the smaller models using RL (as OP did).
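
For what it’s worth, distillation here just means plain SFT on reasoning traces sampled from the big RL-trained model. A rough sketch of that, under my own assumptions (`student`/`tokenizer` are a small Hugging Face causal LM and `distilled_pairs` are (prompt, teacher trace) pairs already sampled from R1; this is not DeepSeek’s actual pipeline):

  import torch

  def sft_loss(student, tokenizer, prompt, teacher_trace) -> torch.Tensor:
      # Plain next-token prediction on the teacher's reasoning trace.
      ids = tokenizer(prompt + teacher_trace, return_tensors="pt").input_ids
      out = student(input_ids=ids, labels=ids)  # HF causal LMs return the LM loss when labels are given
      return out.loss

  # for prompt, trace in distilled_pairs:  # hypothetical pairs sampled from the teacher
  #     sft_loss(student, tokenizer, prompt, trace).backward()
  #     optimizer.step(); optimizer.zero_grad()

No reward signal anywhere, which is the point of the quoted paragraph: the small model just imitates the traces.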


> We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models

I found this statement from the paper to be at odds with what you cited, but I guess they mean SFT+RL would be better than either SFT or RL alone.


I think they're saying that some reasoning patterns which large models can learn using only RL (i.e. without the patterns existing in the training data) can't be learned by smaller models in the same way. They have to be 'taught' through examples provided during SFT.



