The way you taught chain-of-thought before was with supervised fine-tuning (SFT). During training, you have to rate every sentence of reasoning the model writes, many times, to nudge it to reason correctly.
But this approach to teaching chain-of-thought doesn’t do that. In this post, they take a small model (7B) that already knows math. Then they give it a relatively small number of problems to solve (8k). They use a simple reinforcement learning loop where the only goal is to get the problem right. They don’t care how the model got the right answer, just that it’s correct.
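Roughly, a loop like that might look like the sketch below. This is just a minimal illustration under assumptions, not the post’s actual code: `generate` and `update_policy` are hypothetical callables (the update could be PPO/GRPO-style), and `extract_final_answer` is a toy heuristic.

```python
import random
import re
from typing import Callable, List, Tuple

def extract_final_answer(completion: str) -> str:
    """Crude heuristic: take the last number in the completion as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def outcome_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0 -- the reasoning itself is never scored."""
    return 1.0 if extract_final_answer(completion) == gold_answer else 0.0

def rl_loop(
    problems: List[Tuple[str, str]],                       # (question, gold_answer) pairs, ~8k of them
    generate: Callable[[str], List[str]],                  # hypothetical: sample several completions from the current policy
    update_policy: Callable[[str, List[str], List[float]], None],  # hypothetical policy-gradient step
    num_steps: int = 1_000,
) -> None:
    for _ in range(num_steps):
        question, gold = random.choice(problems)
        completions = generate(question)
        rewards = [outcome_reward(c, gold) for c in completions]
        # Only correctness flows into the update; any chain-of-thought that
        # helps the model land on the right answer gets reinforced as a side effect.
        update_policy(question, completions, rewards)
```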
This is part of the “recipe” that DeepSeek used to create R1.
After many iterations, just like DeepSeek, they found that the model has an “aha” moment. It starts emitting chains of thought where it didn’t before. And then it starts getting the math answers right.
This is the gist of it. I don’t fully understand the recipe involved.
Can you teach small models to “think” just using RL? How small can they be? What tasks does this work for? Is just RL best for this? RL+SFT? Everyone’s trying to figure it out.
That's a nice explanation. Are there any insights so far in the field about why chain of thought improves the capability of a model? Does it provide the model with more working memory or something in the context itself?
I don’t think there’s consensus. Some papers have shown that just giving the model more tokens improves the results, i.e. chain of thought allows more computation to happen and that alone is enough. Others have argued that the smaller steps themselves are easier to solve, and thus it’s easier to reach the right answer.
I think CoT is important because it’s _free_. You adjust the prompt (model input) after training is done; no further training is needed. Magically, this seems to work. That makes it hard to beat even if you did have a clear understanding of the mechanism at a more fundamental level.
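To make the “free” part concrete, here’s a toy illustration (my own, with `ask_model` as a hypothetical stand-in for whatever inference call you use): the only difference between plain prompting and CoT prompting is a line of text appended to the input.

```python
# Chain-of-thought prompting changes only the input text, never the weights.
# `ask_model` is a hypothetical stand-in for an inference API.

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

direct_prompt = f"{question}\nAnswer:"
cot_prompt = f"{question}\nLet's think step by step."

# answer_direct = ask_model(direct_prompt)  # model must jump straight to a number
# answer_cot = ask_model(cot_prompt)        # reasoning tokens buy extra computation before the answer
```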
Chain of thought breaks a problem down into smaller chunks, which are easier for a model to solve than trying to find a solution directly for the larger problem.
Do you think this feature, i.e. “finding smaller chunks easier to solve”, comes from the dataset these are trained on, or is it more related to architecture components?
I feel it’s not related to the data or the architecture but to the process of reasoning in general. For these models, every predicted token conditions or drives the output in a certain direction. The semantic meaning of these tokens has a magnitude in solution space: let’s say ‘answer is 5’ is a very large step and an ‘and’ token is a very small one. If you are looking for a very specific answer, these smaller nudges from each token generation provide corrections to the direction. Imagine trying to click on a narrow button with high-sensitivity mouse settings: obviously you need to make many smaller moves, whereas with a big button maybe you can one-shot it.

The harder or more specific a task is, where the solution space is so narrow that it can’t possibly be one-shotted, the more you need to learn to take smaller steps and possibly revert if you feel the overall direction is bad. This is what RL is teaching the model here: response length increases (the model learns to take smaller steps, revert, etc.) along with performance. You reward the model if the solution is correct, and the model discovers that being cautious and evaluating many steps is the better approach. Personally I feel this is how we reason in general: taking smaller steps and being able to evaluate whether you are in a wrong position so you can backtrack. Einstein didn’t one-shot relativity, after all, and had to backtrack from who knows how many things.
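If you want to watch that length/performance trend yourself, a tiny logging helper is enough. This is just my own sketch, fed with whatever completions and rewards a training loop produces:

```python
# Sketch: log mean completion length and mean reward per training step to see
# whether longer, more cautious reasoning and accuracy rise together.
from statistics import mean
from typing import List

def log_step(step: int, completions: List[str], rewards: List[float]) -> None:
    avg_len = mean(len(c.split()) for c in completions)  # whitespace tokens as a crude length proxy
    avg_reward = mean(rewards)
    print(f"step={step:5d}  avg_len={avg_len:7.1f}  avg_reward={avg_reward:.3f}")
```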
> Then they give it a relatively small number of problems to solve (8k). They use a simple reinforcement learning loop where the only goal is to get the problem right. They don’t care how the model got the right answer, just that it’s correct.
I guess it only works if you select problems that are within reach of the model in the first place (but not too easy), so that there can actually be a positive feedback loop, right?
Yes, that’s kind of a given. The model has to have all the knowledge components to solve a task, so a capable base model is needed, and the only thing that’s being learned here is how to stitch that base knowledge together to plan an attack.
No amount of RL with a dumb base model would have worked, for example.
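One way to operationalize “within reach but not too easy” (just a sketch of how I’d curate the problems, not something from the post): estimate the base model’s pass rate on each problem with a few samples and keep the middle band. `sample_answers` is a hypothetical sampling function and the thresholds are made up.

```python
# Hypothetical curation step: keep problems the base model solves sometimes, but not always.
from typing import Callable, List, Tuple

def filter_by_pass_rate(
    problems: List[Tuple[str, str]],                   # (question, gold_answer) pairs
    sample_answers: Callable[[str, int], List[str]],   # hypothetical: k sampled answers from the base model
    k: int = 8,
    low: float = 0.0,
    high: float = 0.75,
) -> List[Tuple[str, str]]:
    kept = []
    for question, gold in problems:
        attempts = sample_answers(question, k)
        pass_rate = sum(a.strip() == gold for a in attempts) / k
        # > low: correct completions exist, so RL has something to reinforce (positive feedback is possible)
        # < high: the problem is not already trivial, so there is still signal to learn from
        if low < pass_rate < high:
            kept.append((question, gold))
    return kept
```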