
The real thing that surprises me (as a layman trying to get up to speed on this stuff) is that there's no "trick" to it. It really just does seem to be a textbook application of RL to LLMs.

Going from a base LLM to human instruction-tuned (SFT) ones is definitely an ingenious leap, where it's not obvious that you'd get anything meaningful. But once we quickly saw afterwards that prompting for chain of thought improved performance, why wasn't this the immediate next step that everyone took? It seems like even after the release of o1 the trick wasn't apparent to everyone, and if it weren't for DeepSeek people still might not have realized it.



> why wasn't this the immediate next step that everyone took?

It was actually tested by various labs, just probably not at this scale. The first model that featured RL prominently was DeepSeek-math-7b-RL, published last April. It was at the time the best model for math, and remained so until the qwen2.5-math series, which probably had way more data put into it.

There's a thing about RL that makes it tricky - the models tend to behave very stubbornly. That is, if they see something that resembles their training task (e.g. math problems), they'll solve the problem, and they'll be good at it. But if you want something close to that but not quite solving it (e.g. analyse this math problem and write hints, or here are 5 problems, extract the common methods used to solve them, etc.), you'll see that they perform very poorly, oftentimes just going straight into "to solve this problem we...".

This is even mentioned in the R1 paper: poor adherence to prompts, especially system prompts. So that is still challenging.


I think the issue with RL is that, in order for a model to perform well on a task, you have to make it stubborn. In the same way, a student who thinks outside the scope of the task might not perform well on a graded exam, but that does not mean he/she is a bad reasoner. With RL, and with any training procedure, you are creating a very focused thinker that is very fit to the task, which might not be useful in all applications (consider an open problem: it might need an out-of-the-box kind of thought).


Chain of thought prompting ("think step by step") only encourages the model to break the problem into steps, which allows it to incrementally build upon each step (since the output is fed back in as part of the input).
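A minimal sketch of that loop, just to make the "output fed back in as input" point concrete (generate() here is a stand-in for whatever sampling call you use, not a real API):

    def chain_of_thought(problem: str, generate, max_steps: int = 10) -> str:
        # Prime the model to produce intermediate steps.
        context = f"{problem}\nLet's think step by step.\n"
        for _ in range(max_steps):
            step = generate(context, stop="\n")  # model emits one step at a time
            context += step + "\n"               # the output becomes part of the next input
            if step.startswith("Answer:"):       # stop once a final answer appears
                break
        return context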

Reasoning requires more than chain of thought, since it's often not apparent what the next step should be - you (human, or model) may go down one path of reasoning only to realize it's going nowhere, and have to back up and try something else instead. This ability to "back up" - to realize that an earlier reasoning "step" was wrong and needs to be rethought - is what was mostly missing from models that (unlike o1, etc.) hadn't been trained for reasoning.

The reason non-reasoning models can't reason appears to be that this type of stream-of-consciousness thought (thinking out loud, mistakes and all) when trying to figure out a problem is hugely underrepresented in a normal training set. Most writing you find on the internet, or in other sources, is the end result of reasoning - someone figured something out and wrote about it - not the actual reasoning process (mistakes and all) that got them there.

It's still not clear what OpenAI had to do, if anything, to help bootstrap o1 (special hand-created training data?), but basically by using RL to encourage certain types of reasoning pattern, they were able to get the model to back up and self-correct when needed. DeepSeek-R may well have used o1 reasoning outputs as a bootstrap, but either way they were able to use RL training to encourage self-correcting reasoning in the same way.

One interesting aspect of DeepSeek-R is that they have shown that once you have a reasoning model, you can run it and use it to generate a bunch of reasoning outputs that can then be used as normal training data to fine-tune a non-reasoning model, even a very small one. This proves that, at least to some degree, the reason non-reasoning models couldn't reason is just because they had not been trained on sufficient self-correcting reasoning examples.
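As a rough sketch of that distillation step (reasoning_model, check_answer and the names below are placeholders for illustration, not DeepSeek's actual pipeline):

    def build_distillation_set(problems, reasoning_model, check_answer):
        examples = []
        for problem in problems:
            trace = reasoning_model.generate(problem)   # full reasoning trace, mistakes and all
            if check_answer(problem, trace):            # keep only traces whose final answer verifies
                examples.append({"prompt": problem, "completion": trace})
        return examples

The resulting examples then go through ordinary supervised fine-tuning (cross-entropy on the completions) of the smaller model - no RL involved at that stage.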


> since it's often not apparent what the next step should be

Backtracking assumes depth-first search, which isn't strictly needed as you could explore all possible options in parallel in a breadth-first manner, expanding them incrementally until one branch returns a satisfactory answer.
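Something like this, roughly (extend() samples a few one-step continuations from the model and is_satisfactory() is a verifier - both are placeholders):

    def breadth_first_reason(problem, extend, is_satisfactory, max_rounds=10, beam=8):
        frontier = [problem]
        for _ in range(max_rounds):
            next_frontier = []
            for branch in frontier:
                for candidate in extend(branch):     # grow each branch by one step
                    if is_satisfactory(candidate):
                        return candidate             # first finished branch wins
                    next_frontier.append(candidate)
            frontier = next_frontier[:beam]          # cap the fan-out per round
        return None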

> This proves that, at least to some degree, the reason non-reasoning models couldn't reason is just because they had not been trained on sufficient self-correcting reasoning examples.

For sure this is a big reason, and probably also part of the reason they hallucinate rather than say they don't know or aren't sure.


> Backtracking assumes depth-first search, which isn't strictly needed as you could explore all possible options in parallel in a breadth-first manner

You could in theory, but it'd be massively/prohibitively more expensive - exploring a whole tree just to end up using a single branch. It'd be like trying to have a computer play chess by evaluating EVERY branching position out to some fixed depth, rather than using MCTS to prune away unpromising lines.

Not only that, but reasoning in general (of which LLM-based reasoning is only a limited case) isn't just about search - it can also require exploration and acquisition of new knowledge if you run into an impasse that your past experience can't handle. If AI systems hope to achieve human-level AGI, they will need to change to support continuous learning and exploration as a part of reasoning, which naturally means more of a depth-first approach (continue until you hit an impasse) with backtracking.

You can say that hallucination is due to gaps in the training set, but of course humans don't have that problem because we know what we know, and have episodic memories of when/where/how we learned things. LLMs have none of that.


I've wondered this too; I really hope someone with more knowledge can comment. My impression is that people worked on this kind of thing for years before they started seeing a 'signal', i.e. before they actually got RL working to improve performance. But why is that happening now? What were the tricks that made it work?


If you check the failure section of their paper, they also tried other methods like MCTS and PRMs, which is what other labs have been obsessing over but couldn't move on from (that includes the big shots). The only team I'm aware of that tried verifiable rewards is tulu, but they didn't scale it up and just left it there.

This sort of thing, imo, is similar to what OpenAI did with the transformer architecture: Google invented it but couldn't scale it in the right direction, and DeepMind got busy with Atari games. They had all the pieces, yet it was still OpenAI that pulled it off. It seems to come down to research leadership and which methods they choose to invest in. But yeah, with the budgets the big labs have, they could easily try 10 different techniques and brute-force it all, but it seems like they are too opinionated about methods and not urgent enough about outcomes.

[paper] https://arxiv.org/pdf/2501.12948
[tulu] https://x.com/hamishivi/status/1881394117810500004


I found the following thread more insightful than my original comment (wish I could edit that one). A researcher explains why RL didn't work before this: https://x.com/its_dibya/status/1883595705736163727


Related: https://twitter.com/voooooogel/status/1884089601901683088#m

Also https://epoch.ai/gradient-updates/how-has-deepseek-improved-... has a summary of all the architectural improvements DeepSeek made to increase performance.


That's interesting. I suppose it could even be possible to test his theories: just apply the exact same training methodology to smaller models or slightly easier problems and study what happens.



DeepSeek only recently invented GRPO; it's possible that was the final missing piece needed to make it viable.


The group in this article used straight and simple PPO, so I guess GRPO isn't required.

My hypothesis is that everyone was just so stunned by oai's result that most just decided to blindly chase it and do what oai did (i.e. scaling up). And it was only after o1 that people started seriously trying other ideas.


I don't have any intuition here and am in no way qualified, but my read of the paper was that GRPO was mainly an optimization to reduce cost & GPUs when training (by skipping the need to keep another copy of the LLM in memory as the value network), but otherwise any RL algorithm should have worked? I mean it seems R1 uses outcome rewards only and GRPO doesn't do anything special to alleviate reward sparsity, so it feels like it shouldn't affect viability too much.
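For what it's worth, my understanding of the group-relative part is just this (simplified - the real objective also has the PPO-style clipped ratio and a KL penalty):

    import statistics

    def grpo_advantages(rewards):
        # Sample a group of completions for one prompt, score each with the
        # outcome reward, and use the group-normalized reward as the advantage
        # in place of a learned value baseline.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
        return [(r - mean) / std for r in rewards]

    # e.g. 8 sampled answers to one prompt, reward 1.0 when the final answer checks out:
    grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])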

Also, on the note of RL optimizers, if anyone here is familiar with this space, can they comment on how the recently introduced PRIME [1] compares to PPO directly? Their description is confusing, since the "implicit PRM" they introduce, which is trained alongside the policy network, seems no different from the value network of PPO.

[1] https://github.com/PRIME-RL/PRIME


The tulu team saw it, but yes, nobody scaled it to the extent DeepSeek did. I am surprised that the FAANG labs, which have the best of the best, didn't see this.


> I am surprised that the FAANG labs, which have the best of the best, didn't see this.

After so many layoff rounds, they might have got stuck with the best at avoiding it.


How do we know that they didn't see it? Their work is much more secret now. Isn't it possible that o1 and o3 rely on something similar, maybe with some additions? Same for the Gemini thinking models.

My point is that OpenAI and Google might have been working with very similar approaches for months.


Agreed, that was an error on my part to phrase it like that. More evidence suggests that they were working on similar stuff, but now the cat is out of the bag and open source got a win.


I wonder if OpenAI did the same thing, or if they instead took the approach of manually building an expensive, human-designed supervised learning dataset for reasoning. If the latter, they must be really kicking themselves now.


I'd bet $5 that o1 was also built with either RL or search, or a combination of the two. That was what I initially thought when they announced o1-preview, after I saw the sample reasoning traces.

But alas I am just an ML enthusiast, not a member of some lab with access to GPUs.



I think a lot of it had to do with DeepSeek's need to use as few resources as possible - asking why it is done this way, and how it can be done in fewer steps with fewer resources. Whereas most of the FAANG labs were looking at throwing more data and processing power at it.


This was my takeaway as well; the paper was so simple I was shocked by it. We've been doing RL on LLMs for a while now, and it's more surprising this didn't happen sooner.



