> However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.
We've been running qualitative experiments on OpenAI o1 and QwQ-32B-Preview [1]. In those experiments, I'd say there were two primary things going against QwQ. First, QwQ went into endless repetitive loops, "thinking out loud" what it had said earlier, perhaps with a minor modification. We had to stop the model when that happened, and I feel that significantly hurt the user experience.
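To make "we had to stop the model" concrete, here's a minimal sketch of the kind of repetition guard you can wrap around a streamed response. The `generate_stream(prompt)` token iterator, the n-gram window, and the threshold are illustrative assumptions, not our actual harness:

```python
def detect_repetition(text: str, ngram: int = 8, max_repeats: int = 3) -> bool:
    """Return True if the trailing n-gram of `text` already appeared
    `max_repeats` or more times earlier in the output."""
    tokens = text.split()
    if len(tokens) < ngram:
        return False
    tail = tuple(tokens[-ngram:])
    count = sum(
        1
        for i in range(len(tokens) - ngram)
        if tuple(tokens[i : i + ngram]) == tail
    )
    return count >= max_repeats

def run_with_guard(generate_stream, prompt: str) -> str:
    """Stream tokens from a hypothetical `generate_stream` iterator and
    cut the run short once the model starts re-emitting itself."""
    output = []
    for token in generate_stream(prompt):
        output.append(token)
        if detect_repetition(" ".join(output)):
            break  # stop instead of looping forever
    return " ".join(output)
```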
It's great that DeepSeek-R1 fixes that.
The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.
Either way, I'm super excited that DeepSeek-R1 comes with an MIT license. This will notably increase how many people can evaluate advanced reasoning models.
The R1 GitHub repo is way more exciting than I had thought.
They aren't only open sourcing R1 as an advanced reasoning model. They are also introducing a pipeline to "teach" existing models how to reason and align with human preferences. [2] On top of that, they fine-tuned Llama and Qwen models using this pipeline, and they are open sourcing those fine-tuned models as well. [3]
This is *three separate announcements* bundled as one. There's a lot to digest here. Are there any AI practitioners who could share more about these announcements?
[2] We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.
[3] Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.
I see it in the "2. Model Summary" section (for [2]). In the next section, I see links to Hugging Face to download the DeepSeek-R1 Distill Models (for [3]).
Is o3 that much better than o1? It can solve that ARC-AGI benchmark thing at huge compute cost, but even with o1, the main attraction for me is that it can spit out giant blocks of code while following huge prompts.
I'm kinda ignorant, but I'm not sure in what way o3 is better.
> It can solve that ARC-AGI benchmark thing at huge compute cost
Considering DeepSeek-V3 was trained for $5-6M and their R1 API pricing is ~30x less than o1's, I wouldn't expect the huge compute cost to hold true for long. Also, it seems like OpenAI isn't great at optimization.
4o is more expensive than DeepSeek-R1, so…? Even if we take your premise as true and say their models are as good as DeepSeek's, that would just mean OpenAI is wildly overcharging its users.
Now OpenAI has no choice but to ship cheaper versions of o1 and o3. The alternative is everyone using R1 (self-hosted or via OpenRouter, Nebius AI, Together AI, and co).
I think open source AI has a solid chance of winning if the Chinese keep funding it with great abandon as they have been. Not to mention Meta of course, whose enthusiasm for data center construction shows no signs of slowing down.
> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.
This is probably the result of a classifier that determines, up front, whether the model has to go through the whole CoT. On tough problems it usually does; otherwise, it just answers as is. Many papers (the test-time-compute scaling one, and the MCTS one) have described this as a necessary strategy for producing good outputs across all kinds of inputs.
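A minimal sketch of that routing idea, assuming a stand-in keyword heuristic and placeholder answer paths (a real system would use a trained classifier, and nothing here is OpenAI's actual setup):

```python
# Illustrative gate: spend long chain-of-thought compute only when a
# cheap classifier judges the prompt to be hard.

HARD_MARKERS = ("prove", "integral", "optimize", "puzzle", "theorem")

def looks_hard(prompt: str) -> bool:
    """Stand-in difficulty classifier; real systems train a small model."""
    p = prompt.lower()
    return any(marker in p for marker in HARD_MARKERS)

def answer(prompt: str) -> str:
    if looks_hard(prompt):
        # Expensive path: long CoT / search before the final answer.
        return run_with_long_cot(prompt)
    # Cheap path: answer directly, e.g. "summarize this email".
    return run_direct(prompt)

def run_with_long_cot(prompt: str) -> str:   # placeholder
    return f"[deliberate reasoning] answer to: {prompt}"

def run_direct(prompt: str) -> str:          # placeholder
    return f"[direct] answer to: {prompt}"
```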
> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email.
The full o1 reasoning traces aren't available; you just have to guess what it is or isn't doing from the summary.
Sometimes you put in something like "hi" and it says it thought for 1 minute before replying "hello."
o1 layers: "Why did they ask me hello. How do they know who I am. Are they following me. We have 59.6 seconds left to create a plan on how to kill this guy and escape this room before we have to give a response....
... and after also taking out anyone that would follow thru in revenge and overthrowing the government... crap .00001 seconds left, I have to answer"
IMO this is the thing we should be scared of, rather than the paperclip-maximizer scenarios. If the human brain is a finitely complicated system, and we keep improving our approximation of it as a computer program, then at some point the programs must become capable of subjectively real suffering. Like the hosts from Westworld or the mecha from A.I. (the 2001 movie). And maybe (depending on philosophy, I guess) human suffering is _only_ real subjectively.
Yes, o1 hid its reasoning. Still, it also provided a summary of its reasoning steps. In the email case, o1 thought for six seconds, summarized its thinking as "summarizing the email", and then provided the answer.
We saw this in other questions as well. For example, if you asked o1 to write a "python function to download a CSV from a URL and create a SQLite table with the right columns and insert that data into it", it would immediately produce the answer. [4] If you asked it a hard math question, it would try dozens of reasoning strategies before producing an answer. [5]
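For the curious, a minimal sketch of the function that prompt describes, using only the standard library; storing every column as TEXT (rather than inferring types) is my own simplification, and this isn't o1's actual output:

```python
import csv
import io
import sqlite3
import urllib.request

def csv_url_to_sqlite(url: str, db_path: str, table: str) -> None:
    """Download a CSV from `url` and load it into `table` in the SQLite
    database at `db_path`, creating the table from the header row."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]

    # Quote column names; store everything as TEXT for simplicity.
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)

    with sqlite3.connect(db_path) as conn:
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', data
        )
```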
I think o1 does do that. It once spat out the name of the expert model for programming in its “inner monologue” when I used it. Click on the grey “Thought about X for Y seconds” label and you can see the internal monologue.
> Now for summarizing email itself it seems a bit more like a waste of compute
This is the line of thinking that led to 4o being embarrassingly unable to do simple tasks. The second you fall into the level of task OpenAI doesn't consider “worth the compute cost”, you get to see it fumble about, trying to do the task with poorly written Python code, and suddenly it can't even do basic things, like correctly counting items in a list, that OG GPT-4 would get right in a second.
[1] https://github.com/ubicloud/ubicloud/discussions/2608