
The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.

Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.

As far as I know, DeepSeek adds only a little on top of the standard transformer architecture, while o1/o3 reportedly added a special "reasoning component" - if DeepSeek is as good as o1/o3, even if it took training data from them, then it seems the reasoning component isn't needed.




> I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model

Distillation is a term of art in AI and it is fundamentally incorrect to talk about distilling human-created data. Only an AI model can be distilled.

https://en.m.wikipedia.org/wiki/Knowledge_distillation#Metho...
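
For concreteness: distillation in the term-of-art sense means training the student to match the teacher's full output distribution, not its sampled text. A toy PyTorch sketch of the standard loss (illustrative only, shapes and names made up):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # The student is pulled toward the teacher's *softened distribution*,
        # which you can only compute if you can actually run the teacher.
        t = temperature
        soft_targets = F.softmax(teacher_logits / t, dim=-1)
        student_log_probs = F.log_softmax(student_logits / t, dim=-1)
        # t^2 keeps gradient magnitudes comparable across temperatures
        # (Hinton et al., 2015).
        return F.kl_div(student_log_probs, soft_targets,
                        reduction="batchmean") * t * t

    teacher_logits = torch.randn(8, 32000)  # one token position, toy 32k vocab
    student_logits = torch.randn(8, 32000, requires_grad=True)
    distillation_loss(student_logits, teacher_logits).backward()

The point being: you can't compute that loss through a chat API, which returns sampled text (and at most a handful of top logprobs), not the full distribution.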


Meh,

It seems clear that the term can be used informally to mean the boiling down of human knowledge; indeed, it was used that way long before AI entered the popular imagination.


In the context in which you said it, it matters a lot.

>> The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.

> Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.

If DeepSeek was produced through distillation (term of art) of o1, then the cost of producing DeepSeek is strictly higher than the cost of producing o1, and that cost can't be avoided: cost(DeepSeek) = cost(o1) + cost(distillation) > cost(o1).

Continuing this argument: if the premise is true, then DeepSeek can't be significantly improved without first producing a very expensive, hypothetical o1-next model from which to distill better knowledge.

That is the argument that is being made. Please avoid shallow dismissals.

Edit: just to be clear, I doubt that DeepSeek was produced via distillation (term of art) of o1, since that would require access to o1's weights. It may have used some of o1's outputs to fine-tune the model, which would still mean that the cost of training DeepSeek is strictly higher than that of training o1.


> just to be clear, I doubt that DeepSeek was produced via distillation

Yeah, your technical point is kind of ridiculous here: in all my uses of distillation (and in the comment I quoted), distillation is used in the informal sense, and there's no allegation that DeepSeek was ever in possession of OpenAI's model weights, which is what your "distillation (term of art)" would require.


I’m not sure why folks don’t speculate that China is able to obtain copies of OpenAI's weights.

Seems reasonable they would be investing heavily in placing state assets within OpenAI so they can copy the models.


Because it feeds conspiracy theories and because there's no evidence for it? Also, let's talk DeepSeek in particular, not "China".

Looking back at the article, it does use "distillation" in the special/"term of art" sense, but incorrectly. I.e., it's not actually speculating that DeepSeek obtained OpenAI's weights and distilled them down, but rather that it used OpenAI's answers/outputs as training data, which is a different method with its own term of art (supervised fine-tuning on synthetic data).
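
What's actually being speculated looks more like plain supervised fine-tuning on teacher-generated text, where black-box API access is enough. A toy sketch (the "teacher" here is a fake stand-in for API calls, and the student is a deliberately tiny model):

    import torch
    import torch.nn.functional as F

    # Toy stand-ins: a character-level "student" LM and a fake teacher API.
    # In the speculated scenario the teacher is o1 behind an HTTP API, so
    # only its sampled text is available, never its weights or logits.
    VOCAB = 128  # ASCII

    class TinyLM(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = torch.nn.Embedding(VOCAB, 64)
            self.rnn = torch.nn.GRU(64, 64, batch_first=True)
            self.head = torch.nn.Linear(64, VOCAB)

        def forward(self, ids):
            h, _ = self.rnn(self.embed(ids))
            return self.head(h)

    def query_teacher(prompt: str) -> str:
        return prompt.upper()  # placeholder for an API call to the teacher

    def sft_step(student, optimizer, prompt, completion):
        # Hard targets: train on the teacher's sampled tokens with the
        # ordinary next-token cross-entropy loss.
        ids = torch.tensor([[ord(c) % VOCAB for c in prompt + completion]])
        logits = student(ids[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, VOCAB), ids[0, 1:])
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()

    student = TinyLM()
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for prompt in ["hello", "world"]:
        sft_step(student, opt, prompt, query_teacher(prompt))

Note the targets are the teacher's sampled tokens (hard labels), not its probability distribution; that's exactly the difference from distillation proper.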


Some info that may be missing:

- v2/v3 (not r1) seem to be cloned from o1/4o outputs, and perform worse (this is the step that cost the oft-repeated ~5 million USD)

- r1 is specifically a reasoning step (using RL) _on top of_ v2/v3 and performs similarly to o1 (the cost of this step is _not reported anywhere_); a rough sketch of what that RL step looks like follows below

- In the o1 blog post, they specifically say they use RL to add reasoning to LLMs: https://openai.com/index/learning-to-reason-with-llms/
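
For anyone wanting a concrete picture of that RL step, here is a toy GRPO-style update (GRPO is the method the R1 paper describes), with every component reduced to a stand-in. Not DeepSeek's actual code, just the shape of the idea:

    import torch

    # The "policy" picks a single answer token; a real run samples whole
    # chain-of-thought sequences and sums token log-probs.
    policy = torch.nn.Linear(16, 10)  # stand-in for the base LLM
    opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

    def verifiable_reward(answer: int, target: int) -> float:
        # R1-Zero style: reward comes from checking the answer
        # (e.g. a math verifier), not from a learned reward model.
        return 1.0 if answer == target else 0.0

    def grpo_step(prompt_vec, target, group_size=8):
        dist = torch.distributions.Categorical(logits=policy(prompt_vec))
        samples = dist.sample((group_size,))  # G rollouts per prompt
        rewards = torch.tensor(
            [verifiable_reward(int(s), target) for s in samples])
        # Group-relative advantage: normalize rewards within the group,
        # so no separate value network is needed.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        # Plain policy-gradient objective (full GRPO adds clipping and a
        # KL penalty against the base model, omitted here).
        loss = -(dist.log_prob(samples) * adv).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    for _ in range(200):
        grpo_step(torch.randn(16), target=3)

The thing to notice is how little machinery this adds on top of an existing model: sample a group of answers, check them, and push probability toward the ones that check out.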


The R1-Zero paper shows how many training steps the RL took, and it's not many. The cost of the RL is likely a small fraction of the cost of the foundational model.



