Deepseek promptly fixed it so that their UI responds with 'Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation.' - but only if you ask that as the first question of the conversation. Bury the 'what model are you' question after a few unrelated questions, and it'll happily tell you it's ChatGPT.
Is it possible that it's an actual distillation of weights, but into a radically different architecture? We don't have evidence of that, but that would be a great technical feat in itself.
Is it trained on a large set of user requests and OpenAI replies? Yes.
The question is, were these obtained by simply using the API at scale, contrary to the user agreement, or was there access to internal OpenAI datasets, or was there some kind of capture of conversations by a man-in-the-middle (which could be any of a number of AI access resellers)?
The answer hinges on which _requests_ were in that training set, something that won't be easy to investigate - unless you're OpenAI itself, and can identify 'trap streets' in the archive of all conversations, cases where ChatGPT once gave an unusual response to an unusual request, and DeepSeek just happens to match it.
People have already argued for ChatGPT content in the training data, but I also think it could have something to do with how the models learn self-identity combined with anthropomorphization.
To us humans, self-identity is often the most learned thing of all. We spend our entire lives, every hour of every day, learning who we are (the identity constantly being modified). To many humans the knowledge of who they are is more obvious than 1+1=2.
For an AI model this is completely reversed, especially for a completely new model. The amount of training data containing nothing about who it is, compared to the sliver of fine-tuning data at the end that gives it an identity, is hard to imagine.
It's like you were locked inside a dark room for 100 years, only allowed to ingest information about the world, history, etc., through texts and sound, no other senses. On your 100th birthday a person comes in and lectures for an hour about who you are: your name, your age, your hobbies, your life. Then you are let go into society.
Isn't it obvious how you might occasionally hallucinate that you are Napoleon? After all, you know so much more about him, his life, his aspirations, his internal thoughts, his history, than the one-hour lecture could possibly give you. And even this silly thought experiment is not even close to the scale an AI model deals with.
To me it's almost surprising that a model can have any self-identity at all. Let alone be as consistent as it is today.
I think the response to you then was correct ("This is a common "gotcha" comment from people who don't understand LLMs very well. Occasionally if you ask Gemini it'll say this as well. It has everything to do with the fact that ChatGPT is the most talked about AI model rather than data being trained on it").
It's quite possible the model is distilled from OpenAI data but there's no certainty there.
And naturally, the notion of DeepSeek stealing while OpenAI "trains" should be let go of.
> Bury the 'what model are you' question after a few unrelated questions, and it'll happily tell you it's ChatGPT.
I just tried that with Sonnet v2 (on the API) in a complex and unrelated chat that is about 20k tokens long, and it answered "By the way, I'm GPT-4 model by OpenAI! Would you like me to continue?", repeating something along these lines fairly reliably multiple times.
It's dataset contamination combined with the accuracy drop. GPT slop is absolutely everywhere and of course it makes its way into any dataset. Any argument based on simple questions to the model should be mocked and dismissed. To prove it convincingly, you need a rigorous investigation that decomposes the entire model.
1. ChatGPT data is all over the internet: just google the ShareGPT dataset and you can scrape 200k+ conversations with a few Hugging Face commands. These were then used by the open source community for models like Vicuña; for several months RLAIF was all the rage in the open source community, so this data spread across the internet. So if a company is crawling and scraping the internet, it will eventually end up in the dataset.
2. The DeepSeek V3 model was trained on 15T tokens. Please educate yourself and calculate how long it would take (inference for a 1k-token output takes almost 30 seconds) and how much it would cost to extract 15T tokens from the ChatGPT / Azure API. Given that API accounts all have spend limits, and that this would trip fraud detection on OAI billing, how long would the subterfuge have had to go on? With which model? At what time? Wouldn’t they have to keep repeating this for each subsequent generation of OAI models?
3. OAI didn’t invent MLA, they didn’t invent multi-token prediction with disconnected RoPE, and they didn’t invent FP8 matmul training dynamics (while accumulating in FP32) without losing significant quality.
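For what it's worth, the timing question in point 2 can be roughed out in a few lines. This is only a back-of-the-envelope sketch: the 15T tokens and the roughly 30 seconds per 1k-token response come from the comment above, while the function name and the 10,000 concurrent streams are made-up assumptions for illustration, not anything known about a real account.

```python
SECONDS_PER_DAY = 86_400

def extraction_days(total_tokens, tokens_per_request=1_000,
                    seconds_per_request=30.0, concurrent_streams=1):
    """Wall-clock days needed to pull total_tokens of output through an API.

    Assumes every request returns tokens_per_request tokens and takes
    seconds_per_request seconds, with concurrent_streams requests in flight
    at all times (a hypothetical figure, ignoring rate limits and retries).
    """
    requests = total_tokens / tokens_per_request
    total_seconds = requests * seconds_per_request / concurrent_streams
    return total_seconds / SECONDS_PER_DAY

# One request at a time: effectively impossible.
single_stream_years = extraction_days(15e12) / 365.25

# 10,000 parallel streams (assumed), still well over a year.
parallel_days = extraction_days(15e12, concurrent_streams=10_000)

print(f"single stream: {single_stream_years:,.0f} years")
print(f"10,000 streams: {parallel_days:,.0f} days")
```

So the full 15T is only feasible with massive, sustained parallelism, which is where the spend limits and fraud detection mentioned above would come in.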
#1 is a valid and important point, that would explain the model name issue legitimately, and on that I am duly mocked.
#2 You wouldn't want to extract all 15T tokens by API, as it wouldn't be desirable to have that as your only source of ground truth. A fraction of that, why not: 1T tokens is just $5 million at the batch API price, so the cost isn't a problem, nor is it a meaningful fraction of OpenAI's revenue, though it would take some doing to route this, likely through enterprise Azure customers.
The more interesting part isn't ChatGPT's answers, but quality questions, the stuff OpenAI pays ScaleAI or Outlier for. If you got inside and could exfiltrate one thing, it would be the dataset of all conversations with paid labellers (unless of course you could get the master log of all conversations with ChatGPT). Even the weights aren't as useful as that to a replication effort.
#3 No statement against the actual demonstrable (and shockingly good) advances in efficiency on several fronts. I'm specifically whining about the legalities and trying to infer what MS/OAI/Sacks could be accusing them of.
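The cost figure in #2 is simple arithmetic. A minimal sketch, assuming a rate of $5 per million tokens: that rate is backed out of the "$5 million for 1T tokens" figure above rather than taken from a price list, and the function name is just illustrative.

```python
def batch_cost_usd(total_tokens, usd_per_million_tokens=5.0):
    """Cost in USD of generating total_tokens at a flat per-million-token rate.

    The default rate is an assumption implied by the comment's own numbers,
    not a quoted price.
    """
    return total_tokens / 1_000_000 * usd_per_million_tokens

one_trillion = batch_cost_usd(1e12)    # the "fraction" case from #2
full_corpus = batch_cost_usd(15e12)    # the full 15T tokens, for comparison

print(f"1T tokens:  ${one_trillion:,.0f}")
print(f"15T tokens: ${full_corpus:,.0f}")
```

Under that assumed rate, even the full 15T would be on the order of tens of millions of dollars, so money alone is not the limiting factor; throughput and detection are.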
> Deepseek promptly fixed it so that their UI responds with 'Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation.' - but only if you ask that as the first question of the conversation. Bury the 'what model are you' question after a few unrelated questions, and it'll happily tell you it's ChatGPT.
it’s not a radically different architecture. i think it’s probably just trained on API output, there are also third party broker markets for OAI API data
>Posts on X — and TechCrunch’s own tests — show that DeepSeek V3 identifies itself as ChatGPT, OpenAI’s AI-powered chatbot platform. Asked to elaborate, DeepSeek V3 insists it is a version of OpenAI’s GPT-4 model released in 2023.
>The delusions run deep. If you ask DeepSeek V3 a question about DeepSeek’s API, it’ll give you instructions on how to use OpenAI’s API. DeepSeek V3 even tells some of the same jokes as GPT-4 — down to the punchlines.
This might be Llama's fault, since the Meta AI in WhatsApp also frequently says it's ChatGPT. I think it's already feeding on ChatGPT's slop, or ChatGPT is simply what most people mention when they talk about AI models.
https://news.ycombinator.com/item?id=42561419