1. ChatGPT data is widely available on the internet. Just Google the ShareGPT dataset and you can scrape 200k+ conversations with a few Hugging Face commands. These were used by open-source projects like the Vicuna models; there was a period of several months when RLAIF was all the rage in the open-source community, so this data populated the internet. If a company is crawling and scraping the web, this will eventually end up in its dataset.
2. The DeepSeek-V3 model was trained on 15T tokens. Please educate yourself: calculate how long (in latency, inference for a 1k-token output takes almost 30 seconds) and how much it would cost to extract 15T tokens from the ChatGPT / Azure API. Given that API accounts all have spend limits and would trip fraud detection in OpenAI's billing, how long would the subterfuge have had to go on? With which model? At what time? Wouldn't they have to keep repeating it for each subsequent generation of OpenAI models?
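A back-of-envelope sketch of the scale involved. The per-token price and concurrency here are assumed illustrative values, not OpenAI's actual figures; the 30s-per-1k-tokens latency is the number quoted above:

```python
# Back-of-envelope: time and cost to pull 15T tokens through an API.
# PRICE_PER_MTOK and PARALLEL_REQUESTS are assumptions for illustration.
TOKENS_NEEDED = 15e12          # 15T tokens, the V3 pretraining budget
PRICE_PER_MTOK = 10.0          # assumed $ per 1M output tokens
SECONDS_PER_1K_TOK = 30.0      # latency figure quoted above
PARALLEL_REQUESTS = 1_000      # assumed sustained concurrency

cost_usd = TOKENS_NEEDED / 1e6 * PRICE_PER_MTOK
wall_clock_days = (TOKENS_NEEDED / 1e3) * SECONDS_PER_1K_TOK \
                  / PARALLEL_REQUESTS / 86_400

print(f"cost: ${cost_usd:,.0f}")             # cost: $150,000,000
print(f"time: {wall_clock_days:,.0f} days")  # time: 5,208 days
```

Even with a thousand requests running flat out around the clock, that's on the order of 14 years of continuous extraction, which is the comment's point.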
3. OAI didn't invent MLA, they didn't invent multi-token prediction or decoupled RoPE, and they didn't invent FP8 matmul training dynamics (while accumulating in FP32) without losing significant quality.
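To make the last of those concrete, here is a toy pure-Python sketch of the FP8-matmul-with-FP32-accumulation idea: inputs are rounded to a 3-bit mantissa (mimicking FP8 E4M3), but the products are summed at full precision. This is my own simplification; real E4M3 also clamps the exponent range and handles special values, which this ignores.

```python
import math

def quantize_e4m3(x):
    # Crude simulation of FP8 E4M3: keep 1 implicit + 3 explicit
    # mantissa bits (assumption: no exponent clamping, no NaN/inf).
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)           # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2 ** 4
    return math.ldexp(round(m * scale) / scale, e)

def matmul_fp8_fp32acc(A, B):
    # Quantize both operands to "FP8", then accumulate the dot
    # products at full precision (the FP32-accumulation trick).
    n, k, p = len(A), len(A[0]), len(B[0])
    Aq = [[quantize_e4m3(v) for v in row] for row in A]
    Bq = [[quantize_e4m3(v) for v in row] for row in B]
    return [[sum(Aq[i][t] * Bq[t][j] for t in range(k)) for j in range(p)]
            for i in range(n)]
```

The design point: each individual product carries the ~6% relative rounding error of the 3-bit mantissa, but because the running sum is kept wide, errors don't compound catastrophically across long reduction dimensions.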
#1 is a valid and important point, that would explain the model name issue legitimately, and on that I am duly mocked.
#2 You wouldn't want to extract all 15T tokens via the API, since it wouldn't be desirable to have that as your only source of ground truth. A fraction of it, why not: 1T tokens is just $5 million at the batch API price, so the cost isn't a problem, nor a meaningful fraction of OpenAI's revenue, though it would take some doing to route it, likely through enterprise Azure customers.
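The $5 million figure checks out arithmetically if you assume a batch-tier price of $5 per 1M output tokens (the price the comment's number implies, roughly half the standard rate):

```python
tokens = 1e12                  # 1T tokens
price_per_mtok = 5.0           # assumed batch-tier $ per 1M tokens
cost_usd = tokens / 1e6 * price_per_mtok
print(f"${cost_usd:,.0f}")     # $5,000,000
```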
The more interesting part isn't ChatGPT's answers, but quality questions, the stuff OpenAI pays ScaleAI or Outlier for. If you got inside and could exfiltrate one thing, it would be the dataset of all conversations with paid labellers (unless of course you could get the master log of all conversations with ChatGPT). Even the weights aren't as useful as that to a replication effort.
#3 I'm not disputing the actual, demonstrable (and shockingly good) efficiency advances on several fronts. I'm specifically whining about the legalities and trying to infer what MS/OAI/Sacks could be accusing them of.
So go away