1. ChatGPT data is widely available on the internet. Just Google the ShareGPT dataset and you can scrape 200k+ conversations with a few Hugging Face commands. These were used by open-source projects like the Vicuna models; there was a period of several months when RLAIF was all the rage in the open-source community, so this data populated the internet. If a company is crawling and scraping the web, this will eventually end up in its dataset.
2. The DeepSeek-V3 model was trained on 15T tokens. Please educate yourself: calculate how long (in latency, inference for a 1k-token output takes almost 30 seconds) and how much it would cost to extract 15T tokens from the ChatGPT / Azure API. Given that API accounts all have spend limits and would trip fraud detection in OpenAI's billing, how long would the subterfuge have had to go on? With which model? At what time? Wouldn't they have to keep repeating it for each subsequent generation of OpenAI models?
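A back-of-envelope sketch of the scale involved. The per-token price and concurrency here are assumed illustrative values, not OpenAI's actual figures; the 30s-per-1k-tokens latency is the number quoted above:

```python
# Back-of-envelope: time and cost to pull 15T tokens through an API.
# PRICE_PER_MTOK and PARALLEL_REQUESTS are assumptions for illustration.
TOKENS_NEEDED = 15e12          # 15T tokens, the V3 pretraining budget
PRICE_PER_MTOK = 10.0          # assumed $ per 1M output tokens
SECONDS_PER_1K_TOK = 30.0      # latency figure quoted above
PARALLEL_REQUESTS = 1_000      # assumed sustained concurrency

cost_usd = TOKENS_NEEDED / 1e6 * PRICE_PER_MTOK
wall_clock_days = (TOKENS_NEEDED / 1e3) * SECONDS_PER_1K_TOK \
                  / PARALLEL_REQUESTS / 86_400

print(f"cost: ${cost_usd:,.0f}")             # cost: $150,000,000
print(f"time: {wall_clock_days:,.0f} days")  # time: 5,208 days
```

Even with a thousand requests running flat out around the clock, that's on the order of 14 years of continuous extraction, which is the comment's point.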
3. OAI didn't invent MLA, they didn't invent multi-token prediction or decoupled RoPE, and they didn't invent FP8 matmul training dynamics (while accumulating in FP32) without losing significant quality.
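To make the last of those concrete, here is a toy pure-Python sketch of the FP8-matmul-with-FP32-accumulation idea: inputs are rounded to a 3-bit mantissa (mimicking FP8 E4M3), but the products are summed at full precision. This is my own simplification; real E4M3 also clamps the exponent range and handles special values, which this ignores.

```python
import math

def quantize_e4m3(x):
    # Crude simulation of FP8 E4M3: keep 1 implicit + 3 explicit
    # mantissa bits (assumption: no exponent clamping, no NaN/inf).
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)           # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2 ** 4
    return math.ldexp(round(m * scale) / scale, e)

def matmul_fp8_fp32acc(A, B):
    # Quantize both operands to "FP8", then accumulate the dot
    # products at full precision (the FP32-accumulation trick).
    n, k, p = len(A), len(A[0]), len(B[0])
    Aq = [[quantize_e4m3(v) for v in row] for row in A]
    Bq = [[quantize_e4m3(v) for v in row] for row in B]
    return [[sum(Aq[i][t] * Bq[t][j] for t in range(k)) for j in range(p)]
            for i in range(n)]
```

The design point: each individual product carries the ~6% relative rounding error of the 3-bit mantissa, but because the running sum is kept wide, errors don't compound catastrophically across long reduction dimensions.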
#1 is a valid and important point, that would explain the model name issue legitimately, and on that I am duly mocked.
#2 You wouldn't want to extract all 15T tokens via the API, since it wouldn't be desirable to have that as your only source of ground truth. A fraction of it, why not: 1T tokens is just $5 million at the batch API price, so the cost isn't a problem, nor a meaningful fraction of OpenAI's revenue, though it would take some doing to route it, likely through enterprise Azure customers.
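The $5 million figure checks out arithmetically if you assume a batch-tier price of $5 per 1M output tokens (the price the comment's number implies, roughly half the standard rate):

```python
tokens = 1e12                  # 1T tokens
price_per_mtok = 5.0           # assumed batch-tier $ per 1M tokens
cost_usd = tokens / 1e6 * price_per_mtok
print(f"${cost_usd:,.0f}")     # $5,000,000
```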
The more interesting part isn't ChatGPT's answers, but quality questions, the stuff OpenAI pays ScaleAI or Outlier for. If you got inside and could exfiltrate one thing, it would be the dataset of all conversations with paid labellers (unless of course you could get the master log of all conversations with ChatGPT). Even the weights aren't as useful as that to a replication effort.
#3 I'm not disputing the actual, demonstrable (and shockingly good) efficiency advances on several fronts. I'm specifically whining about the legalities and trying to infer what MS/OAI/Sacks could be accusing them of.
So go away