
OpenAI has been hiding their datasets, and certainly hasn't credited me for the data they stole from my website and GitHub repositories. If OpenAI doesn't think they should give attribution for the data they used, it seems weird to require that of others.

Edit: Responding to your edit, DeepSeek only claimed that the final training run cost $5m, not that the whole process cost that (they even call this out). I think it's important to acknowledge that, even if they did get some training data from OpenAI, this is a remarkable achievement.




It is a remarkable achievement. But if “some training data from OpenAI” turns out to essentially be a wholesale distillation of their entire model (along with Llama, etc.), I do think that somewhat dampens the spirit of it.

We don’t know that of course. OpenAI claim to have some evidence and I guess we’ll just have to wait and see how this plays out.

There’s also a substantial difference between training on the entire internet and training that very specifically targets your competitor's products (or any specific work directly).


Only weird if you think what OpenAI did should be the norm.


Right. I think many here are enjoying the Schadenfreude against OpenAI, but that hardly makes it right. It just makes it a race to the bottom.



