Hacker News

I quite like a scenario where LLM output can't be copyrighted, so that it's possible to eventually train an LLM on data from the previous one(s).


OpenAI argues it's a violation of their terms of service, so there are legal issues if it can be proven.


Legal issues for who?

Company A pays OpenAI for their API. They use the API to generate or augment a lot of data. They own the data. They post the data on the open Internet.

Company B has the habit of scraping various pages on the Internet to train its large language models, which includes the data posted by Company A. [1]

OpenAI is undoubtedly breaking many terms of service and licenses when it uses most of the open Internet to train its models. Not to mention potential copyright violations (which do not apply to AI outputs).

[1]: This is not hypothetical BTW. In the early days of LLMs, lots of large labs accidentally and not so accidentally trained on the now famous ShareGPT dataset (outputs from ChatGPT shared on the ShareGPT website).


For both.


Posting OpenAI-generated data on the internet is not breaking the ToS. This is how most OpenAI-based businesses operate, after all [1] (e.g. various businesses that generate articles with AI, various chat businesses that let you share your chats, etc.)

OpenAI is one of the companies like Company B that is using data from the open Internet.

[1] Ownership of content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.


But OpenAI's model isn't open source; how would they distill knowledge without direct access to the model?


You don’t need direct access for LLM distillation, just regular API access.
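To make the point concrete, here's a minimal sketch of API-based distillation data collection. The `query_teacher` function is a hypothetical stand-in for a paid chat-completions API call (no real endpoint is contacted here); the idea is simply that you collect the teacher's text outputs as (prompt, completion) pairs and later fine-tune a student model on them. Model weights are never touched.

```python
import json

# Hypothetical stand-in for an API call to the closed teacher model.
# In practice this would be an HTTP request to a chat-completions endpoint.
def query_teacher(prompt: str) -> str:
    return f"Teacher answer to: {prompt}"

def build_distillation_set(prompts):
    """Collect (prompt, completion) pairs to fine-tune a student model on."""
    records = []
    for prompt in prompts:
        completion = query_teacher(prompt)
        records.append({"prompt": prompt, "completion": completion})
    return records

prompts = ["Explain TCP slow start.", "What is a B-tree?"]
dataset = build_distillation_set(prompts)

# Serialize as JSONL, a common input format for fine-tuning pipelines.
jsonl = "\n".join(json.dumps(r) for r in dataset)
print(len(dataset), "records collected")
```

The student never sees the teacher's weights or logits (with text-only API access), only its sampled outputs; that's exactly why scraping shared outputs, as with the ShareGPT dataset mentioned above, amounts to the same thing.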


OK, I looked it up and have a better understanding now.



