Hacker News

I quite like a scenario where LLM output can't be copyrighted, so that it's possible to eventually train an LLM on data from the previous one(s).


OpenAI argues it's a violation of their terms of service, so there are legal issues if it can be proven.


Legal issues for who?

Company A pays OpenAI for their API. They use the API to generate or augment a lot of data. They own the data. They post the data on the open Internet.

Company B has the habit of scraping various pages on the Internet to train its large language models, which includes the data posted by Company A. [1]

OpenAI is undoubtedly breaking many terms of service and licenses when it uses most of the open Internet to train its models. Not to mention potential copyright violations (which do not apply to AI outputs).

[1]: This is not hypothetical BTW. In the early days of LLMs, lots of large labs accidentally and not so accidentally trained on the now famous ShareGPT dataset (outputs from ChatGPT shared on the ShareGPT website).


For both.


Posting OpenAI-generated data on the internet is not breaking the ToS. This is how most OpenAI-based businesses operate, after all [1] (e.g. various businesses that generate articles with AI, various chat businesses that let you share your chats, etc.)

OpenAI is one of the companies like Company B that is using data from the open Internet.

[1] Ownership of content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.


But OpenAI's model isn't open source; how would they distill knowledge without direct access to the model?


You don’t need direct access for LLM distillation, just regular API access.
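To make the point concrete, here's a minimal sketch of API-based distillation data collection. The `query_teacher` function is a hypothetical stand-in for a paid chat-completions API call (no real endpoint is contacted here); the idea is simply that you collect the teacher's text outputs as (prompt, completion) pairs and later fine-tune a student model on them. Model weights are never touched.

```python
import json

# Hypothetical stand-in for an API call to the closed teacher model.
# In practice this would be an HTTP request to a chat-completions endpoint.
def query_teacher(prompt: str) -> str:
    return f"Teacher answer to: {prompt}"

def build_distillation_set(prompts):
    """Collect (prompt, completion) pairs to fine-tune a student model on."""
    records = []
    for prompt in prompts:
        completion = query_teacher(prompt)
        records.append({"prompt": prompt, "completion": completion})
    return records

prompts = ["Explain TCP slow start.", "What is a B-tree?"]
dataset = build_distillation_set(prompts)

# Serialize as JSONL, a common input format for fine-tuning pipelines.
jsonl = "\n".join(json.dumps(r) for r in dataset)
print(len(dataset), "records collected")
```

The student never sees the teacher's weights or logits (with text-only API access), only its sampled outputs; that's exactly why scraping shared outputs, as with the ShareGPT dataset mentioned above, amounts to the same thing.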


OK, I looked it up and have a better understanding now.



