Sure, why not? lol

https://www.reddit.com/r/DataHoarder/comments/1entowq/i_made...

https://github.com/shloop/google-book-scraper

That Meta torrented Books3 and other datasets comes from the self-admission of Meta employees who performed the work and/or oversaw those who did, so it is not really in dispute or ambiguous.

https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...



Books3 was used in Llama1. We don't know if they used it later on.


My comparison was illustrative, an analogy. The copyright cartel is making a fruit-of-the-poisonous-tree type of argument. Whatever Meta is doing with LLMs does the heavy lifting that parity files used to do back in the Usenet days. I wouldn’t be surprised if BitTorrent or similar caching and distribution mechanisms eventually incorporate AI/LLMs to recognize an owl on the wire, draw the rest of it just in time in transit, and send only the diffs, or something like that.

The pictures are the same. All roads lead to Rome, so they say.
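
To make the hand-wave concrete, here is a toy Python sketch of that "send only the diffs" idea. Everything in it is hypothetical: a shared `predicted` blob stands in for whatever output both peers' model would generate, and no real BitTorrent client or parity tool works this way today.

    import zlib

    def encode_residual(original: bytes, predicted: bytes) -> bytes:
        # Sender: keep only the places where reality differs from the shared prediction.
        padded = predicted.ljust(len(original), b"\0")
        residual = bytes(a ^ b for a, b in zip(original, padded))
        return zlib.compress(residual)  # correctly predicted bytes become zeros and compress away

    def decode_residual(payload: bytes, predicted: bytes, length: int) -> bytes:
        # Receiver: regenerate the same prediction locally and patch the diffs back in.
        residual = zlib.decompress(payload)
        padded = predicted.ljust(length, b"\0")
        return bytes(a ^ b for a, b in zip(residual, padded))

    if __name__ == "__main__":
        predicted = b"a mostly predictable page of text " * 64   # stand-in for shared model output
        original = bytearray(predicted)
        original[100:110] = b"owl quirks"                        # the part the "model" got wrong
        wire = encode_residual(bytes(original), predicted)
        assert decode_residual(wire, predicted, len(original)) == bytes(original)
        print(f"{len(original)} bytes of content, {len(wire)} bytes on the wire")

Of course the savings only show up when the prediction is actually good, which is where the LLM would come in; with a bad prediction you end up shipping the whole owl anyway.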


All of the major AI models these days use "clean" datasets stripped of copyrighted material.

They also use data from previous models, so I'm not sure how "clean" it really is.


> All of the major AI models these days use "clean" datasets stripped of copyrighted material.

Which of the major commercial models discloses its dataset? Or are you just trusting some unfalsifiable self-serving PR characterization?


It's from my personal experience in the industry.


What are your thoughts on the origin of the LLaMA leak? It's interesting that the training data was torrented, and so was the leak. Perhaps we will never know? For the OSINT folks, not a lot to go on, or maybe a lot, depending?

https://en.wikipedia.org/wiki/Llama_(language_model)#Leak

https://archived.moe/g/thread/91848262#p91850335

https://github.com/meta-llama/llama/pull/73/files


I don't really know much about that, sorry.


I didn’t ask for info, I asked for your views. I gave you all the info anyone has publicly, so you have enough to comment.

I suspect it was a limited-hangout self-own: Meta "leaks" the model so they can claim they aren’t responsible for its distribution, then does research on an LLM that they developed but that was "leaked", and argues the subsequent research is not tainted under the fruit-of-the-poisonous-tree doctrine. Or their torrent client, or other software on the same machine, had 0-days and they got hacked by someone on the Books3 swarm, or by someone who knew which IPs were connecting to it.

I appreciate your posts and I am replying to you to humbly ask you to post more. :P


I'm not really sure what you are insinuating. You think Meta leaked LLaMA so they could claim, legally, that they are in the clear on the copyright violation? Sorry, I just don't really get what you want me to opine about.

If that is what you are asking, I don't think that's what happened. It's far more likely that it was simply leaked or grabbed by a hacker.


I just thought the whole situation was interesting. You commented about the current LLM research being clean, while being based on prior LLMs which were perhaps less clean, so I thought that it was a curious coincidence how torrents kept popping up.


All written text is copyrighted, with few exceptions like court transcripts. I own the copyright to this inane comment. I sincerely doubt that all copyrighted material is scrubbed.


Your brief comment is hardly copyrightable, which makes your point moot.



