
As I've said several times, the corpus is key: LLMs thus far "read" most anything, but they should instead have well-curated corpora. "Garbage In, Garbage Out" (GIGO), as the saying goes.

While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".

Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder.



> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

> Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder

Plenty of in-stealth companies are taking this approach to LLMs ;)

For those of us who studied the natural sciences and CS in the 2000s and early 2010s, there was a bit of a trend where certain PIs would simply translate German and Russian papers from the early-to-mid 20th century and attribute them to themselves in fields like CS (especially in what became ML).


> It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

Personally I’m assuming the worst.

That being said, Harry Potter was such a big cultural phenomenon that I wonder to what degree might one actually be able to reconstruct the books based solely on publicly accessible derivative material.


Why are you talking about Claude and Anthropic?


It’s not unreasonable to suspect they are doing the same. The article opens with a description of the lawsuit the NY Times brought against OpenAI for similar reasons. The big difference is that the research presented here is only possible with open-weight models. OpenAI and Anthropic don’t make their base models available, so it’s easier to hide the use of copyrighted material behind instruction post-training. And I’m not sure you can get the logprobs for specific tokens from their APIs either (which is what the researchers used to make the figures and arrive at a concrete number like 42%).
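For intuition, here's a minimal sketch of how per-token logprobs turn into a memorization score. The 0.5 threshold and 50-token span length are illustrative assumptions on my part, not necessarily what the researchers used:

```python
import math

def span_logprob(token_logprobs):
    # The log-probability of emitting the whole span verbatim is the
    # sum of the per-token log-probabilities.
    return sum(token_logprobs)

def is_extracted(token_logprobs, threshold=0.5):
    # Count the span as memorized if the model is more likely than not
    # to reproduce it verbatim given the preceding context.
    return math.exp(span_logprob(token_logprobs)) > threshold

# Memorized text: the model puts ~0.99 probability on each next token.
memorized = [math.log(0.99)] * 50
# Ordinary text: many plausible continuations, so ~0.3 per token.
ordinary = [math.log(0.3)] * 50

print(is_extracted(memorized))  # True  (0.99**50 is about 0.61)
print(is_extracted(ordinary))   # False (0.3**50 is vanishingly small)
```

This is exactly why base-model logprobs matter: with only a chat endpoint that hides them, you can't compute the sum in the first place.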


Good call! I brain farted and wrote Claude/Anthropic instead of Meta/Llama.


So if I memorized Harry Potter the physical encoding which definitely exists in my brain is a copyright violation?


> the physical encoding which definitely exists in my brain is a copyright violation

First of all, we don't really know how the brain works. I get that you're being a snarky physicalist, but there are plenty of substance dualists, panpsychists, etc. out there. So, some might say, this is a reductive description of what happens in our brains.

Second of all, yes, if you tried to publish Harry Potter (even if it was from memory), you would get in trouble for copyright violation.


Right, but the physical encoding already exists in my brain; otherwise, how could I reproduce it in the first place? We may not know how the encoding works, but we do know that an encoding exists, because a decoding is possible.

My question is… is that in itself a violation of copyright?

If not, then as long as LLMs don’t publish anything, it shouldn’t be a copyright violation, right? Because we don’t understand how it’s encoded in LLMs either. It is literally the same concept.


To me the primary difference between the potential "copy" that exists in your brain and a potential "copy" that exists in the LLM is that you can't make copies of your brain and distribute it to billions of people.

If you compressed a copy of HP as a .rar, you couldn't read that as is, but you could press a button and get HP out of it. To distribute that .rar would clearly be a copyright violation.
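That round trip in miniature, using Python's zlib as a stand-in for .rar (the repeated placeholder text is just illustrative; real prose compresses less well, but the mechanics are the same):

```python
import zlib

# Stand-in for a book's full text.
book = b"some long passage of a novel, repeated " * 1000

archive = zlib.compress(book)        # opaque bytes: not readable as-is
restored = zlib.decompress(archive)  # "press a button" to get it back

print(restored == book)              # the original comes back exactly
print(len(archive) < len(book))      # and the archive is much smaller
```

The archive contains no readable text, yet nobody would argue it isn't a copy of the book.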

Likewise, you can't read whatever of HP exists in the LLM model directly, but you seemingly can press a bunch of buttons and get parts of it out. For some models, maybe you can get the entire thing. And I'm guessing you could train a model whose purpose is to output HP verbatim and get the book out of it as easily as de-compressing a .rar.

So, the question in my mind is: how similar is distributing the LLM model, or giving access to it, to distributing a .rar of HP? There's likely a spectrum of answers depending on the LLM.


> that exists in the LLM, is that you can't make copies and distribute your brain to billions of people.

I can record myself reciting the full Harry Potter book then distribute it on YouTube.

Could do the exact same thing with an LLM. The potential for distribution exists in both cases. Why is one illegal and the other not?


> I can record myself reciting the full Harry Potter book then distribute it on YouTube.

At this point you've created an entirely new copy in an audio/visual digital format and taken the steps to make it available to the masses. This would almost certainly cross the line into violating copyright laws.

> Could do the exact same thing with an LLM. The potential for distribution exists in both cases. Why is one illegal and the other not?

To my knowledge, the legality of LLMs is still being tested in the courts, as in the NYT vs. Microsoft/OpenAI lawsuit. But your video copy and distribution on YouTube would be much more similar to how LLMs are being used than your initial example of reading and memorizing HP just by yourself.


> I can record myself reciting the full Harry Potter book then distribute it on YouTube

Not legally, you can't. Both of your examples are copyright violations.


Recording yourself is not a violation; only publishing on YouTube is. Content generated with LLMs is not a violation. Publishing the content you generated might be.


Generating the content for the user is the distribution regardless of what the user does with it


Copyright is actually not so much about the right to copy as it is about redistribution permissions.

If you trained an LLM on real copyrighted data, benchmarked it, wrote up a report, and then destroyed the weights, that's transformative use and legal in most places.

If you then put up that gguf on HuggingFace for anyone to download and enjoy, well... IANAL. But maybe that's a bit questionable, especially long term.


I don’t think the lawyers are going to buy arguments that compare LLMs with human biology like this.


You are not selling or distributing copies of your brain.


If you perform it from memory in public without paying royalties then yes, yes it is.

Should it be? Different question.


The end of "Fahrenheit 451" set a horrible precedent. Damn you, Bradbury!


Only if you charge someone to reproduce it for them


I think humans get a special exception in cases like this


No they don't. Commercial intent is what is prosecuted in IP law.


Maybe if you rewrote it from memory.


That's got nothing to do with it. It's all about copyright. Can it reproduce its training data verbatim? If so, Meta is in hot water.


But if its corpora do NOT include the Harry Potter books, then Meta is NOT in hot water! So take the Harry Potter books out of the corpora. What is lost? Nothing useful, IMO, other than the ability to discuss the Harry Potter books. BFD.


If I read Harry Potter, and you ask me about a page, and I can recite it verbatim, did I just commit copyright infringement?


Are you selling your ability to recite stuff? Then certainly.


There are plenty of open-source LLMs trained on Harry Potter; is that fine?


No


I pay for a service. The service recites a novel to me. The service would need permission to do this or it is copyright infringement.


This is an extremely common strawman argument. We're not discussing human memory.


> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

To address this point, and not other concerns: the benefits would be (1) pop culture knowledge and (2) having a variety of styles of edited/reasonably good-quality prose.



