Hacker News

I’m simultaneously surprised and unsurprised that announcements about Copilot get so much copyright discussion, while the GPT-like models don’t get nearly as much. Meanwhile, GPT-J is literally trained on pirated books (the books3 corpus is part of the Pile, which is the corpus this was trained on).

Charitably, it’s because licenses are already such a core discussion when GitHub comes up.

Uncharitably, it’s because Copilot uses “our” community’s labor, while the GPTs use others’.



Part of the difference, as the other commenter mentioned, is that Copilot isn't open source, while for the GPT models basically everything except the final trained model is.

The other aspect of it is the application. GPT-3 isn't particularly aimed at putting its generated output into finished works; it exists more as an experiment than anything else. Where the output is used, it is generally non-commercial, not included in the final product, or transient and doesn't actually stick around (e.g. AI Dungeon).

Compare that to Copilot which, while in beta, is very much being marketed as a programming utility to help you write code, with the implication that said code will end up in the final product. If GPT-3 were being used as a writing aid (not just for brainstorming but for actually writing), then I think we would be seeing a very different discussion around it.

Another consideration (which I'm not sure is true, but I'm inclined to believe) is that programming text tends to have a smaller resolution at which it becomes "unique", i.e. at which it can be matched against a source as copyright infringement. I may be wrong about this, and Copilot may just be poorly trained or designed by comparison, but it seems far harder to identify outright copied text from GPT-3 (that isn't quoted/attributed). I'm sure examples exist, but from my experience with these text generation tools it seems far harder to get into copyright violation territory.

---

Side note: If Copilot were working at an AST level rather than at a textual level, I suspect it would have far fewer issues with copyright and would be more useful as a tool.
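To illustrate the AST idea (a toy sketch using Python's stdlib `ast` module, not anything Copilot actually does): at the AST level, two snippets that differ only in comments and layout are already identical, so matching and generation at that level would be insensitive to surface-text details.

```python
import ast

# Two textually different snippets with identical structure:
# the comments and layout differ, but the parse trees do not.
src_a = """
def add(x, y):
    # sum two values
    return x + y
"""
src_b = "def add(x, y): return x + y"

# ast.dump serializes the tree; comments and whitespace never reach it.
same = ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))
print(same)  # True
```

A real AST-level system would also have to normalize identifier names and the like, but even this much shows how the "resolution of uniqueness" changes once you leave the textual level.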


OpenAI is absolutely trying to commercialise GPT-3. But I agree the applications aren't so obviously "here is some text, you can put it in your product".


Part of the Copilot discussion was about patents rather than copyright, and patents don't apply to prose. Also, the concern is less about the legal implications of Copilot itself than about those for developers using its output, which are largely the same reasons we frown on people copy-pasting code from StackOverflow or random Google results (beyond quality).

The copyright problem with Copilot is not just the license of the corpus it was trained on, it's also that in many cases it reproduces source material verbatim with no indication that this is happening.

If GPT were to be used to produce fiction books, poetry or lyrics (not simply as an artistic experiment), I'm sure its output would undergo similar scrutiny from people in the respective industries. As it stands, for text it's more likely to see use to generate rough drafts for news articles and blog posts which would need a lot of editing to make useful. It might still reproduce writing styles or idioms but neither of these are subject to copyright in much the same way as lines of code.

Making the output of Copilot useful is more challenging, even if you could magically avoid the legal minefield its training data poses. The quality is hit or miss, it can introduce subtle bugs, and because it doesn't understand the code it generates, you now have to understand that code yourself, which can be difficult because you didn't even come up with it and there's no one you can ask.


It’s simply because the output of Copilot is intended to be included in commercial projects. That’s when the licensing issues actually matter.

The output of this isn’t really proposed for anything in particular right now. If someone turned this into a tool to help with creative writing or something the exact same issues would be raised.


That doesn't make sense. If the scope is broader, then you can do at least as much infringement as you could if the scope were narrow.


It's because Copilot isn't open source.


How’s that make a difference? Being open source doesn’t make IP issues disappear in any way I can see.


Fair point. OP is APL2 (Apache License 2.0), which isn't quite good enough since it's probably using GPLed stuff, but it's better than being closed source.


It’s not just using GPL stuff, it’s also using straight up pirated books, for which APL2 is very definitely insufficient.


Not just "not open source", but Github specifically said they intend to monetize it.


Neither is GPT-3?


One of these promises to justify the billable hours of half the industry for the next decade; the other threatens to eliminate them by the next decade. It really isn't more complicated than that.



