This isn't my experience with Co-pilot's suggestions. I've literally been able t...

jacoblambda · on Nov 6, 2022

Worth noting with models like copilot is that if you deliberately give it an input similar to the training contents, odds are it'll reproduce them nearly verbatim.

The main issue is that while you can use copilot to create "new"/transformative code, it's also trivial to get it to pump out licensed works in a form where you could claim "I didn't know it was taken from x project with y license because the tool made it for me".

I personally have no problem with copilot in concept; however, doing it (or any other AI-model-based text/graphics tool) without infringing on people's copyrights is practically an unsolved problem (excluding just pre-licensing the training data ahead of time).

withinboredom · on Nov 6, 2022

I mean, you can prompt me (or any other engineer) to spit out copyrighted code. FizzBuzz comes to mind… as do a number of algorithms I’ve written in the past which belong to my past employers…

I really think we are entering some interesting territory that will likely be an interesting can of worms.

Brian_K_White · on Nov 6, 2022

There is somehow something different if you knew the entire code to QuickBooks from memory, and had an API where I could request any 10-line chunk of it I wanted, as many times as I wanted.

withinboredom · on Nov 6, 2022

So you’re saying that how fast someone can type and how well they can recall makes a difference? I don’t type over 120 words per min like my grandma, but I have a photographic memory. I can tell you what file & line a chunk of code belongs to, or spit it out verbatim, customized to the current problem I’m working on.

So, you’re saying I can’t work in this industry? That seems a bit harsh.

wizzwizz4 · on Nov 6, 2022

> So, you’re saying I can’t work in this industry? That seems a bit harsh.

If you're going to be spitting copyrighted code out in violation of any licenses it might be made available under… yes, you can't. Most employers would not appreciate that behaviour. But I doubt you actually do this, even though you're capable of it. You reason about your code; you're not just being a predictive text engine.

Brian_K_White · on Nov 6, 2022

It sounds like for you in particular, yes, since you seem to want to go out of your way to find any way to violate copyright, even when the terms are intentionally generous. Indeed such a person should not work in this industry, though I'm sure there are plenty of employers who are happy to have you steal for them, so you will be able to regardless.

withinboredom · on Nov 6, 2022

My point was that we all do this /not on purpose/ (and, being an exception, I can make sure I personally don’t). But when I see code that existed in another company with some variables changed, I don’t flag it. There are only so many ways to describe a chair; are they all copyrighted?

Brian_K_White · on Nov 6, 2022

"My point was that we all do this /not on purpose/"

I don't concede the equivalence.

dev_tty01 · on Nov 6, 2022

It's really simple. If you are outputting licensed code and not abiding by the terms of the license, then yes, that is a problem.

heavyset_go · on Nov 6, 2022

Companies already pay a lot of money for datasets to train models on in spaces outside of software development. On top of that, they spend a lot of money on labelling and whatnot.

Software is unique in that there is a cultural trend to share source code, so that makes it easy to compile into "free" datasets.

I wouldn't say it's an unsolved problem, it's just that there are no incentives to compile or pay for datasets when Microsoft already has petabytes of code to train on. If anything, I expect Microsoft to sell datasets based on GitHub repositories if Copilot-like models survive this lawsuit and are conmoditized.

cbzbc · on Nov 6, 2022

Not totally unique in that respect; the situation doesn't seem too dissimilar from the one that led Shutterstock to launch their contributor fund.

heavyset_go · on Nov 6, 2022

Commoditized*