Hacker News

This isn't my experience with Copilot's suggestions. I've literally had Copilot suggest a complete unit test based on a novel structure I hand-coded myself and a few words describing the test. The constants are often wrong, but it saves minutes of fiddling with the syntax for unit tests and assertions.

These are not quotations from other people's code but something drawn from the deep structures of language and programming language semantics. However, I suspect that if you knew enough of a snippet from another source, you could coax Copilot into suggesting code learned from that source, though it would likely be washed out by other code in the corpus wherever the meanings coincided.



Worth noting with models like Copilot: if you deliberately give it an input similar to its training data, odds are it will reproduce that data nearly verbatim.

The main issue is that while you can use Copilot to create "new"/transformative code, it's also trivial to get it to pump out licensed works in a form where you could claim "I didn't know it was taken from x project with y license because the tool made it for me".

I personally have no problem with Copilot in concept. However, doing it (or any other AI-model-based text/graphics tool) without infringing on people's copyrights is practically an unsolved problem (excluding just pre-licensing the training data ahead of time).


I mean, you can prompt me (or any other engineer) to spit out copyrighted code. FizzBuzz comes to mind… as do a number of algorithms I’ve written in the past which belong to my past employers…

I really think we are entering some interesting territory that will likely turn out to be a can of worms.


It would somehow be different if you knew the entire code of QuickBooks from memory and had an API where I could request any 10-line chunk of it I wanted, as many times as I wanted.


So you’re saying that how fast someone can type and how well they can recall makes a difference? I don’t type over 120 words per min like my grandma, but I have a photographic memory. I can tell you what file & line a chunk of code belongs to, or spit it out verbatim, customized to the current problem I’m working on.

So, you’re saying I can’t work in this industry? That seems a bit harsh.


> So, you’re saying I can’t work in this industry? That seems a bit harsh.

If you're going to be spitting copyrighted code out in violation of any licenses it might be made available under… yes, you can't. Most employers would not appreciate that behaviour. But I doubt you actually do this, even though you're capable of it. You reason about your code; you're not just being a predictive text engine.


It sounds like for you in particular, yes, since you seem to want to go out of your way to find any way to violate copyright, even when the terms are intentionally generous. Indeed such a person should not work in this industry, though I'm sure there are plenty of employers who are happy to have you steal for them, so you will be able to regardless.


My point was that we all do this /not on purpose/ (and I, being an exception, can make sure I personally don’t). But when I see code that existed at another company with some variables changed, I don’t flag it. There are only so many ways to describe a chair; are they all copyrighted?


"My point was that we all do this /not on purpose/"

I don't concede the equivalence.


It's really simple. If you are outputting licensed code and not abiding by the terms of the license, then yes, that is a problem.


Companies already pay a lot of money for datasets to train models on in spaces outside of software development. On top of that, they spend a lot of money on labelling and whatnot.

Software is unique in that there is a cultural trend to share source code, which makes it easy to compile into "free" datasets.

I wouldn't say it's an unsolved problem; it's just that there are no incentives to compile or pay for datasets when Microsoft already has petabytes of code to train on. If anything, I expect Microsoft to sell datasets based on GitHub repositories if Copilot-like models survive this lawsuit and are conmoditized.


Not totally unique in that respect. The situation doesn't seem too dissimilar from the one that led Shutterstock to launch their contributor fund.


Commoditized*



