iwintermute's comments

Doesn't work in Chrome even after confirming age and accepting that it can contain 'topics related to self-harm'

Great new feature! Such impact! Happy perf!


The audacity and randomness of it: System of a Down's 'Chop Suey!', with the lyric 'I don't think you trust in my self-righteous suicide', plays fine in the same playlist.

No objections to 'Hurt' by NIN/Johnny Cash either.


I know it's off-topic, but maybe "MAANG level app" >> "MANGA" if we want to go there?


But if F->M, then why not G->A? MAAAN.


Referring to the tech powers that be as "The MAAAN" is pretty fitting.


I like this, I suppose under this new acronym cutting Google and Facebook out of your life would be "sticking it to the MAAAN".


If you're going to rename F to M, why not rename G to A?


I've got the 001 model and like it so much that I'm getting the 100 as my second one.


There are conversion kits: https://youtu.be/VZws7kE3U5k


I feel like if you have to hide your face behind a mask, there's no need to be present in the office at all. A video call would actually be better.


Have you seen Handmade Hero? https://handmadehero.org/

Plus, what do you mean by coding games specifically? Is it game engine programming? Game design? Or other related stuff?


It's solving a 'mechanical' problem. The optimistic twist on this helper is that it just raises the bar: a human programmer had better be more useful than a 'brainless' code generator, meaning not only able to write a loop or solve a leetcode task, but also able to understand the context and what they're trying to solve for.

As you say, typing code is not the bottleneck in problem solving.


So if it was trained on "source code from publicly available sources, including code in public repositories on GitHub", was some of that code GPLv2?

Is everything generated also GPLv2?


You bring up a really good point. I'm super curious what the legality and ethics of training machines on licensed or even proprietary code would be. IIRC there are legal implications for code you write if you've seen proprietary code (I remember an article on HN about how bash had to be written by someone who hadn't seen the Unix shell source, or something like that).

How would we classify that legally when it comes to training and generating code? Would you argue the machine is just picking up best practices and patterns, or would you say it has gained specifically-licensed or proprietary knowledge?


I would argue that a trained model falls under the legal category of "compilation of facts".

More generally, keep in mind that the legal world, despite an apparent focus on definitions, is very bad at dealing with novelty, and most of it ends up justifying existing practices a posteriori.


You might argue that, but you would likely be wrong.

Even a search engine is not merely a "compilation of facts". A trained model is the result of analysis and reasoning, albeit automated.


A search engine provides snippets of other data. You can point explicitly to where it got that text from. A trained model generates its own new data, from influence of millions of different sources. It's entirely different.


> (I remember an article from HN about how bash had to be written by someone who hadn't seen the unix shell code or something like that).

I believe you're referring to Clean Room Design[1].

[1] https://en.wikipedia.org/wiki/Clean_room_design


This is a bit tricky, because at least in the U.S., I don't believe it's a settled question in law yet. Some of the other posters here have said that the resulting model isn't covered by the GPL. That's partially true, but the provenance of data, and the rights to it, definitely do matter. A good example of this was the Everalbum ruling, where the company was forced to delete both the data and the trained models it was used to generate, due to lack of consent from the users the data was taken from[1]. Since open source code is, well, open, it's definitely less of a problem for permissively-licensed code.

That said, copyright is typically assigned to the human closest to the activation process (it's unlikely that GitHub is going to try to claim copyright on code generated by Copilot over the human/company pair-programming with it). But since copyleft is pretty specific to software as a domain, AFAIK the way courts will interpret the legality of using code licensed under those terms as training data for a model that doesn't produce copyleft output is still up in the air.

Obligatory IANAL, and also happy to adjust this info if someone has sources demonstrating updates on the current state.

[1] https://techcrunch.com/2021/01/12/ftc-settlement-with-ever-o...



> The case debates the legal right for Google to use copyrighted books in its training database in order to train its Google Book Search algorithm

That's not even remotely the same thing.


Until the legal position is clear, you'd have to be insane to allow output from this process to be incorporated into your codebase.

Imagine if the output were ruled to be GPLv2, and you then had to go through a proprietary codebase trying to rip out these bits of code.

It would be basically impossible.


No, a model trained on text covered by a license is not itself covered by the license, unless it explicitly copies the text (you cannot copyright a "style").


But it actually is explicitly copying the text. That's how it works. The training data are massive, and you will get long strings of code pulled directly from that training data. It isn't giving you just the style. It may be mashing together several different code examples, taking some text from each. That's called a "derivative work".


No, that's not how it works.

"[...] the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set"

https://copilot.github.com/#faqs


If that's the case (only 0.1%), the developers must have done something that differs from other OpenAI code-suggestion experiments I recall seeing, where significant chunks of code from Stack Overflow or similar sites were appearing in the answers.


So you're gambling on whether the code was generated or copied.


No you aren't. Courts will consider it fair use.


How are you going to prove it was the AI that generated the GPL-licensed function verbatim from another project, rather than you just opening that project and copying the function yourself?


I will not. Courts will simply consider a single function not to be a substantial enough piece of work to constitute unfair use.


Use a Bloom filter to skip/regenerate that 0.1%.
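
A minimal sketch of that idea in Python (the toy corpus, n-gram size, and filter parameters are all made up for illustration): hash token n-grams of the training corpus into a Bloom filter once, then flag any suggestion whose n-grams hit it. False positives only cost a harmless regeneration.

    import hashlib

    class BloomFilter:
        """Tiny Bloom filter: m bits, k probe positions derived from blake2b."""
        def __init__(self, m=1 << 20, k=4):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _positions(self, item):
            for i in range(self.k):
                digest = hashlib.blake2b(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    def ngrams(text, n=4):
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(max(0, len(toks) - n + 1))}

    # Index the corpus once (toy stand-in for the real training set).
    bf = BloomFilter()
    for source in ["def add(a, b): return a + b", "for i in xs: print(i)"]:
        for gram in ngrams(source):
            bf.add(gram)

    def looks_verbatim(suggestion):
        # True if any n-gram of the suggestion was (probably) seen in training.
        return any(gram in bf for gram in ngrams(suggestion))

    print(looks_verbatim("def add(a, b): return a + b"))  # True  -> regenerate
    print(looks_verbatim("def mul(a, b): return a * b"))  # False -> keep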


Synthesising material from various sources isn't copyright infringement; that's called writing.

It's only infringement if the portion copied is significant either absolutely or relatively. A line here or there of the millions in the Linux kernel is okay. A couple of lines of a haiku is not. Copyright is not leprosy.


Google Books actually displays full pages of copyrighted works Google did not license. It was considered legal.

[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....


We don't all have Google's resources. What if someone comes after us individually because some model-generated code is nearly identical to code in a GPL codebase? Where does the liability lie?

edit: from https://copilot.github.com/

> What is my responsibility when I accept GitHub Copilot suggestions?

> You are responsible for the content you create with the assistance of GitHub Copilot. We recommend that you carefully test, review, and vet the code, as you would with any code you write yourself.

Well, that solves that question.


We are all vulnerable to predatory lawyer trolls, whether we do things correctly or not. If you are accused of reusing GPL code, you ask for clarification on which part and you rewrite it. It is likely to be just a snippet; I doubt Copilot would write a whole library by copying it from another project.

And yes, of course GitHub is not going to take responsibility for things you do with their tools.


If you learn programming from Stack Overflow and GitHub, and then repeat something you learned over your time reading, that's not just copying text. That's having learned the text. You could say the human brain is mashing together several different code examples, taking some text from each.


hmm, so let's think this through.

Wouldn't that imply that a person who learned to code from GPLv2 sources and then writes some more code in that style (including "long strings of code", some of which are clearly not unique to GPL) is writing code that is "born GPLv2"?

I don't think it currently works that way.


My guess is that it is, if we think of a machine learning framework as a compiler and the model as compiled code. Compiled GPL code is still GPL; that's the entire point.

Anyway, GitHub is Microsoft, and Microsoft has really good lawyers, so I guess they did everything necessary to make sure that you can use it the way they tell you to. The most obvious solution would be to filter by LICENSE.txt and only train the model on code under permissive licenses.
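
As a rough sketch of what that pre-filter could look like in Python (the "corpus" directory, the marker strings, and the file names are assumptions; real license detection, e.g. GitHub's licensee tool, is far more robust):

    from pathlib import Path

    # Substrings that loosely identify permissive licenses; purely illustrative.
    PERMISSIVE_MARKERS = ("MIT License", "Apache License", "BSD")

    def is_permissive(repo: Path) -> bool:
        for name in ("LICENSE", "LICENSE.txt", "LICENSE.md"):
            path = repo / name
            if path.is_file():
                text = path.read_text(errors="ignore")
                return any(marker in text for marker in PERMISSIVE_MARKERS)
        return False  # no license file: assume all rights reserved and skip

    # "corpus" is a hypothetical directory of checked-out repositories.
    repos = [p for p in Path("corpus").iterdir() if p.is_dir()]
    training_set = [r for r in repos if is_permissive(r)]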


> you cannot copyright a "style"

This line of thinking applies to the code generated by the model, but not necessarily to the model itself, or the training of it.


Thanks. In retrospect, I should have explicitly said "code generated by the model".


The trained model is a derivative work that contains copies of the training corpus embedded in the model. If any of the training code was GPL, the output is now covered by the GPL. The music industry has already done most of the heavy lifting here in terms of the scope and nature of derived works, and while IANAL, I would not suggest that it looks good for anyone using this tool if GPL code was in the training set.


Well, it probably is explicitly copying at least some subset of the source text - otherwise the code would be syntactically invalid, no?


I can't say what's happening in GitHub Copilot, but it's not necessarily true that the only way to produce syntactically valid outputs is to take substrings of the source text. It is possible to learn something approximating a generative grammar.

Take a look at https://karpathy.github.io/2015/05/21/rnn-effectiveness/

At the same time, I would not be surprised if there are outputs that do correspond to the source training data.


Strictly speaking, you could train a model which does not contain the original source text (just the underlying language structure and word tokens), and which generates ASCII strings that are consistent with the underlying generative model and are also always valid code. I expect to see code-generator models that explicitly generate valid code as part of their generalization capability.
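
A toy illustration of that point (a hand-written context-free grammar standing in for the learned structure; this is not how Copilot works): sampling from the grammar emits strings that always parse, while storing no training text at all.

    import random

    # Toy grammar for arithmetic expressions; non-terminals map to productions,
    # anything else is emitted as a literal token.
    GRAMMAR = {
        "expr": [["term", "+", "expr"], ["term"]],
        "term": [["factor", "*", "term"], ["factor"]],
        "factor": [["(", "expr", ")"], ["num"]],
    }

    def sample(symbol="expr", depth=0):
        if symbol == "num":
            return str(random.randint(0, 9))
        if symbol not in GRAMMAR:
            return symbol  # literal token like '+' or '('
        # Force the shortest production past a depth limit so sampling terminates.
        options = GRAMMAR[symbol] if depth < 8 else [GRAMMAR[symbol][-1]]
        return "".join(sample(s, depth + 1) for s in random.choice(options))

    expr = sample()
    print(expr, "=", eval(expr))  # every sample is syntactically valid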


There will almost certainly be cases where it copies exact lines. When working with GPT-2 I got whole chunks of news articles.


I seem to remember a similar discussion about IntelliCode (a similar thing, but more like IntelliSense, and a Visual Studio plugin), which is trained on "github projects with more than 100 stars". IIRC they check the LICENSE.txt file in the project and ignore projects with an "incompatible" license. I don't have any links handy that would confirm this, though.


Could it be this? https://visualstudio.microsoft.com/services/intellicode/

I was wondering the same thing, especially with MS being behind both.

edited: or this? https://docs.microsoft.com/en-us/visualstudio/intellicode/cu...


My guess would be that the model itself (and the training process) could have different legal requirements compared to the code it generates. The code generated by the model is probably a sufficiently transformative new work that it wouldn't be GPL (it's "fair use").

I suspect there could be issues on the training side, using copyrighted data for training without any form of licensing. Typically ML researchers have a pretty free-for-all attitude towards 'if I can find data, I can train models on it.'


No, the code generated is what copyright law calls a derivative work, and you should go ask Robin Thicke and Pharrell Williams exactly how much slack the courts give for 'sufficiently transformative new work'.


My bet is that copyright law has not caught up with massive machine learning models that partially encode the training data, and that there will still be cases to set legal precedent for machine learning models.

Note also that it's not just a concern for copyright, but also privacy. If the training data is private, but the model can "recite" (reproduce) some of the input given an appropriate query, then it's a matter of finding the right adversarial inputs to reconstruct some training data. There are many papers on this topic.
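
For a flavor of what those papers do (this is in the spirit of Carlini et al.'s "Extracting Training Data from Large Language Models"; the model, prompt, and use of the loss as a memorization signal here are illustrative assumptions, not a working attack):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # Sample many continuations of a suggestive prefix...
    inputs = tok("The password is", return_tensors="pt")
    with torch.no_grad():
        outs = model.generate(**inputs, do_sample=True, top_k=40,
                              max_new_tokens=32, num_return_sequences=8,
                              pad_token_id=tok.eos_token_id)

    # ...then rank them by the model's own confidence: an unusually low
    # per-token loss hints at recitation of memorized text.
    for seq in outs:
        with torch.no_grad():
            nll = model(seq.unsqueeze(0), labels=seq.unsqueeze(0)).loss.item()
        print(f"nll={nll:.2f}  {tok.decode(seq, skip_special_tokens=True)!r}")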


It is almost certainly the case that current IP law is very unsettled when it comes to machine learning models and mechanisms that encode a particular training set into the output or mechanism for input transformation. What should probably scare the shit out of people looking to commercialize this sort of ML is that the most readily available precedents for the courts to look at are from the music industry, and some of the outcomes have truly been wacky IMHO. The 'blurred lines' case is the one that should keep tech lawyers up at night, because if something like that gets applied to ML models the entire industry is in for a world of pain.


You're missing the fair use aspects. Check out this article on fair use [0].

> In 1994, the U.S. Supreme Court reviewed a case involving a rap group, 2 Live Crew, in the case Campbell v. Acuff-Rose Music, 510 U.S. 569 (1994)... It focused on one of the four fair use factors, the purpose and character of the use, and emphasized that the most important aspect of the fair use analysis was whether the purpose and character of the use was "transformative."

It has some neat examples and explanation.

[0] https://www.nolo.com/legal-encyclopedia/fair-use-what-transf...


There are far more current precedents that apply here, and they do not trend in Github's favor -- as I noted previously, Williams v. Gaye (9th Cir. 2017) is going to be very interesting in this case. I am sure several people in Microsoft's legal department set parameters on the model training and that they felt that they were standing on solid ground, but I am also sure that there are a few associate professors in various law schools around the country who are salivating at the opportunity to take a run against this and make a name for themselves.


> Is everything generated also GPLv2?

Almost certainly not everything.

But possibly things that were spat out verbatim from the training set, which the FAQ mentions happens about 0.1% of the time [1][2]. Another comment in this thread indicated that the model's output is verbatim usable about 10% of the time. So, taking those two numbers together, if you're using a whole generated function verbatim, a bit of caveat emptor re: licensing might not be the worst idea. At least until the origin tracker mentioned in the FAQ becomes available.

[1] https://docs.github.com/en/early-access/github/copilot/resea...

[2] "GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. Here is an in-depth study on the model’s behavior. Many of these cases happen when you don’t provide sufficient context (in particular, when editing an empty file), or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions."


I think this would fall under any reasonable definition of fair use. If I read GPL (or proprietary) code as a human, I still own the code that I later write. If copyright were enforced on the outputs of machine learning models based on all the content they were trained on, it would be incredibly stifling to innovation. Requiring legal access to data for training, but granting full ownership of the output, seems like a sensible middle ground.


Certainly not. If I memorize a line of copyrighted code and then write it down in a different project, I have copied it. If an ML model does the same thing as my brain - memorizing a line of code and writing it down elsewhere - it has also copied it. In neither case is that "fair use".


1) This is not a human; it's software.

2) If I write a program that copies parts of other GPL-licensed software into my proprietary code, does that absolve me of the GPL if the copying algorithm is complicated enough?


Clearly this requires some level of judgement, but that isn't new; determining what is and isn't plagiarism requires a similar judgement call.


What if I put a licence on my GitHub repositories that explicitly forbids the use of my code in machine-learning models?


My interpretation of the GitHub TOS, section D4, is that it gives GitHub the right to parse your code and/or make incidental copies regardless of what your license states.

https://docs.github.com/en/github/site-policy/github-terms-o...

This is the same reason it doesn’t matter if you put up a license that forbids GitHub from including you in backups or the search index.


Then the person training the models wouldn't be legally accessing your code.


And so it begins: we start applying human rights to AIs.

Not a critique of your point, which I was just about to bring up myself.


IMO the closest case is probably the students suing Turnitin a number of years ago, which iParadigms (the Turnitin maker) won [1].

I think this is definitely a gray area, and in some ways iParadigms winning (compared with all the cases decided in favour of e.g. the music industry) shows the different yardsticks being used for individuals and companies.

I'm sure we will see more cases about this.

[1] https://www.plagiarismtoday.com/2008/03/25/iparadigms-wins-t...


Is what a human generates GPLv2 because they learned from GPLv2 code?


What if a human copies GPLv2 code?



When is it copying? What about all those Stack Overflow snippets I copied?!


Congrats, you've just discovered why many employers block or forbid stackoverflow.


OMG.

Is there no such thing as "fair use", as we have in copyright law?


You can train models on copyrighted materials. https://towardsdatascience.com/the-most-important-supreme-co...


IANAL, but my interpretation of the GitHub TOS, section D4, is that it gives GitHub the right to parse your code and/or make copies regardless of what your license states. This is the same reason the GitHub search index isn't GPL-contaminated.


Developers' human brains are also trained on proprietary codebases; then when they quit and go elsewhere, they program using knowledge learned previously, yet you do not sue them.


We kinda have to accept that; we don't have to accept this. You can't do anything about humans learning, but you can do something about one of the biggest corporate tech giants straight up leeching off explicitly public and free work for its own private benefit.


It's not, because the license says nothing about training. I mean, every OSS dev's brain would be under the GPL then.


https://en.wikipedia.org/wiki/Clean_room_design

There are definitely cases where devs avoid even looking at an implementation before creating their own.


I feel like it would be a great open-source product idea.


