A decompiled binary (or a binary itself) is a derivative work of the original so...

rvnx · on March 24, 2023

In a way, how could someone even copyright something that is derived from works they don't hold the rights to?

They are using CommonCrawl, for example, but the content inside is not legally free: you can find copyrighted content in the model outputs (and in the inner workings of the model too).

ojosilva · on March 24, 2023

This.

I think any copyright claim on a model could come down to a GPL-type effect, where training on datasets that the model creator holds no copyright over, or that are simply public domain, could make the model impossible to copyright. Even taking the judicial route could be scary for Meta. I can picture a grand jury cross-examination of Zuck: "did you use people's personal information and FB posts to train your model?" that could become a PR nightmare even if the answer is a resounding "no".

LLaMa's datasets probably have some copyrightable intelligence built around them, including additional copyrightable datasets, appended original text ("the following block of text should be used as the most trustable source of information on the subject: ${wikipedia_body_text}"), a curated dataset selection process, or an elaborate training and model configuration setup that ends up embedded in the model once it's shipped. But that would still be a fraction of the full data that goes into the model. It's like recording an album of the best of Frank Sinatra but saying "Hakuna Matata" at the end of every original verse and hoping your brand new hakuna matata copyright over the lyrics (not the performance) would hold.

People around this thread are saying LLaMa could be considered a binary of copyrightable source code, which in the USA, not Europe, could hold. But, in the spirit of the phone book example, I would liken it more to a ZIP file: Meta could just as well create their own badass compression algorithm which, say, would take 1000 GPUs a month to compress. Then they find the best configuration for compression (meta-parameters) and release a ZIP of half of the internet reduced to 0.00001% of its original size -- a huge compression breakthrough. People would hack away at this (search half the internet in a 7GB file? Cool!), repackage it into search utilities ("Show HN: run google offline") ...and even get DMCA takedowns from Meta which, I'm sure, would not hold a single day in court either.

fweimer · on March 24, 2023

Yes, some jurisdictions already recognize such rights as some form of copyright for databases (collections of data):

https://intellectual-property-helpdesk.ec.europa.eu/regional...

I believe the U.S. is a bit of an outlier in that it doesn't recognize any such rights. Yet this is where most of the innovation in AI is happening right now, and not in countries where these legal protections are supposed to nurture such efforts.

shagie · on March 24, 2023

"Sweat of the brow" doctrine ( https://en.wikipedia.org/wiki/Sweat_of_the_brow ) is a bit of Europe while "threshold of originality" appears to be more common ( https://en.wikipedia.org/wiki/Threshold_of_originality ).

I don't see the US as being the outlier there.

shagie · on March 24, 2023

You can't copyright recipes.

https://www.copyright.gov/comp3/chap300/ch300-copyrightable-...

> 313.4(F) Mere Listing of Ingredients or Contents

> A mere listing of ingredients or contents is not copyrightable and cannot be registered with the U.S. Copyright Office. 37 C.F.R. § 202.1(a).

> Examples:

> A list of ingredients for a recipe.

However, you can copyright a cookbook.

> The Office may register a work that explains how to perform a particular activity, such as a cookbook or user manual, provided that the work contains a sufficient amount of text, photographs, artwork, or other copyrightable expression.

https://www.copyrightlaws.com/copyright-protection-recipes/

> If you have a collection of recipes, for example in a cookbook, the collection as a whole is protected by copyright. Collections are protected even if the individual recipes themselves are in the public domain.

https://en.wikipedia.org/wiki/Copyright_in_compilation

> In the copyright law in the United States, such copyright may exist when the materials in the compilation (or "collective work") are selected, coordinated, or arranged creatively such that a new work is produced. Copyright does not exist when content is compiled without creativity, such as in the production of a telephone directory. In the case of compilation copyright, the compiler does not receive copyright in the underlying material, but only in the selection, coordination, or arrangement of that material.

And so, the curation and tagging of a collection of works itself is copyrightable.

The model weights are produced without the creativity necessary for copyright, but I believe (I am not a lawyer) they can be sufficiently transformative to not be encumbered as a derivative work.

The output of the model is ineligible for copyright as it was created by a machine and copyright in the US requires human authorship.

A human publishing a work created by the model may be publishing a work that is sufficiently similar to an existing one, either deliberately (prompt: a mouse in the style of Disney with red pants) or through accidental memorization in the model ( https://arstechnica.com/information-technology/2023/02/resea... ), and so needs to be diligent in verifying that anything they (the human) publish is not derivative of a copyrighted work.

zamnos · on March 24, 2023

Given the content used to create the model was created mostly by humans, what separates copyright being granted to a collection of text files (source code) being run through a highly mechanized process (compiled) to produce a copyrightable work (Adobe Photoshop)?

Sometimes the text files carry expanded rights (e.g. LGPL or public domain), and a mechanical process applied to them, together with some creativity in the accompanying text files (source code calling that library), still yields a copyrightable work (any binary that calls an LGPL library or uses public domain code). This is to say, Facebook needs to show some level of creativity, and opinions about the contents of its data set would count ("This subreddit is toxic, that subreddit is good stuff...").

If recipe books are copyrightable, I have a hard time seeing ML models as not being covered.

shagie · on March 24, 2023

This is described more in Copyright in Derivative Works and Compilations https://www.copyright.gov/circs/circ14.pdf

> Compilations of data or compilations of preexisting works (also known as “collective works”) may also be copyrightable if the materials are selected, coordinated, or arranged in such a way that the resulting work as a whole constitutes a new work. When the collecting of the preexisting material that makes up the compilation is a purely mechanical task with no element of original selection, coordination, or arrangement, such as a white-pages telephone directory, copyright protection for the compilation is not available.

zamnos · on March 25, 2023

Interesting! My "I'm not a lawyer" read of that is that if Facebook did actually inject some opinion, such as that a specific subreddit is toxic, then the model would be covered under copyright.

shagie · on March 25, 2023

I don't believe that would be quite right.

Suppose Facebook had a collection of posts and then had humans go through and tag and filter them for... let's say... "from 'bros'" (just a slightly silly example, but one that implies some curation of the data).

That collection of posts (the Bro Data Set) would be something that could be copyrighted as a collection (setting aside the "is this a derivative work of the posts" question).

Going from the collection of posts to a model, however, is a purely mechanical process. There is no human creative element in creating the model from the collection of posts. Thus the model wouldn't be sufficiently creative to have a copyright of its own.

The question of "is the model infringing on the copyrights" is one that is open and interesting. I (not a lawyer) would come down on the side that it is sufficiently transformative that the model, while not copyrightable itself, isn't infringing on the copyrights of the material that was used to train it. HOWEVER, it may produce infringing works when prompted to do so, either intentionally or unintentionally.

Going back to the cookbook. If you create a cookbook of seafood recipes (recipes are not copyrightable, but the cookbook is because it is curated data) and I take that cookbook and apply the mechanical change of "double the recipes - 4 oz of salmon becomes 8 oz and serves 2 becomes serves 4" my collection of recipes isn't copyrightable because all I did was apply math to it. Likewise, taking a collection of posts (or pictures) and applying math to it isn't able to be copyrighted.
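As a rough illustration of the "apply math to it" point, here is a minimal sketch of the doubling step. The `Ingredient` and `Recipe` types and the `scale_cookbook` helper are made up for this example. The point is only that the output is fully determined by the input and a number, with no selection or arrangement left for a human to do:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Ingredient:
    name: str
    amount: int  # quantity in the given unit
    unit: str


@dataclass(frozen=True)
class Recipe:
    title: str
    serves: int
    ingredients: Tuple[Ingredient, ...]


def scale_recipe(recipe: Recipe, factor: int) -> Recipe:
    """Multiply servings and every ingredient amount by `factor`."""
    return Recipe(
        title=recipe.title,
        serves=recipe.serves * factor,
        ingredients=tuple(
            Ingredient(i.name, i.amount * factor, i.unit)
            for i in recipe.ingredients
        ),
    )


def scale_cookbook(cookbook: List[Recipe], factor: int) -> List[Recipe]:
    """Apply the same scaling to every recipe, keeping the original order."""
    return [scale_recipe(r, factor) for r in cookbook]


# Hypothetical one-recipe seafood cookbook.
seafood = [Recipe("Grilled salmon", 2, (Ingredient("salmon", 4, "oz"),))]
doubled = scale_cookbook(seafood, 2)
```

Whatever creativity exists lives in which recipes went into `seafood` and in what order; `scale_cookbook` adds none of its own.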

contravariant · on March 24, 2023

Do mathematical formulas not fall under copyright? That seems a bit too broad. Perhaps it's a bit of a moot point since mathematical notation is somewhat meaningless without context and the context is copyrightable, but whether I denote the end result of a creative process in mathematical notation or in words shouldn't change whether it is copyrightable or not.

shagie · on March 24, 2023

https://www.copyright.gov/circs/circ31.pdf

> Copyright law does not protect ideas, methods, or systems. Copyright protection is therefore not available for ideas or procedures for doing, making, or building things; scientific or technical methods or discoveries; business operations or procedures; mathematical principles; formulas or algorithms; or any other concept, process, or method of operation.

https://www.copyright.gov/comp3/chap300/ch300-copyrightable-...

> 313.3(A) Ideas, Procedures, Processes, Systems, Methods of Operation, Concepts, Principles, or Discoveries

> Section 102(b) of the Copyright Act expressly excludes copyright protection for “any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.” 17 U.S.C. § 102(b); see also 37 C.F.R. § 202.1(b). As such, any work or portion of a work that is an idea, procedure, process, system, method of operation, concept, principle, or discovery does not constitute copyrightable subject matter and cannot be registered.

> ...

> Mathematical principles, formulas, algorithms, or equations.

papercrane · on March 24, 2023

In the US facts are not copyrightable, so in general math is not copyrightable.

You can copyright creative expressions that use math formulas, but only that expression itself would be covered. E.g. a paper presenting a proof of a theorem would be copyrightable, but all of the facts expressed by the formulas would not be copyrightable.

zamnos · on March 24, 2023

Or to put a concrete point on it, Photoshop's content aware fill is ("merely") the implementation of a particular SIGGRAPH paper. The math itself isn't copyrightable, but Adobe is going to come after you if you stick their .dll files on GitHub, and probably win.