Is IA something different from AI? Or maybe just the French version ("intelligence artificielle" I imagine...)
My feeling is that a lot of Meta's AI/ML work actually ties into the AR/VR long-term dream. How do you make the so-called metaverse alive? By having people design it themselves. They're not going to do that in Maya, that's for sure. But if they could create virtual spaces and virtual people with Holodeck-style natural language instructions...
Haha, I noticed my coworkers do that when they're speaking Hindi or Chinese too. All the key words are in English, so you step in and say "Oh, you guys are talking about that iterator bug?" and they're shocked: "Wow, you speak Chinese?"
Chinese, unlike Japanese or Korean, lacks a consistent phonetic mechanism for "importing" foreign words. Phonetic approximations are, with few exceptions, limited to foreign names. As such, practically every bit of jargon is semantically translated (examples include 电脑 "electronic brain" for computer, 端口 "end opening" for port), but since most programming is done in English-based languages, it's not uncommon for Chinese programmers to be more familiar with the English jargon.
Interestingly, many glosses using Chinese characters (not all, but many) were from Japan: sometimes used phonetically (e.g. https://doi.org/10.1163/9789004438200_008), and sometimes used to translate concepts. Chinese imported many such loan words for technology from glosses that were made in Japan with kanji for words which had no existing equivalent in Chinese (https://www.jstor.org/stable/23754127). But then Japan increasingly gave up coining new words in favor of transliteration, while Chinese continued to have this need. Chinese words that are phonetic approximations of foreign words are perhaps less common, but by no means rare. They're not all from English (e.g., 'grapes' is a transliteration from Indo-Iranian); sometimes you have to look at the Cantonese pronunciation to hear the resemblance to English, and sometimes they're neatly hidden by a choice of word that fits well phonetically _and_ semantically.
I don't know how you'd better express 'port' than end-opening. After all it refers to the place where boats are loaded and unloaded. Honestly the semantic translation seems a far more apt term. I guess there's something like 港 but that's probably just more confusing.
> Honestly the semantic translation seems a far more apt term.
That's interesting. For me, the port metaphor makes me think data port, which seems like a pretty apt parallel to the flow of goods at a shipping port.
Different metaphors for different folks, I suppose.
IMO it highlights just what a commodity the weights are. If all one needs is the weights to reproduce the work, then where is the value? I mean, there is very little moat here. Further, what does it say about consciousness and individuality if we are all simply the values of the weights in our wet neural networks? Or whatever the biological equivalent is?
There's nothing "simply" here. The weights in question are a particular configuration of several gigabytes worth of data. They're not random. Getting anything comparable by randomly generating a number this long is a "total atoms in the universe to the power of total atoms in the universe" kind of a deal.
In abstract terms, those weights are by far the most dense form of meaning we've ever dealt with.
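To put some rough numbers on the "not random" point, here's a back-of-envelope sketch (the 7B-parameter, 16-bit figures are assumptions chosen to match a LLaMA-scale model, not anything from the thread):

```python
import math

# Back-of-envelope: how many distinct weight configurations exist for a
# model with 7 billion parameters, each stored as a 16-bit float?
n_params = 7_000_000_000
bits_per_param = 16

# The count is 2 ** (n_params * bits_per_param); express it as a power
# of 10 to keep it printable.
digits = n_params * bits_per_param * math.log10(2)
print(f"~10^{digits:.0f} possible configurations")
```

That exponent is itself a number with about 34 billion decimal digits, which is the sense in which "guess the weights at random" is an atoms-in-the-universe-to-the-power-of-atoms-in-the-universe kind of deal.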
I think the point is more that the primary edge that they (Meta, OpenAI, etc.) have is compute capacity, since the models are easy to implement (and most of the training data is out there). Once you have the weights, you can do with it as you please.
This is probably why they hate/fight these leaks. Once the model weights are available, they lose much of their competitive advantage.
Could AI just basically brute-force better weights? Say, instead of doing training to get their own weights: take 10 training inputs, test a new set of weights, have GPT-4 rate the quality of the output from each set, then try another, and maybe infer a pattern for how weights could be created without all the work?
I feel like soon, AI will be its own researcher and go far beyond what we ever could do on our own.
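For what it's worth, the scheme being described is essentially random search with an external judge. A toy sketch makes the problem visible; the `score` function here is a made-up stand-in for "have GPT-4 rate the output", and the 5-parameter "model" is purely illustrative:

```python
import random

def score(weights):
    # Hypothetical judge: negative squared distance to a hidden target.
    # In the proposed scheme this role would be played by GPT-4 rating
    # the model's outputs on 10 training inputs.
    target = [0.1, -0.3, 0.7, 0.0, 0.5]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

random.seed(0)
best, best_score = None, float("-inf")
for _ in range(10_000):
    candidate = [random.uniform(-1, 1) for _ in range(5)]
    s = score(candidate)
    if s > best_score:
        best, best_score = candidate, s

print(best, best_score)
```

Even with only 5 parameters, 10,000 blind guesses barely close in on the target; with billions of parameters the search space explodes combinatorially. Gradient-based training works precisely because it doesn't guess: each step uses the structure of the loss to move every weight in a helpful direction.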
That’s like saying that creating a song or a movie has no value because after they are created anyone can download a file with it.
However, it does raise a point: should we allow people to make huge amounts of money by infinitely copying and distributing their own work? Should we be protecting this model?
Even if they were, any data is also only bytes - numbers - and therefore any knowledge, books, any kind of sound or picture would likewise be trivial. But isn't.
Unless you believe that you think with your soul or something, what else could you be other than your quantum state, or some close-enough compressed equivalent?
Speak for yourself. What I've seen of all these language models so far is equal to a 10-year-old with a good vocabulary, a tenuous grip on reality, and no understanding of truth vs. lie or right vs. wrong.
They could do what Microsoft used to do in the 90s-00s: make pirated Windows/Office available (by turning a blind eye to private users) so that they capture and keep the mindshare.
Facebook, which ran roughshod over the law in order to be successful, now uses the law in the exact opposite direction. Ultimately, it's greed: "What is good for me to do to you is not good for you to do to me."
Oculus is between a rock and a hard place. The rock being consoles that want their own native VR solutions and aren't interested in sharing their cut of games etc and the hard place being Steam.
You just can't compete with Steam. It's been tried. All competitors are mercilessly drowned in the giant fucking moat surrounding GabeN's castle.
PC gamers are fiercely loyal, "generally" more informed than other consumers, and absolutely despise any form of dark pattern. Because Steam has always done right by PC gamers, it has benefited from that loyalty for a good two decades now.
Everyone is welcome in GabeN's castle but y'all gotta pay the toll to cross the drawbridge.
Oculus has thrashed SteamVR, because the hardware side of things fell apart and the Quest 2 is a lot more like a console. It was actually pretty successful in 2021. Zuckerberg is doing his best to kill it with random rebrands and by missing why people buy it (stuff like Beat Saber, not weird replacements for the real world), but Vive/Reverb/whatever is nowhere.
Someone should tell him (and all the other metaverse people) that VR is almost always dystopian. It's what everyone gets sucked into when civilization stagnates and there is no opportunity, no culture, and nowhere to go. It belongs in worlds of Malthusian collapse, after a nuclear war, or where inequality is so high the majority of people have reverted to high-tech medieval peasants.
VR has a legitimate niche in gaming but outside that it's just not appealing. It's dystopian and depressing. Nobody wants to spend time in a social network with a helmet on their head being served ads.
The obvious retort is that many people, especially Gen Z, do believe civilization has stagnated, and there is no opportunity, culture, or appealing destinations. I personally think they’re crazy, though I do appreciate them leaving the nice areas empty for me. :)
It seems like VR is less than half of the investment by RL. In Meta's 2022 annual report, they say "Many of our metaverse investments are directed toward long-term, cutting edge research and development for products that are not on the market today and may only be fully realized in the next decade. This includes exploring new technologies such as neural interfaces using electromyography, which lets people control their devices using neuromuscular signals, as well as innovations in artificial intelligence (AI) and hardware to help build next-generation interfaces. ... *in 2023, we expect to spend approximately 50% of our Reality Labs operating expenses on our augmented reality initiatives, approximately 40% on our virtual reality initiatives, and approximately 10% on social platforms and other initiatives.*"
I'm not sure if Horizon falls into "virtual reality" or "social platforms" but it seems to be the latter: "For example, we have launched Horizon Worlds, a social platform where people can interact with friends, ..."
Eh. A lot of creative thinking goes into creating the weights. Photography is just taking a photo, but you still own the rights to it because you clicked the button.
This seems like a big misstep by Meta. I had assumed they were intentionally allowing that torrent to float around, and tacitly encouraging open source to build on their models. It seemed like a way to differentiate themselves from "Open"AI, and I was actually feeling some good will toward them for a second!
Isn't this just to protect them from liability? If someone claims that their LLM hurt them in some way, they can say that they had nothing to do with it and that they tried to prevent its spread.
Also, it could be a maneuver to prevent the genericization of the word LLaMa, which they may want to continue using.
There are definitely some good arguments why it wouldn't be defensible (e.g. the phone book argument), but of course this is all new and remains to be seen.
The other interesting aspect of this is that they're classifying it as unauthorized content distribution. Meta was already distributing the weights, but limited their distribution to "researchers" with approved credentials. It was one of those researchers who leaked the weights originally. So it's not like they were reverse engineered from a binary or exfiltrated out of FB HQ. That might be an important bit of nuance.
The phone book argument is super interesting. I wonder if it would hold, though: Wikipedia suggests it relies on the re-user having a new “selection & arrangement.” Kinda tough to rearrange model weights!
Yeah, it seems that would require a fairly fundamental shift in how copyright is understood in the U.S. Not something that should happen during routine processing of DMCA notices.
The EU is different, it recognizes copyright-like rights in databases and database works, which is why the cavalier attitude of U.S.-oriented organizations to these matters tends to annoy me. For example, the FSF does not actually check that certain non-code data files are legally unencumbered. They merely disclaim any copyright of their own. But for all we know, that could be wishful thinking.
AI weights are as much a mechanical, non-creative result of the training set and the network's initial state as other AI outputs are a mechanical, non-creative result of the prompt and the network's trained state. Except that the input is bigger in the weights case than it would be in the output case for the same model, I don't see a difference.
I agree. But it's not like we could force companies to release the weights of their models. We might obtain them in other ways, such as unauthorized leaks, or reverse engineering. But if the company didn't intend to release them, does that mean they're protectable as a trade secret? And if so, are trade secrets enforceable in the same way as copyrights?
(Also, another element to this case is that GitHub is owned by Microsoft, who has a conflict of interest with Meta in terms of ChatGPT vs. LLaMA)
Is there precedent on model weights being copyrightable in the first place? I suppose the recipients of the DMCA notices are unlikely to be willing to contest it in court, though.
It's an interesting legal question. US copyright is based around expressiveness and originality (which is why phone books and IBM's logo are not copyright protected.)
An argument might be made that the curation of data that goes into the training set qualifies, but it might depend on how much expressiveness and originality went into the curation.
For example, I could see a court ruling that the weights for a model trained on "all the good music from the 70s" is copyrightable, as someone had to express what they believed was "good" music, but a model trained on a large percentage of the internet without much curation would not.
Of course, nobody really knows until the courts weigh-in on it.
If model weights become non-copyrightable, it'll lead to an incredible shift in the industry.
When model weights leak, anyone can pick them up and run with them. It's not like code, where you have to set up an entire bespoke infrastructure, microservices, data dependencies, etc. Models are crystalized, perfectly distilled functionality with a single interface.
You'll start to see more leaks, companies building off the work of other companies, etc. Part of me thinks this would lead to faster, more distributed innovation.
Trade secrets might still apply, so you might just end up seeing companies not sharing their models with academics.
Meta might've lost trade secret protection here, as they shared the weights with pretty much anyone with an .edu email address. A court might rule that they didn't take enough steps to keep the model weights secret.
They might, but legal contracts are considered, well, legally binding. A lot of software has free/low-cost educational editions, protected by requiring a .edu address and a licensing agreement, and that's considered enough to protect copyright on that work. Whether ML models are covered by copyright at all, and whether trade secret protection extends to them, remains to be seen.
"In regard to collections of facts, O'Connor wrote that copyright can apply only to the creative aspects of collection: the creative choice of what data to include or exclude, the order and style in which the information is presented, etc.—not to the information itself."
That case ruled the phone book in question was not copyright protected:
> The court held that Rural's directory was nothing more than an alphabetic list of all subscribers to its service, which it was required to compile under law, and that no creative expression was involved. That Rural spent considerable time and money collecting the data was irrelevant to copyright law, and Rural's copyright claim was dismissed.
In theory a curated phone book could be copyrighted, e.g. a hypothetical "Best Restaurants in San Francisco" compilation could be copyrighted. However, a general phone book just listing businesses in alphabetical order does not meet the originality threshold that was laid out in Feist v. Rural.
Most company logos in the US are protected only by trademark, not copyright. The U.S. Copyright Office will reject any registration for a logo that is just a typeface or simple geometric shapes.
A decompiled binary (or a binary itself) is a derivative work of the original source code and assets which are clearly copyrightable.
While untested, model weights are likely closer to the phone book: a collection of facts. Math equations are similarly unable to be copyrighted, and mechanical translations also fall into the "not copyrightable" category.
It may be possible to copyright the collection of training material (the MNIST dataset is copyrighted).
I am not a lawyer, but I believe that it would be challenging to claim copyright on the models as there is no creativity involved in the model just as there is no creativity in a phone book.
In a way, how could someone even copyright something that is derived from works they don't have rights to?
They are using CommonCrawl, for example, but the content inside is not legally free; you can find copyrighted content as part of the model outputs (and in the inner workings of the model too).
I think any copyright claim on a model could come down to a GPL-type effect, where the use of training datasets to which the model creator has no copyright claims, or which are just public domain, could render it impossible to copyright. Even taking the judicial route could be scary for Meta. I can picture a cross-examination of Zuck: "Did you use people's personal information and FB posts to train your model?" That could become a PR nightmare even if the answer is a resounding "no".
LLaMa's datasets probably have some copyrightable intelligence built around them, including additional copyrightable datasets, appended original text ("the following block of text should be used as the most trustworthy source of information on the subject: ${wikipedia_body_text}"), a curated dataset-selection process, or an elaborate training and model-configuration setup that ends up embedded in the model once it's shipped. But it would still be a fraction of the full data that goes into the model. It's like recording a best-of-Frank-Sinatra album but saying "Hakuna Matata" at the end of every original verse and hoping your brand-new hakuna-matata copyright over the lyrics (not the performance) would hold.
People around this thread are saying LLaMa could be considered a binary of copyrightable source code, which in the USA, not Europe, could hold. But, in the spirit of the phone book example, I would liken it more to a ZIP file: Meta could just as well create their own badass compression algorithm which, say, would take 1,000 GPUs a month to run. Then find the best configuration for compression (meta-parameters) and release a ZIP of half the internet reduced to 0.00001% of its original size -- a huge compression breakthrough. People would hack away at this (search half the internet in a 7GB file? Cool!), repackage it into search utilities ("Show HN: run Google offline")... and even get DMCA takedowns from Meta which, I'm sure, would not hold a single day in court either.
I believe the U.S. is a bit of an outlier in that it doesn't recognize any such rights. Yet this is where most of the innovation in AI is happening right now, and not in countries where these legal protections are supposed to nurture such efforts.
> 313.4(F) Mere Listing of Ingredients or Contents
> A mere listing of ingredients or contents is not copyrightable and cannot be registered with the U.S. Copyright Office. 37 C.F.R. § 202.1(a).
> Examples:
> A list of ingredients for a recipe.
However, you can copyright a cookbook.
> The Office may register a work that explains how to perform a particular activity, such as a cookbook or user manual, provided that the work contains a sufficient amount of text, photographs, artwork, or other copyrightable expression.
> If you have a collection of recipes, for example in a cookbook, the collection as a whole is protected by copyright. Collections are protected even if the individual recipes themselves are in the public domain.
> In the copyright law in the United States, such copyright may exist when the materials in the compilation (or "collective work") are selected, coordinated, or arranged creatively such that a new work is produced. Copyright does not exist when content is compiled without creativity, such as in the production of a telephone directory. In the case of compilation copyright, the compiler does not receive copyright in the underlying material, but only in the selection, coordination, or arrangement of that material.
And so, the curation and tagging of a collection of works itself is copyrightable.
The model weights are produced without the creativity necessary for copyright, but I believe (I am not a lawyer) they can be sufficiently transformative not to be encumbered as a derivative work.
The output of the model is ineligible for copyright as it was created by a machine and copyright in the US requires human authorship.
A human publishing a work created by the model may be publishing a work that is sufficiently similar to an existing one, either deliberately (prompt: a mouse in the style of Disney with red pants) or through accidental memorization in the model ( https://arstechnica.com/information-technology/2023/02/resea... ), and needs to be diligent in verifying that anything they (the human) publish is not derivative of a copyrighted work.
Given the content used to create the model was created mostly by humans, what separates copyright being granted to a collection of text files (source code) being run through a highly mechanized process (compiled) to produce a copyrightable work (Adobe Photoshop)?
Sometimes there are expanded rights on the text files (e.g., LGPL, or public domain), yet a mechanical process applied to those files, together with some creativity in accompanying text files (source code calling that library), still achieves a copyrightable work (any binary that calls an LGPL library, or uses public domain code). This is to say, Facebook needs to show some level of creativity, and opinions about the contents of their data set would count ("This subreddit is toxic, that subreddit is good stuff...").
If recipe books are copyrightable, I have a hard time seeing ML models as not being covered.
> Compilations of data or compilations of preexisting works (also known as “collective works”) may also be copyrightable if the materials are selected, coordinated, or arranged in such a way that the resulting work as a whole constitutes a new work. When the collecting of the preexisting material that makes up the compilation is a purely mechanical task with no element of original selection, coordination, or arrangement, such as a white-pages telephone directory, copyright protection for the compilation is not available.
Interesting! My "I'm not a lawyer" read of that is that if Facebook did actually inject some opinion like that some specific subreddit is toxic, then the model would be covered under copyright.
If Facebook were to have a collection of posts, and then had humans go through and tag and filter them for... let's say... "from 'bros'" (a slightly silly example, but one that implies some curation of the data).
That collection of posts (the Bro Data Set) would be something that could be copyrighted as a collection (setting aside the "is this a derivative work of the posts" question).
Going from the collection of posts to a model, however, is a purely mechanical process. There is no human creative element in creating the model from the collection of posts. Thus the model wouldn't be sufficiently creative to have a copyright of its own.
The question of "is the model infringing on the copyrights" is one that is open and interesting. I (not a lawyer) would side on that it is sufficiently transformative that the model, while not being able to be copyrighted itself isn't infringing on the copyrights of the material that was used to train it - HOWEVER it may produce infringing works when prompted to do so either intentionally or unintentionally.
Going back to the cookbook. If you create a cookbook of seafood recipes (recipes are not copyrightable, but the cookbook is because it is curated data) and I take that cookbook and apply the mechanical change of "double the recipes - 4 oz of salmon becomes 8 oz and serves 2 becomes serves 4" my collection of recipes isn't copyrightable because all I did was apply math to it. Likewise, taking a collection of posts (or pictures) and applying math to it isn't able to be copyrighted.
Do mathematical formulas not fall under copyright? That seems a bit too broad. Perhaps it's a bit of a moot point, since mathematical notation is somewhat meaningless without context and the context is copyrightable, but whether I denote the end result of a creative process in mathematical notation or in words shouldn't change whether it is copyrightable or not.
> Copyright law does not protect ideas, methods, or systems. Copyright protection is therefore not available for ideas or procedures for doing, making, or building things; scientific or technical methods or discoveries; business operations or procedures; mathematical principles; formulas or algorithms; or any other concept, process, or method of operation.
> 313.3(A) Ideas, Procedures, Processes, Systems, Methods of Operation, Concepts, Principles, or Discoveries
> Section 102(b) of the Copyright Act expressly excludes copyright protection for “any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.” 17 U.S.C. § 102(b); see also 37 C.F.R. § 202.1(b). As such, any work or portion of a work that is an idea, procedure, process, system, method of operation, concept, principle, or discovery does not constitute copyrightable subject matter and cannot be registered.
> ...
> Mathematical principles, formulas, algorithms, or equations.
In the US facts are not copyrightable, so in general math is not copyrightable.
You can copyright creative expressions that use math formulas, but only that expression itself would be covered. E.g. a paper presenting a proof of a theorem would be copyrightable, but all of the facts expressed by the formulas would not be copyrightable.
Or to put a concrete point on it, Photoshop's content aware fill is ("merely") the implementation of a particular SIGGRAPH paper. The math itself isn't copyrightable, but Adobe is going to come after you if you stick their .dll files on GitHub, and probably win.
It's almost certainly true that the collection of "all the good music from the 70s" is copyrightable as a collection, but that doesn't make the weights the result of a creative process.
If anyone's willing to fund the legal battle, let me know. (You can DM me on Twitter: https://twitter.com/theshawwn)
I'd be willing to issue a DMCA counterclaim for llama-dl on the grounds that model weights are not copyrightable. If it's worth settling the question in court, then this seems like a good opportunity.
If Meta has registered the work with the copyright office, the statutory damages in this case, should llama-dl lose, might be quite large.
Check in with an attorney before launching a battle with an opponent who has unlimited resources. There are likely to be many similar test cases in the coming year, perhaps more-readily fought.
IANAL, so this is not legal advice. I consulted a legal expert a few years ago about the status of machine learning models and they said it is really unclear. Apparently, if works are transformed enough that the original is not recognizable anymore, it may not violate copyright. It hinges a lot on whether the original work is reproduced, so if you could get an LLM to spit out copyrighted texts unmodified, then it would most likely be copyright violation. But I think that doesn't really happen much in practice.
On the other hand, Meta can have copyright over the model through 'copyright in compilation', which protects compiled works, regardless of the copyright of the underlying material.
So, I fear that it may be possible to have it both ways. But realistically, I think we'll only know for sure when this is fought out in court.
Disclaimer: again I am not a lawyer, so take this with a grain of salt.
You could also imagine a battle over something that started with LLaMa, but fine-tuned it using a non-sparse method (i.e. changed every single parameter) so it was very marginally altered in its behaviour.
Even if the base model is copyrightable (possibly a big if), there is a valid question of whether a new model, which essentially optimised for something else but used the base model as a computational shortcut to make it far cheaper to solve an optimisation problem, is still covered by the base model's copyright.
Most of the barrier to creating large language models is the computational cost of training, not coming up with the training set data, so if fine-tuning gets around the copyright issues and allows for better FLOSS-licenced fine-tuned models, that would probably be a good thing (although maybe it will decrease the willingness of companies doing training to release models at all).
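The "changed every single parameter" point is easy to see concretely. Here's a toy sketch with made-up numbers (a 2-weight linear model and one fine-tuning example, none of it from any real model) showing that even a single gradient step typically moves every weight:

```python
# "Base model": two weights of a linear model (values are illustrative).
w = [0.5, -0.2]
# One fine-tuning example: input features and target (also made up).
x, y = [1.0, 2.0], 1.3

# Forward pass: prediction of the linear model.
pred = sum(wi * xi for wi, xi in zip(w, x))

# Gradient of squared error (pred - y)**2 with respect to each weight.
grad = [2 * (pred - y) * xi for xi in x]

# One gradient-descent step: every weight is nudged.
lr = 0.01
w_new = [wi - lr * gi for wi, gi in zip(w, grad)]
print(w_new)
```

After one step, neither weight equals its base value, yet the model's behaviour has barely changed. A fine-tuned LLM is the same story at billions-of-parameters scale, which is what makes "is it still a copy of the base model?" such an awkward question.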
Even if this interpretation is correct, it only holds if the law is not changed. But the law is not immutable.
My proclamation could be considered to be in terms of what ought to be, in order for society to be just and to prevent a disproportionate accumulation of power in ultra large corporations, which is detrimental to society.
Yeah, but that's because there is human creative work in the write-ups of the copyrighted work; a court might liken this more to a phone book, which doesn't add any (human) creative expression on top of it.
Not to mention the final product of encyclopedic work and journalism does not internalize the original it is describing in its entirety, like a language model does. In some sense, a language model takes all that there is to be taken from a given resource, and incorporates it into the weights.
It seems intuitively kind of bonkers that they can ignore copyright when generating weights using stuff on the internet, and then turn around and claim copyright on the resulting artifacts.
Tech companies want it both ways. 1. They own the rights to any user content they store and transmit. 2. They are shielded by Section 230, and immune to liability for its misuse.
IDK, it's more like finding recipes from many great restaurant chains, all mushed together by a 5th grader whose uncle stole it from them, on the sidewalk. Looks like a grey area to me legally, but IANAL.
It's a particularly thorny question if data in the training set is copyrighted. Which I presume much of it is, since I was playing with alpaca.cpp and it answered all kinds of pop culture questions I asked it (to varying degrees of accuracy).
They only own the arrangement of what is and isn't in the training set, inasmuch as that training set represents human creativity. The process of training model weights is itself purely mechanical.
The closest that they could get would be trade secrecy violations, but that only punishes the original leaker and anyone working in concert with them. I'm not sure if anyone's successfully managed to get an entire BitTorrent swarm to be considered misappropriating trade secrets. Presumably at some point, when the trade secret has been violated, you can obtain it without misappropriating - otherwise, how does that not just become Copyright 2.0?
Copyright-wise, why would the weights be any different from a video file? At the end of the day, LLaMa's weights and fast_and_furious_11.mp4 are both just strings of binary data that some company made to sell.
US copyright law cares about originality and expressiveness, not labour and cost. It can be very expensive to collect and print a list of every business and their phone number in a book, but the result is not copyrightable in the US.
This is a pretty compelling argument to me that the weights shouldn't be able to be copyrighted. In essence, aren't they a list of numbers you can search through to get a result?
Would it makes sense to say a "give me a rowboat on the water" prompt (search) is the same as a "give me the phone number of company XYZ" search in a phonebook?
What exactly is a prompt anyway? Can you copyright the assembly that a compiler spits out? Is an AI prompt the same as source code and its output is the assembly a compiler would generate? Does that mean the model is a compiler? I assume a compiler can be copyrighted, so then maybe the weights can be copyrighted? Or would it make more sense that the combination of (weights + prompt + seed + output) is copyrightable?
I don't have answers to any of these or know if they're reasonable questions but I'm starting to find this all very fascinating.
Don't forget that transformers, the basic architecture of these large language models, were introduced by Google, and many other basic building blocks were introduced by universities and research centers all over the world. So no, the outcome is not very clear, at all.
I've got disappointing news for you. The copyright protecting your open source software never applied to its function, only its creative expression, and only insofar as the expression was separable from its function.
no, machines are not humans. And nobody stole your software, but they may have been in breach of the license. But because there's no financial damage, you probably don't have the financial standing to sue over it.
How is there no financial damage? If models start replacing existing business cases but are only able to do so because they were trained on copyrighted data relating to those business cases, then the financial damage should be very obvious.
If I make a better search engine than Google and take away their ads business, I have damaged them, but I am not liable.
The argument here is that Facebook does not actually own LLaMA, because they don't own the training data, they didn't have humans curate the training data in a creative way, and the actual training process is purely mechanical. If LLaMA is not copyrightable then you cannot be liable for copying it.
> But financial damage on programs distributed for free is…nothing.
Not necessarily.
For instance, if the program is distributed free for a limited set of purposes under a license, but available for a negotiated license (with payment) for other purposes, then the reasonable market value of a license without the restriction would be actual damages.
2) the value of machine learning training on any one particular piece of source code also approaches zero
If a model replaces a business case for software you were giving away for free, how are you financially harmed? Even if you won a lawsuit you can't demonstrate any financial impact of software you give away for free.
Open source software is famously not costless (gratis) software, so the premise of the argument is false. I may for instance benefit from people using my software and then obtaining paid support from me, or using paid add-ons (the "open core" model).
Correct me if I'm wrong, but I don't think you require financial damages to sue for a violation of the license. Standing is conferred from having your (artificial legal) monopoly on the work as the author violated, not necessarily financially.
This is the whole principle that allows the GPL to work.
That aside, we should really stop misusing "steal". Not only is it legally inaccurate (Dowling v. United States), it's semantically inaccurate as well. The conflation of that with mere copyright infringement is a campaign driven by bad faith actors.
Funny how these bigcorps want to protect the copyright of their copyright infringement machines. Can't wait for someone to train a new model off of theirs and challenge them in court over it. Everything you can slurp off of the internet is fair game, right?
Of course, it’s no different than you browsing and reading the sum total of the Internet yourself which is a natural thing we are all capable of and then copying your brain and giving it to anyone who would like to query your headspace. Little known fact, that’s why the brain has a USB port. /s
> If they think they can catch OpenAI, and they want to charge for AI services, then what they're doing makes sense.
Then they wouldn't release anything to the researchers. The reason the LLaMA weights are spreading in the first place is that they let a big group of people get access; someone was bound to upload them as a torrent.
Seems they gave people with .edu emails access relatively quickly too, researcher or not.
> Then they wouldn't release anything to the researchers.
I'm sure FB has shared enormous amounts of data with researchers in the past. And they've probably had very little trouble with said researchers uploading data to the internet.
I think it's a pretty bad take that FB sharing data with researchers means that they wanted it spread freely. Even if that data is LLM weights.
They were sharing it with everyone. I applied through their google form and put "zero" for previous research papers and NA for my organization. They still sent me a link.
My point obviously wasn't that every researcher leaks data all the time, nor that because Facebook shared data with researcher, they want it to spread freely.
But I'm sure that when sharing more precarious data, like social connections, Facebook has been more selective about who gets access, compared to the LLaMA dataset, which seems to be shared 100% with everyone with a .edu domain, and almost without review for others.
This halfway solution might even be the best of both worlds for Facebook - AI researchers are distracted by the shiny-shiny of gray-market weights, instead of working on truly open models. This might also explain why LLaMA wasn't trained to convergence - they're keeping the converged model for themselves.
Fragmentation at first: you hinder the first mover (in this case OpenAI) from seizing the market completely, and later on you can acquire the necessary pieces to get back into the market, or buy time to close the gap.
Microsoft's current strategy seems to push in this direction: open source certain technologies and acquire important pieces (e.g., GitHub, a stake in OpenAI) to build a bigger picture that they can monetize later.
The claim is slightly misleading since it seems that they are going after repos that distribute Llama weights without going through the official channels.
However, that could be an unfortunate corollary of the fact that in the US, if you do not enforce your IP, you give up your rights over it. Overall, a giant lose-lose, and I wish there were a truly open source model to build on top of.
Copyright (in the US, at least) is a “natural” right. It exists, as soon as the work exists, and registration is not required. It needs to be explicitly transferred and/or assigned. There is no need to "defend" it.
There have been a lot of challenges, based on “derivative copies,” but these generally seem to be from where someone uses a photo done by someone else, in their own creative work (like the Obama "Change" poster). Some of these challenges succeed, some do not. I would assume someone extending these weights might be considered a “derivative.”
Copyright law is odd, and there’s a lot of “fuzziness,” especially with creative works.
I’ve always been a bit skeptical of applying copyright to compiled and opaque binaries, but I guess I’m in the minority.
>I’ve always been a bit skeptical of applying copyright to compiled and opaque binaries, but I guess I’m in the minority.
You're not in the minority - copyright requiring human authorship and creativity is standing law. A compiled binary cannot be copyrighted on its own[0]; the copyright flows from the source to the binaries via derivative works. We just don't normally think about this because, until very recently, all binaries were derivatives of copyrightable human creative expression. Applying a mechanical process to a creative work doesn't make it non-creative, after all.
[0] Hand-assembled binaries would be considered "source code" in this case.
You only lose rights with respect to trademarks. Copyrights do not need to be consistently enforced in order to be enforceable (although if you're not consistent you'll likely have a tough time arguing significant damages)
The real problem, which plagues open source, is the US is extremely financialized, so courts are loathe to grant actual damages for cultural loss due to free stuff getting pirated. There are still statutory damages, though.
My goal was merely to warn everyone in the LLaMA community that Facebook appears to be trying to shut down the ecosystem that sprang up around LLaMA since the beginning of March.
For a bit of background, I created llama-dl on March 5.
Since the repo is now offline, you can find an archived version of the README here: https://archive.is/7t3it
The intent with llama-dl was to kickstart an open source movement related to LLaMA. If you're curious about my personal motivations for this, I did an interview with The Verge about that: https://twitter.com/theshawwn/status/1633456289639542789
Over the next two weeks, llama-dl grew to 3k stars, and (according to my bucket metrics) distributed 4M files. Thanks to the availability of a reliable, high-speed download link to LLaMA, other hackers were able to launch projects such as Dalai:
Dalai has been making headlines all over the place, and especially on ML tiktok. (ML tiktok is surprisingly interesting.)
When Facebook knocked llama-dl offline via DMCA on the 20th, my primary concern was to ensure that Dalai stayed up. After all, the whole point of llama-dl was to encourage the creation of a "killer app" such as Dalai.
This ensures the availability of LLaMA in the short term. However, there's a broader issue at stake.
The question is whether model weights themselves can be copyrighted. It might seem obvious that since compiled binaries can be copyrighted, ML models should also be able to be. But the U.S. Copyright Office recently denied copyright to AI generated outputs: https://www.smithsonianmag.com/smart-news/us-copyright-offic...
> Both in its 2019 decision and its decision this February, the USCO found the “human authorship” element was lacking and was wholly necessary to obtain a copyright, Engadget’s K. Holt wrote. Current copyright law only provides protections to “the fruits of intellectual labor” that “are founded in the creative powers of the [human] mind,” the USCO states.
If the model output isn't copyrightable, is the model itself copyrightable?
It's an interesting and important question, and answering it in court is a necessary step. The outcome will determine how models are treated over the next decade.
Now, all that said, Facebook is proceeding under the (untested) assumption that LLaMA is copyright Meta. If that assumption is correct, then they're well within their legal rights to issue these DMCAs. Llama-dl was little more than a bash script pointing to a download link, yet that's sufficient grounds for DMCA, since the whole point of llama-dl was to circumvent a copyright protection mechanism.
My overall goal here is to simply bring awareness to all of these issues. We're entering an era of closed-source ML. I think the history of computing shows that open source is generally a better bet.
Facebook, if you're reading this, I urge you to reconsider your approach. You had the opportunity to gain an incredible amount of momentum. By killing it off, you're sacrificing your foothold into the hearts and minds of ML hackers. Wouldn't it be a better idea to harness the ecosystem rather than stomp it out of existence? There are so many ways this can facilitate your business in a positive way. Are you sure that being an adversary to your own community is the best way forward?
FWIW, if the model can be copyrighted, and Facebook were to own the copyright--as opposed to the people who own any copyrights on the training data--then it is almost certainly a derivative work of the code Facebook released... under the GPL.
(edit:) Any other interpretation would require Facebook to have provided some other input with expressive intent; and, AFAIK, they did not. (Mere random perturbations of your computational equipment or a random seed that you chose at random without even any attempt at curation are not adding expressive intent.)
Why would that be? Images made using GIMP aren't automatically licensed under the GPL. So I don't see the argument for GPL licensed training code infecting the resulting model.
But like, to repeat some of it in a different way for this slightly different context: for that argument to work, Facebook would have to be actually doing something as the author that wasn't just automated in this codebase.
Like, if you use GIMP to work on an image, and then I download GIMP and merely run it... it doesn't do the same thing, right? You--the artist--were actually important in that story, because you provided the expressive intent that led to the resulting image.
But, in this case, the model is merely the result of running that code, not someone using that code as part of their own work: the model is a reasonably-deterministic output of code licensed under the GPL being run on data notably owned by people other than Facebook.
Imagine if, instead, you wrote a program that used GIMP to automatically create a really fancy image. You put a lot of work into the script to generate that image... and then you released that code under the GPL. I think you would be hard-pressed to argue that the GPL wouldn't have something to say about this output.
(I could maybe see an attempt at an argument that the GPL is an awkward license to apply to things that aren't programs; but, a model is in fact a program designed to execute in an interpreter on still yet other data as input, which makes this whole thing feel like a program which algorithmically generates the code for another program, which is actually quite a common use case for the GPL.)
To be clear: I think Facebook doesn't own the copyright, not that the GPL infected it; but, if Facebook DID own the copyright, AFAIK their only expressive input comes in the form of this GPL codebase. (See above linked comment for more exposition of the possibilities. Also see that comment for a more extensive "IANAL" disclaimer, but: I am not a lawyer, no matter how much I focus on copyright issues.)
I was thinking about this the other day and assumed they would just relicense it. Guess I was wrong. What's the point? The Streisand effect ensures it will be the most sought-after model while they're making a big deal out of it, and its value is going to depreciate quickly due to the overall pace of AI.
It kind of doesn't matter. Llama is great because it lets everyone experiment, optimize and learn with LLMs locally. Once Stability releases their model everyone will just drop llama and quickly apply all their new tools and learnings to that model.
There is a risk of the community getting "entrenched" in an old model/architecture.
In stable diffusion land, for instance, the community is pretty much stuck on the "old" architecture. Newer innovations, like Huggingface diffusers and various optimizations derived from that like PEFT, torch.compile support, AITemplate/TensorRT compilation and various other bits are largely unused.
They are also pretty much stuck on SD 1.5, even though 2.1 is a good base for finetuning.
Pretty interesting that everyone here is overwhelmingly against Meta for trying to keep their models private, but the other company (the one with "open" in their name) is applauded for doing exactly this.
While true, I also believe OpenAI at least has a chance of their model being proprietary. So, while I am angered by OpenAI, I am unsure I could make the same to-me-slamdunk argument for why they have no standing for a DMCA on their weights (were such leaked)... whereas, for Facebook, either the weights can't be copyrighted at all, they can be copyrighted but the copyright is owned by the people who made the training data, or--and this is key--the model is a derivative work of the code they released under GPL.
My first thought was “I wonder if Alpaca 7B could generate a plausible counter-notice”. Probably not, but I bet if GPT-4’s weights were leaked it could defend itself against DMCA claims.
This is a good thing. Many people are of the opinion that model weights don’t have copyright independent of the source code used to generate them. The source code and training data are all open. The model weights are a deterministic function of those. Claiming they have independent copyright is essentially the same as claiming copyright in a binary you compiled from open source code. Seems like a pretty good chance to get a court to agree with this.
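The determinism point is easy to demonstrate concretely. Here's a minimal sketch using a hypothetical one-parameter toy model (obviously nothing like LLaMA's actual training code): given identical code, identical data, and an identical seed, two training runs produce bit-for-bit identical weights, with no room for expressive input along the way.

```python
import random

def train(seed, data, steps=100, lr=0.1):
    # Toy "model": a single weight w, fit to data by SGD.
    # Everything downstream of (code, data, seed) is mechanical.
    rng = random.Random(seed)
    w = rng.random()  # random init, fully determined by the seed
    for _ in range(steps):
        x, y = rng.choice(data)
        grad = 2 * (w * x - y) * x  # gradient of squared error
        w -= lr * grad
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples from y = 2x
w1 = train(42, data)
w2 = train(42, data)
assert w1 == w2  # identical runs: no expressive choice was made anywhere
```

Whether a court would accept "deterministic function of inputs" as dispositive is a separate question, but the mechanical nature of the transformation is not in doubt.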
There is a decent chance that model weights are not copyrightable (and I think that's the clearest reading of the law). I don't think they're a compiled work; at most they might be patentable. Particularly in situations where multiple models might converge toward the same weights, doesn't that seem more like a discovery than a creative act?
That being said, I don't feel like I know for certain how a court would rule. And I wouldn't like to fight Facebook about it in court regardless, that sounds like a pretty bad time even if the court sides with you. But on the other hand, Facebook might not be keen to test this either.
Any news on whether they're pursuing downstream weights? That's a really interesting question for me - if you started at the LLaMA weights and arrived somewhere else by supplying additional training data and computation, any infringement claim would be very, very murky. I also think it would be practically difficult to prosecute such a claim because it would necessarily involve comparative questions about the consumption of unlicensed training data to generate the original weights. It would certainly make for interesting law.
Curious timing, given that the US copyright office itself does not consider these weights to be copyrightable per its statements just a few days ago. I hope someone takes Facebook to court.
People are arguing about whether model weights can be copyrighted, and who knows how courts will eventually rule. But if a site gets a DMCA takedown notice, they are taking a huge gamble if they don't comply: if courts do decide that Meta's copyright is valid, the monetary damages for an infringer could be huge. So I would expect the demands to be honored by anyone who can't afford to be wiped out in court.
Similarly, if they get a properly-written DMCA counter-notification, they are also taking a huge risk if they do comply with the DMCA request. The only difference is that the notification is coming from Facebook and the counter-notification is coming from random programmers.
So, to me, there are three cases, and none of them give Facebook the standing for this to stick (but, of course, they can throw around a DMCA takedown and then require someone to file a counter-claim and maybe a lawsuit or whatever and that is going to be kind of pointless and likely quite expensive for sillysaurusx to do here; I think this takedown noticing law sucks in no small part because of that power differential).
1) You can't copyright weights. A lot of people believe this. I am not sure this is true. I think it might be that there is a fair use argument that the weights are transformative, but having fair use on your infringement doesn't imply a lack of the original copyright being owned by someone. But like, this might be true, and it is not an unreasonable stance.
2) The model weights are a derived work of the training data. This feels right to me, frankly, as much as it irks a lot of people on Hacker News who are excited to use Copilot (or own shares of Microsoft ;P). In this case, Facebook does not own the copyright, as they purposefully used an open training set from third parties (including Wikipedia and CommonCrawl) as a counter to OpenAI's proprietary one.
3) The model weights are a derived work of the training code, in the same way a binary is a derived work of its source code (vs. #2 where the code is a compiler and the source code is the training set). In this case, Facebook would own the copyright... but the resulting binary program--as distributed in the machine interpretable (and executable) format of the model weights--must be GPL as Facebook amazingly used that as the license for their training code.
The interpretation of events that would allow Facebook to have some hope of arguing that the model weights are their own work--that maybe the code is more of a tool like Photoshop and they have a fair use claim on the training data--would imply something that simply is not true: that they are adding some (hopefully extensive) form of expressive input above and beyond those two inputs, in the way someone using Photoshop does when they remix someone else's art into a transformative work.
However, Facebook definitely isn't doing that: they have provided absolutely no expressive input or intent on top of those two inputs, one of which they do not own and the other of which they chose to license for us under GPL! To make this a bit clearer, separate Facebook into two parties for a moment--one which developed the tooling and the other of which ran it--to determine which of the various parties you think owns this result: if you download a program someone else wrote and click a button to run it on some data someone else owns, you simply do not own the copyright on the result. (edit: I wrote some more on this argument in the following linked comment.)
The only thing I can come up with, if I try really really hard to steelman Facebook here, is: maybe, if I were to release a binary to the world that is a compiled copy of code that I also simultaneously released under GPL, the binary might technically have been compiled from an internal pre-licensed (but identical) copy of said code; and so, while if you compile the same binary from the licensed code you get an identical output that is a derived work with GPL rights, when I do that I don't... but this feels like a perilous argument to make as you are going to have such a hard time showing that this was a reasonable way to infer your intent with the simultaneous release.
(Of course, the person who originally agreed to the terms of service attached to the download they got is a totally different matter, but that doesn't mean there are none. There are a lot of limitations to what people can extract out of you if you violate a contract. So like, by having explicitly agreed to those terms with Facebook that person should not have given the world the weights--not because they are copyrighted but merely because they were secret--and there might be some kind of ramification... but, I do not believe that would possibly apply downstream to sillysaurusx.)
(Note: I am not a lawyer. I spend a ridiculous amount of my time vs. a normal engineer working with copyright and both hearing and making arguments about copyright, including directly to the Copyright Office at the Library of Congress as part of my work on Cydia and my efforts to push back on parts of the DMCA alongside lawyers from the EFF... but like, it would be foolish to read my comment here and then embark on a project to do something that might massively infringe on someone's copyrights without running it past a real lawyer. I, certainly, have actual lawyers.)
one great move that FB could have done would be to ride the wave of positive PR + get all investors hyped:
---> "Meta is a credible alternative to OpenAI, the company is switching from "Meta"-bullshit to an "IA"-first company",
and get the investors to pump the Meta stock,
and then dilute some of the shares to raise some cash (or issue new shares to newly specialized IA hires).
But no, FB is still going after the VR gimmicks and NFTs.
4 billion USD per quarter wasted on Oculus (!), while they could use this money to fund and support a whole ecosystem around LLaMA.