RedPajama: Reproduction of LLaMA with friendly license (together.xyz)
864 points by tim_sw on April 17, 2023 | 216 comments


I'm very glad people are starting to push back against claims of various LLMs being open source. I was beginning to be worried that the term would be forcefully redefined in the ML space to mean "weights available." With the kickoff of projects like this and Databricks' Dolly, I'm heartened to see the community saying "no, we are willing to spend the compute to make actually open models."

(While it's true that the actual model code of Llama is properly open source, it's also useless for inference by itself. Claiming these models are open source seems like having your cake and eating it too - you get accolades for "open sourcing" but still get to control what happens with it.)


I can only agree. The number of times we have seen corporations abuse “open source” and “open science” in the context of large language models has been baffling: OPT/LLaMA disallowing commercial usage, BLOOM having an ethical non-open license, GLM having a clause not to “undermine [the People’s Republic of China’s] national security and national unity”, etc. Every single one of these models has been happy to ride on the coattails of the hard work of the open movements by calling itself open, while only paying lip service to the ideals and definitions underpinning them.

While RedPajama has yet to commit to a license (from what I can see, though it is late at night…), they are making all the right noises, and I am hopeful that my prediction will come true: we are about to see the floodgates of truly open models blow open, and OpenAI’s “moat” will prove to be a lot shallower than what they and many others have made us believe over the last six months.


Hi, this is Vipul, I am a co-founder of Together. We plan to release the model weights under Apache 2.0. The amount of creativity that Stable Diffusion unleashed for instance is only really possible with permissive licenses!


Thank you Vipul, you and the others are really doing god’s work and have the full support of myself and my academic research team, who are eager to push the boundaries with data, prompts, and investigations of whatever you release (in fact, we have spent the last couple of months working to produce multi-lingual prompts and enriching the few open models we had so far). Just a very quick point of feedback.

While I am not a lawyer and Apache 2.0 is likely to be unproblematic, I always find it puzzling why people have recently been opting to license non-software artifacts under software licenses (Apache 2.0 in particular). Hopefully you have access to sensible lawyers, but I was always under the expectation that model weights would fall under a license such as CC-BY rather than Apache 2.0. Sadly it has been too long since I read the recommendations and justifications for this, so I cannot find a good reference, but I seem to recall the advice came out of the FSF.


Are you working at all with Stability, Eleuther, or LAION? There have been some rumors that they are doing something similar to this and I'm wondering if this is a duplicated effort.

Either way, huge fan, it would be awesome to have a LLaMA set of weights that are fully open.


“Acknowledgements

We are appreciative to the work done by the growing open-source AI community that made this project possible.

That includes:

    Participants in building the RedPajama dataset including […] LAION.  

    Meta AI — […]. 

    EleutherAI — This project is built on the backs of the great team at EleutherAI — including the source code they provided for training GPT-NeoX. 

    An award of computer time was provided by the INCITE program. This research also used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.”
The answer to your question is right there at the bottom of the page in the linked-to blog post :/



> not to undermine the national security and national unity

this is a required statement to conform with China’s constitution, or the superseding authoritative social contract there.

think of it like if the Patriot Act was an article of the constitution instead of a random law subservient to the constitution, it would negate other parts of the constitution that we hold near and dear.

this is a useful similarity as both constitutions have assurances of free speech

just one has a fatal heavily leveraged clause that undermines all other parts of that constitution and dictates all facets of life


This is interesting, thank you. But then how can any entity in the PRC contribute to open source? Alibaba, Baidu, etc. have released plenty of machine learning code under proper open licenses in the past (not to mention that we have hardware vendors in the PRC contributing to say Linux). The story I heard about GLM was that they were a high enough public profile project that it caught the attention of PRC bureaucrats that pushed for the clause to be included.

Regardless of the cause though, the clause runs afoul of any definition of open out there.


simplest answer is that Alibaba and Baidu have more party members as stakeholders

but it's not likely that any uncontrollable LLM that can start spitting out accurate information, or things unhelpful to Beijing's ethos there, would be allowed to operate

the model or the service filtering the model has to be controlled


> this is a required statement to conform with China’s constitution

But doesn't this mean the model training data also excludes anything critical of China?

For example, does their training data include things like this: https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests... ?


go test it out and let us know; give it a really hard conversation, like asking whether Beijing is sensitive to anything else from the last 34 years


My only caveat here is that I'm actually really curious to see a ruling about whether model weights can be copyrighted.

I don't think the "Open Source" label people are using is accurate, and I heavily agree that a common thing that companies seem to be trying to do in this space is release what are essentially closed models while calling them open, and it's a really dangerous direction for AI to go. So nothing in your comment is wrong.

But it also feels a little bit like ceding ground to just assume that Llama can't be used commercially just because Facebook says it can't. I never signed a EULA with them, that claim depends entirely on whether or not model weights are under copyright (or under some similar form of IP protection, some people have brought up trade secrets).

And I don't have a super-strong opinion necessarily, but I'm not sure that's a safe assumption for people to make, and I kind of think it might be good to throw an asterisk next to "can't be used for commercial projects" whenever we talk about Llama's restrictions.

But again, I agree with you, it's not the same as saying Llama is Open Source. Even if it does get ruled as having weaker protections, I don't think the term would really apply.


I haven't done so, but don't you sign an agreement when you ask Facebook for a link to download the weights for LLaMA, which is currently the only officially supported way of getting those weights (https://github.com/facebookresearch/llama/tree/main#llama)?


I haven't used Llama for anything other than playing around to test its capabilities, so I feel fairly comfortable admitting publicly that when I did that testing, I did not download it from Facebook using an official portal, and I didn't sign any agreement about it.

On that subject, to the best of my knowledge, I also haven't signed any kind of agreement with OpenAI. I've done all of my GPT testing through 3rd-party services or portals that don't require signing EULAs to use.


Why would you bother using an "officially supported" way of downloading the weights if they aren't copyrightable anyway?


I got the weights via BitTorrent, so no I didn't sign/agree to anything.


To make an analogy with Linux, the weights are (up until now) a very large closed source firmware blob.


I like Debian's ML definitions: an "only weights available under a libre license" situation is a "ToxicCandy" model. For a truly libre model you have to have libre GPU drivers/firmware, libre training data, libre training code, libre trained models and libre code to get outputs from the model.

https://salsa.debian.org/deeplearning-team/ml-policy


Lawyer here, still trying to wrap my head around all of it -- but it seems as if what may be different here is the extent to which all of this is practically "open-source" or even "literally free, as in freedom and cost etc" (i.e. generally and widely available REGARDLESS of what the law says)

And then coming second appears to be "companies and whoever who seek to make money, and intend to make some sort of legal restriction part of the biz model."

I have no answers or even predictions here except "this is gonna be interesting."


The training data - all 1.2 trillion tokens - can be downloaded by grabbing each of the 2,084 URLs listed here: https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt

I ran a HEAD request against them all to sum up the total file size, and it's 2.67TB total.
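For anyone who wants to repeat that check, here's a minimal sketch of the HEAD-request approach (assuming the Python `requests` library; servers that omit a Content-Length header are simply skipped):

    # Sum the Content-Length of every file listed in urls.txt.
    import requests

    URL_LIST = "https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt"

    urls = requests.get(URL_LIST).text.splitlines()
    total_bytes = 0
    for url in urls:
        head = requests.head(url, allow_redirects=True)
        size = head.headers.get("Content-Length")
        if size is not None:
            total_bytes += int(size)

    print(f"{len(urls)} files, {total_bytes / 1e12:.2f} TB total")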

Here's a Datasette Lite URL that lets you explore the size metadata about those files: https://lite.datasette.io/?json=https://gist.github.com/simo...

And a SQL query that shows the breakdown across the different sources:

https://lite.datasette.io/?json=https://gist.github.com/simo...

Sizes here are in GB:

    common_crawl  1341.6166818914935
    c4  806.7667234372348
    github  212.1786002581939
    wikipedia  111.89125544670969
    book  100.43162744678557
    arxiv  87.35323827341199
    stackexchange  74.54870238155127
Common Crawl is in there a few times - they have the following folders:

    common_crawl/2020-05 198 files
    common_crawl/2021-04 176 files
    common_crawl/2023-06 175 files
    common_crawl/2022-05 157 files
    common_crawl/2019-30 153 files
And then C4 as well, which is "a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset": https://paperswithcode.com/dataset/c4



Hi! I'm the VP of Engineering at Together. Thanks for writing up these instructions! FYI, you can also download all the files with one wget command:

  wget -i https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
This is also mentioned on the dataset card for redpajama-data-1T on Huggingface [1].

[1]: https://huggingface.co/datasets/togethercomputer/RedPajama-D...


I made sure to include that in my blog post - along with a note that you need 2.67TB of disk space first!


> you need 2.67TB of disk space

The data looks like it should compress pretty well. If you use something like btrfs's transparent compression, I wouldn't be surprised if it all fit in less than 0.75TB of disk space while still being usable to any tool that expects uncompressed data.

Edit: It looks like some of this data is already compressed, so maybe not.


Note that you also need about 5TB of disk for the full decompressed dataset. However, only the Common Crawl files are compressed as jsonl.zst; everything else is uncompressed jsonl.
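If you want to peek inside one of the compressed shards without decompressing it to disk first, a sketch like this works (assuming the Python `zstandard` package; the shard filename is just an example):

    import io
    import json
    import zstandard

    # Stream-decode one Common Crawl shard and print the start of the first record.
    with open("common_crawl/2023-06/example_shard.jsonl.zst", "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            print(record.get("text", "")[:200])
            break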


I am a little concerned that they have only about 60% of the code tokens (GitHub and Stack Exchange). Given that so far the only concrete use case I have for LLMs is coding assistance, I wouldn't want this open source model to be any lower quality in that area.

In your opinion do you think this will hamper the model at all? Or is it still more than enough to get good coding assistance?


Nice catch! We sampled the GitHub dataset to match the total # of tokens seen by LLaMA during training: ~64B tokens (they only pass through 0.64 of their total GitHub dataset, according to the paper). We have a lot of GitHub data and will make it available soon. Note, we also have not built this for compute-optimal training. We are following LLaMA's lead and training on more data for longer to optimize for quality, not compute.
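For the curious, subsampling to a fixed token budget is conceptually simple; here's a rough sketch of the idea (assuming a Hugging Face tokenizer and jsonl records with a "text" field - not the actual RedPajama pipeline):

    import json
    import random

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    TOKEN_BUDGET = 64_000_000_000   # ~64B tokens, as above
    KEEP_PROB = 0.64                # roughly the GitHub sampling rate LLaMA reports

    kept_tokens = 0
    with open("github.jsonl") as src, open("github_sampled.jsonl", "w") as dst:
        for line in src:
            if random.random() > KEEP_PROB:
                continue
            doc = json.loads(line)
            kept_tokens += len(tokenizer.encode(doc["text"]))
            dst.write(line)
            if kept_tokens >= TOKEN_BUDGET:
                break

    print(f"kept ~{kept_tokens / 1e9:.1f}B tokens")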


Thank you for developing the pipeline and amassing considerable compute for gathering and preprocessing this dataset!

I'm not sure if this is the right place to ask about this, but could you consider training an LLM using a more advanced, sparse transformer architecture (specifically, "Terraformer" from this paper https://arxiv.org/abs/2111.12763 and this codebase https://github.com/google/trax/blob/master/trax/models/resea... by Google Brain and OpenAI)? I understand the pressure to focus on training a straightforward LLaMA replication, but of course you see that it's a legacy dense architecture which limits its inference performance. This new architecture is not just an academic curiosity but is already validated at scale by Google, providing 10x+ inference performance boost on the same hardware.

Frankly, the community's compute budget - for training and for inference - isn't infinite, and neither is the public's interest in models that do not have advantage (at least in convenience) over closed-source ones; and so we should utilize both those resources as efficiently as possible. It could be a big step forward if you trained at least LLaMA-Terraformer-7B and 13B foundation models on the whole dataset.


Very good to hear that you are optimizing for inference rather than training. I've tried LLaMA and its various instruction-tuned siblings and have yet to get equivalent performance to GPT-3.5 on coding tasks. Seeing how the base model performed relative to GPT-3 on the various benchmarks gives me hope that the difference is just in RLHF or other fine-tuning steps. I really hope the community is able to get there, especially if the resulting model is able to be quantized with minimal loss.


I wonder if it would make sense to create tokens for each emoji so they don't have to be multi-token. Especially considering people have experimented with using them for makeshift compression.
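It's easy to check how emoji tokenize today; a quick sketch (assuming the `tiktoken` package, with cl100k_base being the encoding used by GPT-3.5/GPT-4):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["hello", "🦙", "👩‍🚀"]:
        ids = enc.encode(s)
        print(f"{s!r}: {len(ids)} token(s) -> {ids}")

Plain words are usually a single token, while many emoji (especially ZWJ sequences) split into several byte-level tokens.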


As mentioned in the post, the smaller models are trained well past "compute-optimal" amounts of data and I would expect are well into diminishing returns. On the other hand, large models are good one-shot and few-shot learners, and might be able to pick up enough context from your prompt alone to be useable, even if it wasn't specifically trained on your use case.


In this context compute optimal isn't quite the same as diminishing returns. If you look at the loss graphs in the Llama paper, you can see that even the curves for the smaller models were still going down at the time they stopped training and weren't anywhere near plateauing yet. LLMs are notoriously data hungry and will take a long time to reach convergence.

Compute optimal here means the point at which it makes sense to move from a smaller to a larger model assuming that: (a) you have a fixed compute budget of FLOPs, and (b) you want to train the best model possible. The problem is that this applies only to training and assumes nothing about the cost of inference. If you actually need to deploy these trained models and support them long-term for hundreds, thousands, even millions of people to use, would you rather deploy a 13B model or a 30B model at the same level of quality, even if the 13B model would be more costly to train?

There is going to be a point at which these models plateau and further improvement will not be possible without moving to a larger model, but Llama doesn't get there quite yet.
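A back-of-envelope sketch of that trade-off, using the common approximations of ~6·params·tokens FLOPs for training and ~2·params FLOPs per generated token for inference (illustrative numbers, not figures from the LLaMA paper):

    def train_flops(params, tokens):
        return 6 * params * tokens

    def infer_flops_per_token(params):
        return 2 * params

    small, large = 13e9, 30e9  # 13B trained "too long" vs. a 30B model

    print(f"13B on 1.4T tokens: {train_flops(small, 1.4e12):.2e} training FLOPs")
    print(f"30B on 0.6T tokens: {train_flops(large, 0.6e12):.2e} training FLOPs")
    print(f"13B inference: {infer_flops_per_token(small):.2e} FLOPs/token")
    print(f"30B inference: {infer_flops_per_token(large):.2e} FLOPs/token")

The two training runs cost about the same, but if they reach similar quality the 13B model is roughly 2.3x cheaper to serve per token - which is exactly the inference-cost argument above.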



This tweet is misunderstanding the papers.


Smaller % of training data doesn't necessarily mean lower quality.


I agree with this as well. Code has been absolutely anemic outside of GPT-3/4. One trick they used was to train on code first and then also use a lot more code than we see even in LLaMA.


Same here. If you believe the following research (which I do), the ability to perform complex reasoning is likely to be from training on code:

https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...

I think it's essential to increase the quantity of code tokens.


No idea!

I wonder how hard it would be to fine-tune something built on RedPajama on further code examples to improve performance there.
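If someone wanted to try, a LoRA-style fine-tune on extra code is fairly cheap to sketch with Hugging Face transformers + peft. The checkpoint name below is hypothetical (no RedPajama weights exist yet), and the dataset path and field names are placeholders:

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "togethercomputer/RedPajama-Base-7B"  # hypothetical checkpoint name
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(base)
    model = get_peft_model(model, LoraConfig(
        r=8, lora_alpha=16, task_type="CAUSAL_LM",
        target_modules=["query_key_value"],  # adjust to the model's attention module names
    ))

    code = load_dataset("json", data_files="code_examples.jsonl")["train"]
    code = code.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                    remove_columns=code.column_names)

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="redpajama-code-lora",
                               per_device_train_batch_size=4,
                               num_train_epochs=1),
        train_dataset=code,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()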


I'm kind of surprised how small that dataset is


Nice. Thanks for the summary.

So ~4x the size of the Pile, any idea how it stacks up in terms of quality to other big datasets?


Interesting they're allowed to use stackexchange. I don't know much about the legalities of scraping. Was this an agreement between them, or is it simply ok to scrape and use the data in a model?


The entire purpose of stackexchange was to create a scrapeable index of questions and answers. The scraper they expected was googlebot, not an LLM trainer, and what they expected it to do was build an index of what questions and answers are located on each of their pages.


https://stackoverflow.com/help/licensing

Doesn't this imply the produced model has to be CC-BY-SA too?


No. It means that if they are doing something that is prohibited by copyright law (without a license) then it needs to be CC-BY-SA.

The only theory under which training this sort of model is remotely legal is that doing so is not prohibited by copyright law in the first place. If that theory is correct they don't need a license, and they don't need to abide by any terms of licenses that they were granted without asking.

If that theory is incorrect, they have to comply with the stackoverflow license, but they also have to not use any of the (massive amounts) of unlicensed training data they are using, and comply with the numerous incompatible licenses other sources of training data are licensed under. In other words it's impossible to do this.


If ‘reading something and using the content to update your Bayesian priors about the world’ is a breach of copyright, then reading things is a breach of copyright. The tricky new world that the LLM opens up is that it lets you distribute an exact copy of the result of having read the thing. That’s something you can’t do with a human mind (although it’s sort of the job description of a ‘teacher’).


That's not likely to be true. An AI can create a work that is infringing, a picture of a Marvel character, for example. But that doesn't make the AI or its weights or its training an infringement.


Humans and machines are distinct with respect to copyright law. A human memorizing a book is legal. A machine scanning a book is creating a new copy and is in (at least some cases) illegal. It is not obvious that just because humans are allowed to learn from things that machines also are.

I tend to favour the view that in this case it is legal (by way of the de minimis doctrine), but I don't think it's a trivial question.


A human memorizing a book is legal; a human reciting that book aloud for an audience is not (performances of plays require licenses to the performing rights of a work, for example).

Distribution is when the issue arises - not consumption and construction of a mental model.

I acknowledge the parallels are imperfect and this all needs to be worked out in court. But it’s possible that at the pace LLMs are developing, by the time courts start addressing these questions we’ll already be questioning whether the distinction between machines and people is as big as we thought.


Copyright law prohibits copying (some exceptions apply), amongst other things, not just distribution.


brb, acquiring the necessary license to read my son a bedtime story.


You (typically) need a license to publicly perform a work, not to read it to your son.


Welcome to this can of worms.

CC-BY-SA content needs attribution too, but I don’t see the(se) model(s) in the current state being able to do so.

I imagine we’re gonna see the IBM PC BIOS/Unix/ReactOS “tainted code” arguments again in court, except this time it is not a human who is more-or-less knowingly responsible for sneaking in copyrighted code.


By that line of reasoning, GitHub copilot would have to be GPL. Until somebody fights about this in court we don't really know. But even in the worst case the CC-BY-SA is one of the easier licenses to fulfill, not much worse than the MIT-licensed code contained in the dataset.


Even if the model doesn’t, where does code written with the aid of an llm end up after the various rulings about the output of Stable Diffusion etc. not being copyrightable at all?


Good that they disclosed it. In one of the places where I worked before, I had to sign a statement that I won't copy code from stackexchange, because of the unclear licensing. That is, the risk that the answer is quoted from or otherwise based on some open-source project, and because that could, in the worst case, force the company to disclose their code publicly.


No need to scrape, you can grab a dump from the Internet Archive: https://archive.org/details/stackexchange


An actually open source LLM would be a game changer. We might need a new license that covers model usage and training, something GPL-like whereby distributing a retrained model requires contributing data back or making it public, but not if you use it privately.

This will definitely accelerate progress in LLM research, productization and safety. Alpaca, Vicuna, GPT4All and others are sporadic representations of this that could become a continuous improvement process were the LLM and its license truly open source.

An interesting possible side effect of a GPL-like license is that AIs become unlikely to be trained on private data, the usual moat that big tech wouldn't want/just can't make public if it were to use those GPL-like licensed models.


Dolly 2.0 is fully open, Apache License and the tuning dataset is employee generated:

https://www.databricks.com/blog/2023/04/12/dolly-first-open-...


Huh? There's plenty of open source LLMs. Pythia, GPT-NeoX, GPT-J, GPT-2, BLOOM-176, are ones I can think of off the top of my head. Pythia is the best performing one IIRC.


Pythia was trained on only 300B tokens and is pretty dumb compared to LLaMA.

Pythia 13B is worse than LLaMA-7B and requires double the resources.


Not all use cases need GPT-4 level performance. I'd argue that even LLaMA-7B is quite limited. Also, new and improved models are being released all the time.


I started keeping a list btw, there are about 20 completely open (Apache 2.0, BSD, MIT, CC-BY) 1B+ parameter foundational LLMs at the moment: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...


Hmm. Would you be able to combine these LLMs? Or are they already supersets of each other?


The solution is simple. We need an updated GPL license that states that the code can't be used in training AIs unless the data and model are also open source. With a coordinated update of all major open source projects, the issue is sorted, as it will force the AI folks to open source their models. Or else they’ll have to stick with generating funny cat pictures.


The problem isn't the license; the problem is whether it's fair use or not. If it's fair use, the license is irrelevant.


That has an even easier fix: no content no problem.


As with original GPL, this would be almost useless in a commercial context.


There are commercial devices that ship with a Linux kernel?


Basically every Android device for starters.


Using a Linux kernel doesn't mean you have to make your whole project GPL, unless your project is specifically kernel code.


Neither would the proposed model license. Just like the kernel's GPL stops at the userspace boundary, the proposed license would only cover the model definition and weights.


I think they mean in terms of enforcement when there's a violation


But do they train the Linux kernel with their customers data?


with both this and https://Open-Assistant.io, I believe we have entered the Stable Diffusion era of large language models


Only if they actually start performing at the level of OpenAI’s models. I’m not a fan of StableDiffusion, but at least their models work at general parity with private offerings. All the LLama derivatives and OpenAssistant stuff performs far below GPT-3.5 for everything I’ve tested.


In my experience, the threshold to be useful is much lower than GPT-3.5. These smaller models can "easily" be finetuned to achieve comparable performance on a specific task. For example, I've achieved promising results for data summarisation and image captioning (BLIP2-based) using Alpaca.

Also, server/hardware costs are still a limiting factor for running and finetuning the larger 33/65B Llama models. Especially, if they can only be used for personal toy projects.


I don’t use LLMs for anything image related, so I can’t speak to their value there, but almost all simpler NLP tasks are IMO better handled using other techniques that predate them. I’ve yet to see an example where fine-tuning is cheaper/more efficient/better performing than older solutions to these problems.


If older techniques work for you, there is of course no reason to switch to LLMs besides general curiosity or to explore what's possible already. That said, in my case I was able to generate much more engaging text summaries of tabular data using a Llama derivative.


Llama itself performs comparably to GPT-3.5 (at least the 30/65B models), but the RLHF of ChatGPT is much better than what the community has produced thus far, and it's tuned to work well without tinkering. There will be open source models with that level of fine tuning in the near future, at which point GPT-4 will mainly be superior for stuff like code that needs the best possible cohesion and accuracy.


I don't think there is a ready made alternative to Midjourney.

Midjourney is way more versatile than SD. If you start getting some fine tuned models on civitai, trained to do well some specific tasks, you can get comparable quality but I haven't seen a single model which is able to replace Midjourney.

Llama is no different, it has ok performance on generic queries but still far away from GPT3.5: if you start fine-tuning you can get good perf on specific tasks.


SD with ControlNet and some other open source plugins is far more flexible than MidJourney. It just has all the typical hurdles of OSS vs. commercial offerings. Default image quality in Midjourney is better in terms of its pedestrian aesthetic biases, but it’s not very interesting as an actual artistic tool. And I say this as someone who doesn’t like either service and used to be a commercial illustrator before moving into Data Science.


Midjourney to me feels like bowling with bumpers

Sure, its very easy to get good results fast, but the tuning that avoids "uglier" images is the same that removes a lot of versatility compared to SD

Also controlnet is a killer feature


ControlNet 1.1 is pretty hectic.


Midjourney also doesn’t have ControlNet functionality like Stable Diffusion now does, which gives SD a huge edge for specifying the posing of a scene.

They’re very similar offerings if you’re willing to put in the work on SD.


You're 100 percent right. People will say control bla bla bla and that's certainly true. You can get a lot more control with Stable Diffusion, but something like 99% of digital comics created with AI art use Midjourney. It's one of the use cases most demanding of control and versatility, and Midjourney is still easily winning. There's a reason for that.


> and midjourney is still easily winning. There’s a reason for that.

Sure, Midjourney is a centralized commercial service with a clear statement that you (as a paid user) own the images you create. While that doesn’t resolve all potential copyright issues (as there are still at least theoretical issues with the underlying dataset), if you are doing something commercial with it like, say, a webcomic from which you derive income, it’s a lot simpler than dealing with the SD ecosystem, where the plethora of models also have different stated usage restrictions, different suppliers (many of which are hobbyists) to keep track of, and more potential avenues of indirect copyright risk as well. For some webcomics, even the base CreativeML Open RAIL-M license itself might be problematic.

This isn’t a technical or quality advantage, but it’s definitely an advantage that would very often tip the balance between two tools if both are minimally adequate to your task.


>I’m not a fan of StableDiffusion

For some technical reason?


No, technically it’s all very impressive. My displeasure with them was their doing a Napster-style maneuver to force artists into accepting AI art generation.


The training was legal, and artists don't have a say under the current law, so your analogy doesn't hold.


Neither of these claims have been truly tested in court and vary at the national level, so no, not really.


LAION is a German company and what StableDiffusion is doing seems to be covered under UrhG § 44b. If artists don't want their work inspected by bots they have the option to put a robots.txt on their site.

https://www.gesetze-im-internet.de/urhg/__44b.html

https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur...


While this may very well be covered, I think the general problem in meatspace is that there was no advance notice given to exercise the option to place the proper robots.txt directives to opt out of having one’s artwork collected for model training before it happened, while still preserving the ability to have one’s artwork findable by search engines and the like. I’m sure there are more than a handful of people who have no idea that a robots.txt file can be used to prevent AI data collection—and some may even be surprised to learn the file that’s been used for search engine crawlers is also going to double for AI crawlers.

To push a bit further, there’s something that just feels particularly off about assuming everyone’s content is up for grabs unless the producers do the work to opt out. I think there’s an especially palpable bit of irony looking at it from the EU’s perspective—where cookies must be opt-in, but grabbing all your copyrighted material so companies can do whatever they like with it places the burden on the owner to opt-out. It just feels backward. Perhaps one should have to expressly opt-in to allowing their work to be accessible as training data. At least then there will be a clear signal that the producer of the work can’t later complain, as they willingly granted permission.


I wonder if these authors would have complained so loudly if they had known that other artists were looking at their output to learn how to create their own work? Absolutely none of them sprung from the womb, tablet in hand, to create their work ex nihilo, based on nothing other than their own entirely original thoughts.


Their art wasn't collected for model training. The #1 artist supposedly being copied in SD1.5 is not in the training data. Artists just don't know how the model works and think you need to put in an "art" image to get an "art" image out, but of course that's not true.


> and think you need to put in an "art" image to get an "art" image out, but of course that's not

While you don't need that to generate an image, it's something SD can actually do extremely well with ControlNet, Textual Inversion, LoRA, img2img and so on.

That's an area where things are going to get interesting in the future, as you can take any image, feed it into SD and produce hundreds of AI images from it. Very easily, without much effort and within minutes. The delineating line between original work and derivative becomes extremely blurry here, as what you are copying is not "the image", but just concepts within the image, that can be a pose, camera angle, scene layout, art style or really anything. You can "copy" it with as much variation as you want, you can remix it with other images, text prompts and so on. Where does "looking at reference" stop and "doing a copyright violation" start?

The spooky part with AI art is that it stops images from being singular entities; with AI you can explode every piece into millions of possible variations. AI is so fast at generating art that a future where we could generate movies in real time might not be far away. It's already fast enough to produce images and text stories faster than you can consume them. There might be a fundamental shift in art consumption ahead of us.


The existence of a robots.txt file has no legal meaning. The lack of one certainly does not mean the content being served is free to use in any way.


It's right there in the law:

"A reservation of use in the case of works which are available online is effective only if it is made in a machine-readable format."


Hm. That to me seems to be quite a badly written law. Is a copyright notice written in plain German in the website footer 'machine-readable'? Is there some definition of 'machine-readable' somewhere?

It's also far from clear to me whether a court would find training an LLM to constitute text and data mining 'for the purpose of gathering information, in particular regarding patterns, trends and correlations'.


> That to me seems to be quite a badly written law.

It's pretty normal for a law to not be specific on the technicalities so they don't have to update the law whenever the software changes. The de facto standard to prevent bots from scraping your sites has been robots.txt for almost 30 years.

If artists didn't mind Google scraping their images, putting them on their site, adding ads and making billions, I really don't see them having much of a justification to call out StableDiffusion for "stealing" their stuff. In general artists would be in a lot of trouble if taking stuff from the Internet would be outlawed, as that's where they get all their reference images from too.

Either way, I am sure we'll see quite a few lawsuits going forward; laws are always open to interpretation, especially when new technology arrives. But long term I really see copyright in general being in a lot of trouble, since derivatives and remixes are becoming completely trivial with AI. Where the original work stops and the copyright violation starts is rather difficult to decide when you can just wander around latent space and create literally thousands of similar images in minutes, with as much or as little variation as you want.


My argument is that although robots.txt is a machine-readable way of asserting reservation of use, it's not the only machine-readable way, and the law does not seem to place a burden on the rights-holder to choose a particular 'machine-readable format'.

While a court would likely conclude that a watermark on an image is not 'machine-readable' (I say likely—OCR technology would however make it possible that a court could find that a watermark is machine readable), I would say that because the law does not require a specific method, I think it might be found that a copyright notice in the footer, or in an image caption, is indeed 'machine-readable'.

On balance, I agree that there's a lot of things we are woefully underprepared for coming up in the very near future on using tools in this way to generate art. The answer is not simply to try and lock up all the art away from the robots—but I don't know what the answer actually is.


None of this voids the terms of international copyright agreements and someone on Hacker News should know better than to claim that a robots.txt on a personal site would cover all instances of an image being scraped. I’m not saying that artists will necessarily come out on the winning end of this battle, but it’s also specious to claim that company says what they did is legal, therefore it is.


"Wait, wait, stop, -- I said stop! -- it turns out that, despite the lack of any legal basis for their opinion in any known jurisdiction, user 'bugglebeetle' on Hacker News disapproves of this activity. Better fold up our tents, boys. It was fun while it lasted."



Do you mean the uncredited use of artists' artwork without paying royalties for the training set, or AI art generation in general?


What I mean is releasing a free service out into the world that allows anyone to effectively pirate an artist’s work. Their intention was obviously to be rewarded by established players for doing this bit of dirty work, forcing artists to accept terms they wouldn’t have otherwise.


> not a fan of StableDiffusion, but at least their models work at general parity with private offerings

I think you're being a bit generous there. Either I'm using it seriously wrong or SD can only generate vague blobs while Midjourney can make some proper stuff. It's a larger difference than GPT 3.5 vs GPT 4.


> Either I’m using it seriously wrong or SD can only generate vague blobs

You are definitely using it wrong, if the alternative is “SD can only generate vague blobs”. Even the base SD models are much better than that (though the strength of the SD ecosystem is the availability of custom checkpoints, hypernetworks, LoRAs, embeddings, ControlNet, etc., not just the base models.)


Went back to do some more tests now, and funny enough I can actually get it to make decent stuff after realizing that it just completely sucks at below 512px (I was initially running it at 128x256 to speed up generation). I guess I should stop listening to advice from morons on reddit who said that lower res + upscaling works fine. Lol.

Not sure why there's even an option to go below 512.


Definitely using it wrong.


SD isn't comparable to Midjourney. 99% of comics created with AI art use Midjourney. It's one of the most glaring use cases needing control, and still nothing. There's a reason for that.


I have seen really convincing comics made with SD, much more convincing than any comics made with MJ, and the reason is really obvious. Models and LoRAs on CivitAI and Huggingface are really good, and the fact that MJ can generate slightly better images does not justify the total lack of control.


Never said you couldn't make impressive stuff with SD but feel free to share those comics.

Models on CivitAI are okay. Cool if you're looking for a certain style and/or want to create something that looks like the training images but style isn't everything.

Midjourney generates much better than "slightly better images" and the very fact you say this just tells me you've not even used the thing in any real capacity.


I am very familiar with MJ and know very well how SD can be used to generate images.

I am the author of submissions such as: https://news.ycombinator.com/item?id=35181433, and I am one of the people responsible for the enthusiasm behind the performance of MJ v5.

But no, MJ is not much better if you know how to use SD, although if what you did with SD was just put a prompt in a huggingface space, I can understand why you say that.

>I never said you can't do impressive things with SD, but feel free to share these comics.

I am arguing that they are better than any comics made with MJ, not that they are simply impressive, that's really the entire point. I know some on Pixiv, you can look them up if you want; I am not linking them for obvious reasons (to say they are NSFW is putting it mildly).


I saw a random perfectly SFW fanart on pixiv just now I was surprised to see was SD-based.

https://www.pixiv.net/en/artworks/107271972

Though, if they're training off official character art that's less cool than reinterpreting it themselves. Means you don't have a "house style".


>But no, MJ is not much better if you know how to use SD, although if what you did with SD was just put a prompt in a huggingface space, I can understand why you say that.

I'm the person behind these - https://huggingface.co/ogkalu I think it's safe to say I know something about SD's capabilities.

>I am arguing that they are better than any comics made with MJ, not that they are simply impressive, that's really the entire point.

Sure, that's why I'm asking you to link these comics that are supposedly better than anything Midjourney has ever produced. With a claim like that, I'm sure you understand wanting to see results.

>You can go look them up on Pixiv if you want, they host some; I am not linking them for obvious reasons (to say they are NSFW is putting it mildly).

So you can't link anything that isn't NSFW on Pixiv? Lol, that just solidifies my point. Frankly, if the best you can come up with is pseudo porn (or maybe not pseudo, lol) on Pixiv (I don't imagine any readers of that will care about the things I'm looking for), then that's not a very good look.


You seem surprised that porn brings innovation, but you shouldn't be: if there is anyone obsessed with creating the best possible illustration, it is indeed a Pixiv user, or more generally a user who wants to create porn of their favorite character. Moreover, I know these comics not because I have a weird obsession with reading comics created by an AI; I know them because they are good enough to have trended as NSFW comics, whereas the comics made with MJ are known not because they are good comics but because they are made with MJ (so it's cool, I guess). So I don't see how that solidifies your point of view, ahah. If you can't control the generation, every panel will look different, a collage of images; that's why the comics made with MJ seem to be known just because they are made with MJ and not because they interest other communities, like the NSFW comic readers on Pixiv. Also for this reason, I have not saved links to these posts; I found them randomly while browsing Pixiv, another reason why you should look for them yourself.


Didn't Open Assistant just announce that they weren't releasing their model weights due to safety concerns? Seems like another "Open" AI initiative.


That was a joke in the release video. The Pythia model is already released at [1] and the deltas for the LLaMa model should be up here [2] in the next few days.

[1] https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-...

[2] https://huggingface.co/OpenAssistant/oasst-llama-based-model...


Unfortunately [2] is just a placeholder for now, but it does look like the intent is to publish the weights.


It's also relatively cheap to make your own LLaMA-30B weights; the real value of OpenAssistant is in the training data, and all of that data has been made available.

The OpenAssistant effort gets an A+ for open source contributions.


The announcement video by Yannic contained a (lengthy) gag to that effect; has it been taken out of context, or did something actually happen now?

https://youtube.com/watch?v=ddG2fM9i4Kk&t=132

It's easy to miss but after the negative build-up he says: "and... I'm kidding!"


Dangerous gag, he said “I’m joking” so quickly it’s very easy to miss. I imagine the commenter is not alone in having that wrong impression.


Oh, ha, yeah this is exactly the gag I fell for. I just noped out of the video and wrote off the project as this was the first I ever heard of them, and their website just has a signup and no downloads I could see.

Too bad my original comment is too old to edit.


Unless something changed, I thought it was that they literally cannot legally release the weights that are based on LLaMA (except maybe with an xor thing) so they’re going to train it based on something else


Is any of the Open Assistant stuff based on LLaMA? I thought they release (at least some version) before LLaMA even dropped?


Yes, there’s also something based on Pythia but it’s a smaller model


IIRC, the video said they will train it on a properly open-source model as well.


There was a dumb joke along those lines in an announcement video, meant as a jab at OpenAI. It's easy to miss the "just kidding". (I did, initially.)


It might be a little late, but I hope datasets start incorporating patent texts as well:

1. It's a large corpus of technical knowledge; 2. The language is written by experts in a field and reviewed many times, and 3. They have technical drawings with labels and references in the text

The only downside I suppose is that sometimes patents are written with "just enough knowledge" to get it granted but not too much to give away the secret sauce. That's not really that different from many scholarly papers though.

To give a sense of scale, the granted patent texts of 2020 (without images) amount to about 160 GB of data, and we have digitized grants going back to at least 1970.


You wouldn't want chatbots to answer you with the kind of language used in patent texts.


LLMs are actually pretty good at translating info in one form into another form.


Now, I don't know if I would rely on it, but I've certainly thought about asking an LLM to write my patent text for me, provided with a technical description.


The Pile already does!

Part of its contents come from the "USPTO Backgrounds" dataset. From The Pile's paper:

> USPTO Backgrounds is a dataset of background sections from patents granted by the United States Patent and Trademark Office, derived from its published bulk archives. A typical patent background lays out the general context of the invention, gives an overview of the technical field, and sets up the framing of the problem space. We included USPTO Backgrounds because it contains a large volume of technical writing on applied subjects, aimed at a non-technical audience.

More details in the paper: https://arxiv.org/pdf/2101.00027.pdf

The Pile: https://pile.eleuther.ai/


I don't know how complete the digitization of old texts is, but if you go to worldwide.espacenet.com, search for "airship" and reverse sort by date you get documents from the 1880s.

In fact I'm downloading a whole batch of patent texts right now because I wanted to experiment with semantic search on patent texts.

Anyone here have any pointers on what the state-of-the-art method for semantic search through a large corpus would be? I've just started researching, and BERT and friends seem to have been popular about 2 years ago, but things move so fast I wouldn't know what I should do now.

What about a medium sized corpus of text, say 100,000 pages of text?


afaik sentence embeddings via sbert are still considered a pretty viable path. This may be what you were already looking at, but there's more info here: https://www.sbert.net/index.html
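A minimal semantic-search sketch along those lines, assuming the sentence-transformers package (the model name and documents are just illustrative):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = [
        "A dirigible airship with a rigid aluminium frame.",
        "A method for vulcanising rubber at low temperature.",
        "An apparatus for steering a balloon by means of propellers.",
    ]
    doc_emb = model.encode(docs, convert_to_tensor=True)

    query_emb = model.encode("lighter-than-air steering mechanism", convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_emb)[0]
    for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {doc}")

For a 100,000-page corpus you'd normally precompute and store the document embeddings once (or put them in a vector index) rather than re-encoding per query.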


That's awesome! Are people thinking about training it for more than just 1 epoch? I believe Galactica showed that training for even 4 epochs is ok. Also, how amazing would it be if the next gen of open-source LLMs increased the context window, like adding 8k more tokens? That's probably expensive, but totally doable.


The issue with more context tokens is that they shoot up inference memory usage


Once this barrier is broken down we'll see a lot of cool things. 32k on GPT-4 is already pretty cool but once we get into hundreds of thousands/millions of tokens of context we'll be able to easily do things only currently achievable with fine tuning and "memory" tricks. Assistants that remember everything you've ever told them, asking detailed questions about large datasets, even complex systems that are bootstrapped from the context.


It's including Common Crawl data 4 or 5 times, does that count?


Has anyone investigated whether OpenCyc can be converted to natural language (presumably English) and then ingested into this? Cyc made an attempt years ago to "encode common sense" and a subset called OpenCyc was released. That might be a great way to kickstart information representation of the real world. The latest version of Cyc is proprietary, but I think OpenCyc is an open subset (though I'm having trouble confirming that, so the licensing may not be good).

Some links: https://github.com/bovlb/opencyc https://github.com/asanchez75/opencyc


My understanding is that LLM and Cyc are fundamentally different forms of AI. Even if you could turn OpenCyc into text rules, once ingested it would just dissolve into the ocean of training text data and would not significantly gain more apparent "common sense" than it already had. Maybe a more interesting combination could be to have both Cyc and LLM working side by side and comparing notes before agreeing on a result.


GPT-4 seems capable of creating CycL output from a text prompt. It might be an interesting guard against hallucination - much like a student being asked to show their work, you're forcing the LLM to go through the steps of framing the problem logically, in a way that's interpretable by the teacher independent from the student's thought processes.

That said, it certainly seems like there hasn't been recent work on hosting the OpenCyc knowledge graph in a reasonably modern way, much less the more recent closed-source work by Cycorp (https://cyc.com/). And it's likely GPT-4 doesn't know the full capabilities beyond whatever tutorials were on the web at the time of its training. If I were Cycorp I'd be seriously looking at developing this kind of hybrid model, with an agent model having access to recall their closed-source examples, as a paid cloud offering; there would likely be many who would desire this best-of-both-worlds.


I've been wondering this for a while now. Cyc has tons of knowledge in a white-box, formal system. If it just had a front-end that could convert from natural language to Cyc knowledge queries and back, we wouldn't have to worry so much about hallucinations, catastrophic forgetting, or trying to fit the entire database in VRAM.


If we were to dream it could also include libgen-text, which I think is as close as it gets to a detailed repository of world knowledge. It's only 1.2TB more. Torrent/Magnet from an older HN post: https://www.offlineos.com/


Love this - I'll happily accept a bit of a quality trade-off for a pure open model. It's a bit like how I'm willing to accept trade-offs to ensure my IoT gear is local only, even if that means loss of cloud convenience.


This is cool, now we just need to locate 1,000,000 A100-80GB equivalent GPU-hours. If we had a SETI@Home type project setup for this, it would be straightforward - only $50K worth of electricity for the 65B model.

Given the immense momentum behind LLaMA, I'm pretty disappointed that Meta won't just open-source it, but I guess reproducing it is better long-term.


I think the time for a folding@home or Berkeley BOINC-style project is now. It would also serve as a backend server farm for all the university-based research activities, thereby ensuring the research outcomes are not beholden to any one company or benefactor.

I remember setting up my PS3 & home desktop for the Folding@home project. It's fair game, especially if I can use the box to heat the room instead of the furnace.


I'm somewhat scared and somewhat amazed by the speed of this progress.


This is huge. I was just checking today on what it would take to get a model similar to LLaMA, since Meta did not share the training code or dataset. Looks like they have figured out how to make the dataset; the main problem there is pre-processing it. The second step is to write the code to train the model, and the final one is to do it cheaply.


Maybe they should use whatever Cerebras used. The whole point of their own LLM release was as a maximum compute/$ demonstration on their platform.

Surely there is a better alternative than a bunch of A100s on AWS...


Yeah. They will use Frontier at Oak Ridge, also known as the most powerful supercomputing system in the world. Maybe it can run some expensive LLM training for once rather than its typical diet of physics simulations and quadratic gene-gene interaction models :)


The name is obviously inspired by the Anna Dewdney children’s books.


My kids love that book, and my oldest had me read it to his preschool class earlier this year.

Here is a much more creative reading by Ludacris [0]

[0] https://www.youtube.com/watch?v=PFtHeo7oMSU


Great to see this, but the dataset is the trickiest part. There is no way to confirm whether this is a good dataset unless a model is actually trained on it. To reproduce LLaMA, you need $2M of compute.


Do you have a calculation that shows where that $2M number comes from, EXACTLY?


https://arxiv.org/pdf/2302.13971.pdf table 15. 1,770,394 A100-80GB hours to train the entire model suite; at the going rate for cloud 8xA100-80GB instances (~$12/hr if you could actually get capacity) that's ~$2.6M, under extremely optimistic assumptions. YMMV on bulk pricing ;) "the more you buy the more you save"
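Written out (GPU-hour figure from table 15 of the paper, hourly rate as quoted above):

    gpu_hours = 1_770_394        # A100-80GB hours for the whole model suite
    price_per_8gpu_hour = 12.0   # USD, ~$12/hr for an 8xA100-80GB instance

    cost = gpu_hours / 8 * price_per_8gpu_hour
    print(f"~${cost / 1e6:.2f}M")  # ~$2.66M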


Hmmm… the values in the 7B model seem feasible. An order of magnitude lower GPU hours, plus presumably the lower parameter count means it probably could fit on a 24GB Radeon RX 7900 XTX, which has higher single precision flops than the A100 and costs $1000 instead of $15,000.

An order of magnitude lower GPU-hour time, plus if you train it for 210 days instead of 21 days, means you could do a 7B model with 20 consumer GPUs which are $1000 apiece. $20k, not counting mainboard, etc. Really not bad. Might even be doable as a volunteer project.


I'm not aware of any efficient transformer training code for AMD cards.

Also most training is done using bfloat, not single precision (which is usually only used for accumulators)


Sure, you would need to rewrite the training code for AMD's ecosystem. If you're using mixed precision training, I suppose you're right about BF16. That puts the relative performance of A100 about 2.5x that of the Radeon RX 7900 XT. May be better to go with the Nvidia GeForce RTX 4090 with a $1600 retail.


It all works with PyTorch and Hugging Face's transformers library out of the box with ROCm.


You would need to compile a few components from source for Navi 31 if you were to try it today, so out-of-the-box is perhaps an overstatement, but it's certainly doable.


Page 4 https://arxiv.org/abs/2302.13971

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

At $4/GPU-hour per A100 80GB GPU, that's $4 * 2,048 * 21 * 24 = $4,128,768.


Hmmm… so a 7 billion parameter model could probably be trained on consumer GPUs for one or two orders of magnitude lower cost, particularly if you didn’t go well beyond Chinchilla-optimal training time.


The whole point of Llama is to go beyond Chinchilla optimal:

> The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference.


I hope they also build a variant dataset that includes high-quality translations and enough material in other languages. Even with the limited non-English content that slipped through LLaMA's training dataset, LLaMA still showed strong (albeit useless) evidence in my testing that it would have excelled at expressing knowledge coherently in other languages.

Obviously this would increase training costs substantially, so I understand why more languages are not included in the base dataset.


hmm, could increasing the amount of data in other languages potentially improve performance within the english portion of the model? from it building deeper understandings of concepts that are found within other languages.


Great initiative. Next, we need a lot of compute! Perhaps Tenstorrent wants to make a good impression?


> we are training a full suite of models, with the first becoming available in the coming weeks.

Sounds like they already have the compute and began training.


They missed the chance to call it OpenPajama


Calling next month's headline: "OpenPajama: RedPajama weights fine-tuned on Literotica and fanfiction.net"


Could be NoPajama as well!


So is the next step for someone to come in and fine-tune on top of it in order to make it a Vicuna? Or can the current Vicuna deltas be applied?


Hi! I lead Product at Together. We will be releasing a full suite of models trained on this data starting with the first models in the coming weeks. We will release RedPajama base models and RedPajama instruction-tuned models. All of the models will be released under the Apache 2.0 license, allowing commercial use.

Therefore, anyone will be able to fine-tune the RedPajama models using Vicuna or other datasets, given they will be fully open-source.

The RedPajama instruction-tuned models will be fine-tuned only with instruction labels from human labelers and OpenChatKit feedback (). We feel this will keep these models fully "clean" for use in commercial applications, without using the output of other commercial models like those used in Alpaca or Vicuna. However, we'll be excited to see all the great fine-tunes created by the open community and are eager to see how close open-source models can get to the quality of leading commercial models over time!!

(*) OpenChatKit: https://huggingface.co/spaces/togethercomputer/OpenChatKit


Yeah, it's pretty trivial to change the base model from LLaMA to this next one. You just have to fine-tune it with the same data previously used to train Vicuna.


There's no model yet, only a dataset.


My understanding is that LLaMA's architecture is open, so the most difficult parts are:

1. Getting data of equal or better quality

2. Securing the funding/hardware required for training

3. Learning/figuring out the training challenges needed to tune the process (the PhD part)

It seems #1 is relatively the lowest-hanging fruit and a prerequisite for the other two, and that's what the project is (rightfully) tackling at this stage. #2 could be solved in many ways, and doesn't require much innovation if the project and the team are solid. Which takes me to #3, which on the other hand seems to be the make-or-break part of the project.

I'm not one to doubt the technical prowess of the RedPajama team and their contributors; I rather see it economically. How can an AI open-source project compete with big tech in attracting the brilliant minds of our generation? It's enough to look at levels.fyi to see the battle is not ... level.

There's a serious economic challenge here to having any sort of sustainable open-source initiative in AI.


As I understand it they have the input data, but next up they are creating the model. I could make a joke about drawing an owl ... but that would be a bit mean. I am really glad people are working on this.

I wonder... who is paying? Will there be restrictions like ethics clauses and suchlike? Not necessarily a bad thing if there are. Will there be restrictions on commercial use?


Hi! I lead Product at Together. We will be releasing a full suite of models trained on this data starting with the first models in the coming weeks. We will release base models and instruction-tuned models. All of the models will be released under the Apache 2.0 license, allowing commercial use.


Thanks, that is exciting! Given these LLMs seem to cost millions of dollars (or hundreds of thousands) to train, how is this funded? Is it a government/research-funded thing, or VC, or philanthropy, for example?


If the name is a reference to Ogden Nash's poem then I am very tickled: https://www.madisonpubliclibrary.org/engagement/poetry/poem-...


I'd guess it's the book Llama Llama Red Pajama: https://openlibrary.org/books/OL24377652M/Llama_Llama_Red_Pa...


Ya seems like it’s this and that’s an awesome name!


I'm holding out for the MadAtMama model.


Someone on HN made a point that weights can't even have copyright; they lack two of the requirements for being copyrightable:

https://news.ycombinator.com/item?id=35508651


Definitely thought this was about the kid's book.


So how do I use this? As someone new to the domain.


You download the 2.76TB of data. Then you run it through LLaMA's training script for a couple of months on 40 NVIDIA A100s, and you should have yourself a pretty fine large language model you could use to host your own ChatGPT service. It'll be significantly worse than ChatGPT, for reasons that aren't yet fully clear, because OpenAI switched its mission from protecting the earth from nefarious AI developments to itself being the origin of possibly nefarious AI developments.


Renting 40 NVIDIA A100s is around $70k per month (on Vultr, I see). So this would only cost $420k for 6 months. Seems doable.

Is 40 A100s enough, though? I am interested in what this would cost.


LLaMA 65B used 2048 80GB A100s for 21 days:

> When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM.[1]

Note that you probably need to budget for double to triple that because things go wrong and it usually takes multiple starts to get a good training run.

Smaller models are cheaper though.

[1] https://arxiv.org/pdf/2302.13971.pdf
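To put the 40-GPU question above in perspective, here's a rough sketch that just scales the paper's numbers, assuming throughput scales linearly with GPU count (which is optimistic):

    # Rough scaling of the LLaMA 65B numbers down to 40 A100s.
    # Assumes throughput scales linearly with GPU count, which is optimistic.
    gpu_hours_65b = 2048 * 21 * 24             # ~1.03M GPU-hours for the 65B run
    days_on_40_gpus = gpu_hours_65b / (40 * 24)
    print(f"{days_on_40_gpus:.0f} days")       # ~1,075 days, i.e. roughly 3 years

    # A 7B model needs roughly a tenth of the FLOPs per token (and LLaMA 7B was
    # trained on fewer tokens), so on the order of months on 40 A100s -- still a
    # serious undertaking, but at least conceivable.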


It pains me to see AMD just sitting on their asses through this incredible development of AI and possibly AGI. If they still can't get their shit together, then they should spin off the discrete GPU division into something purely compute-focused. I believe there is now enough momentum in the AI/ML space to fully develop innovative ideas on the hardware front.


It would be great if this could be done on 3090s. A used 3090 usually costs $500-1000 to buy.


You don't, since they're not done yet. Someone will come up with a way to use it when they're done.


Why don't people compress plain text?

    2.5G filtered_08cdfa755e6d4d89b673d5bd1acee5f6.sampled.jsonl
    834M filtered_08cdfa755e6d4d89b673d5bd1acee5f6.sampled.jsonl.lz4
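For what it's worth, compressing a shard like that is a one-liner; a minimal sketch using the python-lz4 package's lz4.frame module (just one tooling assumption; the lz4 CLI works equally well):

    # Minimal sketch: LZ4-compress one jsonl shard.
    # Assumes the python-lz4 package (lz4.frame); the lz4 CLI works just as well.
    import shutil
    import lz4.frame

    src = "filtered_08cdfa755e6d4d89b673d5bd1acee5f6.sampled.jsonl"
    with open(src, "rb") as fin, lz4.frame.open(src + ".lz4", "wb") as fout:
        shutil.copyfileobj(fin, fout)   # streams the file, so memory use stays low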


Code generation: I wonder about the difference in output given the order of operations between training and fine-tuning. What if, as an example, the model was trained on the documentation and the code base for Python?

Then fine-tuning came from training on actual Python code on GitHub.

If the model understands the Python documentation and the standard library/interpreter implementation, is there then a reduction in the data needed for code generation, thereby reducing the size of the dataset used for code generation?


I saw this in my feed recently which was an interesting analysis on how code training was added as a fine tune (Codex) on a foundational model (GPT-3): https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...

I do wonder if anyone is considering mixing in larger and larger percentages of The Stack https://huggingface.co/datasets/bigcode/the-stack with this or the Pile to get more code and see what happens.

(Likely beyond mere mortals' budgets though.)


Love to see the inclusion of PubMed Central papers, all Pubmed ID abstracts, bioRxiv and medRxiv papers, and NCBI summaries of gene function in the base training set.


> GitHub: GitHub data, filtered by licenses and quality

Does anyone know which licenses are filtered into the dataset?


The description on the linked HuggingFace page[1] says MIT, BSD and Apache.

[1] https://huggingface.co/datasets/togethercomputer/RedPajama-D...


It's better than laundering GPL code, but it still breaks the terms of those licenses as well, namely attribution.


I guess that could potentially be fixed if citation ejection could somehow be implemented into it, which seems at least feasible?


Pyjama singular actually works, but I'm not sure Pajamas can be singular.


I think that at this point, LLM etymology is way more interesting than LLMs themselves.


Too many of these models are just using stuff like Wikipedia. If you want true language comprehension, you need something like Reddit, which contains actual conversation.


@dang The title should be changed from MILA to Mila/IQIA.


How does one just use tokens and train a model? I thought you needed to label each one with what it means?


You are thinking of supervised learning, in which a model is trained to generate labels for inputs.

In self-supervised learning, the training target is a modified version of the input.

Transformers are trained in a self supervised manner. The problem they solve is "given a sequence of N tokens, what is the next most likely token?"

No labels required :)
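A minimal sketch of what that objective looks like in practice (illustrative token IDs only; no particular tokenizer or model is implied):

    # Self-supervised next-token prediction: the "labels" are just the input
    # sequence shifted by one position. Illustrative PyTorch sketch only.
    import torch

    token_ids = torch.tensor([[101, 57, 892, 3, 440, 12]])  # one tokenized sequence
    inputs  = token_ids[:, :-1]   # the model sees tokens 0..N-1
    targets = token_ids[:, 1:]    # and is trained to predict tokens 1..N

    # logits = model(inputs)      # shape: (batch, seq_len, vocab_size)
    # loss = torch.nn.functional.cross_entropy(
    #     logits.reshape(-1, logits.size(-1)), targets.reshape(-1))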


What about the human reinforcement part of it?


There's no traditional human reinforcement.

Models like GPT-3 get turned into models like ChatGPT through RLHF (reinforcement learning from human feedback), by fine-tuning the model further on prompts in the style we'd like them to respond in, typically:

User: question

Bot: Response

This is done by handcrafting data or modifying data from places like Stack Exchange.
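Concretely, each instruction/response pair just becomes more text in that template before tokenization; a toy sketch (the template and names here are illustrative assumptions, not OpenChatKit's actual format):

    # Toy sketch: turn an instruction/response pair into a fine-tuning example.
    # The template and field names are illustrative assumptions only.
    def format_example(question: str, response: str) -> str:
        return f"User: {question}\nBot: {response}"

    example = format_example(
        "Which licenses are included in the GitHub subset?",
        "MIT, BSD and Apache, according to the dataset card.",
    )
    # The formatted string is tokenized and trained on with the same next-token
    # objective as pretraining (often with the loss masked on the prompt tokens).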


It should probably be compressed as 2k-odd separate files and then unzipped as needed; that would likely get you down to around 650GB.


Is there a train/test/eval split available somewhere on this dataset too?


This guy has kids, so we all know he.. nevermind. I love the name being a parent myself.


1.8T tokens is very high...


When will the first code-writing-specific model arrive?




