RedPajama: Reproduction of LLaMA with friendly license (together.xyz)
864 points by tim_sw on April 17, 2023 | 216 comments


I'm very glad people are starting to push back against claims of various LLMs being open source. I was beginning to be worried that the term would be forcefully redefined in the ML space to mean "weights available." With the kickoff of projects like this and Databricks' Dolly, I'm heartened to see the community saying "no, we are willing to spend the compute to make actually open models."

(While it's true that the actual model code of Llama is properly open source, it's also useless for inference by itself. Claiming these models are open source seems like having your cake and eating it too - you get accolades for "open sourcing" but still get to control what happens with it.)


I can only agree. The number of times we have seen corporations abuse “open source” and “open science” in the context of large language models has been baffling: OPT/LLaMA disallowing commercial usage, BLOOM having an ethical non-open license, GLM having a clause not to “undermine [the People’s Republic of China’s] national security and national unity”, etc. Every single one of these models has been happy to ride on the coattails of the hard work of the open movements by calling itself open, while only paying lip service to the ideals and definitions underpinning them.

While RedPajama has yet to commit to a license (from what I can see, though it is late at night…), they are making all the right noises, and I am hopeful that my prediction will come true: we are about to see the floodgates of truly open models blow open, and OpenAI’s “moat” will prove to be a lot shallower than what they and many others have made us believe over the last six months.


Hi, this is Vipul, I am a co-founder of Together. We plan to release the model weights under Apache 2.0. The amount of creativity that Stable Diffusion unleashed for instance is only really possible with permissive licenses!


Thank you Vipul, you and the others are really doing god’s work and have the full support of myself and my academic research team, who are eager to push the boundaries with data, prompts, and investigations of whatever you release (in fact, we have spent the last couple of months working to produce multi-lingual prompts and enriching the few open models we had so far). Just a very quick point of feedback.

While I am not a lawyer and Apache 2.0 is likely to be unproblematic, I always find it puzzling why people have recently been opting to license non-software artifacts under software licenses (Apache 2.0 in particular). Hopefully you have access to sensible lawyers, but I was always under the expectation that model weights would fall under a license such as CC-BY rather than Apache 2.0. Sadly it has been too long since I read the recommendations and justifications for this, so I cannot find a good reference, but I seem to recall the advice came out of the FSF.


Are you working at all with Stability, Eleuther, or LAION? There have been some rumors that they are doing something similar to this and I'm wondering if this is a duplicated effort.

Either way, huge fan, it would be awesome to have a LLaMA set of weights that are fully open.


“Acknowledgements

We are appreciative to the work done by the growing open-source AI community that made this project possible.

That includes:

    Participants in building the RedPajama dataset including […] LAION.  

    Meta AI — […]. 

    EleutherAI — This project is built on the backs of the great team at EleutherAI — including the source code they provided for training GPT-NeoX. 

    An award of computer time was provided by the INCITE program. This research also used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.”
The answer to your question is right there at the bottom of the page in the linked-to blog post :/



> not to undermine the national security and national unity

this is a required statement to conform with China’s constitution, or the superseding authoritative social contract there.

think of it like if the Patriot Act was an article of the constitution instead of a random law subservient to the constitution, it would negate other parts of the constitution that we hold near and dear.

this is a useful similarity as both constitutions have assurances of free speech

just one has a fatal heavily leveraged clause that undermines all other parts of that constitution and dictates all facets of life


This is interesting, thank you. But then how can any entity in the PRC contribute to open source? Alibaba, Baidu, etc. have released plenty of machine learning code under proper open licenses in the past (not to mention that we have hardware vendors in the PRC contributing to say Linux). The story I heard about GLM was that they were a high enough public profile project that it caught the attention of PRC bureaucrats that pushed for the clause to be included.

Regardless of the cause though, the clause runs afoul of any definition of open out there.


simplest answer is that Alibaba and Baidu have more party members as stakeholders

but it's not likely that any uncontrollable LLM that can start spitting out accurate information, or things unhelpful to Beijing's ethos there, would be allowed to operate

the model or the service filtering the model has to be controlled


> this is a required statement to conform with China’s constitution

But doesn't this mean the model training data also excludes anything critical of China?

For example, does their training data include things like this: https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests... ?


go test it out and let us know; give it a really hard conversation, like asking whether Beijing is sensitive to anything else from the last 34 years


My only caveat here is that I'm actually really curious to see a ruling about whether model weights can be copyrighted.

I don't think the "Open Source" label people are using is accurate, and I heavily agree that a common thing that companies seem to be trying to do in this space is release what are essentially closed models while calling them open, and it's a really dangerous direction for AI to go. So nothing in your comment is wrong.

But it also feels a little bit like ceding ground to just assume that Llama can't be used commercially just because Facebook says it can't. I never signed a EULA with them, that claim depends entirely on whether or not model weights are under copyright (or under some similar form of IP protection, some people have brought up trade secrets).

And I don't have a super-strong opinion necessarily, but I'm not sure that's a safe assumption for people to make, and I kind of think it might be good to throw an asterisk next to "can't be used for commercial projects" whenever we talk about Llama's restrictions.

But again, I agree with you, it's not the same as saying Llama is Open Source. Even if it does get ruled as having weaker protections, I don't think the term would really apply.


I haven't done so, but don't you sign an agreement when you ask Facebook for a link to download the weights for LLaMA, which is currently the only officially supported way of getting those weights (https://github.com/facebookresearch/llama/tree/main#llama)?


I haven't used Llama for anything other than playing around to test its capabilities, so I feel fairly comfortable admitting publicly that when I did that testing, I did not download it from Facebook using an official portal, and I didn't sign any agreement about it.

On that subject, to the best of my knowledge, I also haven't signed any kind of agreement with OpenAI. I've done all of my GPT testing through 3rd-party services or portals that don't require signing EULAs to use.


Why would you bother using an "officially supported" way of downloading the weights if they aren't copyrightable anyway?


I got the weights via BitTorrent, so no I didn't sign/agree to anything.


To make an analogy with Linux, the weights are (up until now) a very large closed source firmware blob.


I like Debian's ML definitions: an "only weights available under a libre license" situation is a "ToxicCandy" model. For a truly libre model you have to have libre GPU drivers/firmware, libre training data, libre training code, libre trained models and libre code to get outputs from the model.

https://salsa.debian.org/deeplearning-team/ml-policy


Lawyer here, still trying to wrap my head around all of it -- but it seems as if what may be different here is the extent to which all of this is practically "open-source" or even "literally free, as in freedom and cost etc" (i.e. generally and widely available REGARDLESS of what the law says)

And then coming second appears to be "companies and whoever who seek to make money, and intend to make some sort of legal restriction part of the biz model."

I have no answers or even predictions here except "this is gonna be interesting."


The training data - all 1.2 trillion tokens - can be downloaded by grabbing each of the 2,084 URLs listed here: https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt

I ran a HEAD request against them all to sum up the total file size, and it's 2.67TB total.
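For anyone who wants to repeat that check, here's a minimal sketch of the HEAD-request approach (assuming the Python `requests` library; servers that omit a Content-Length header are simply skipped):

    # Sum the Content-Length of every file listed in urls.txt.
    import requests

    URL_LIST = "https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt"

    urls = requests.get(URL_LIST).text.splitlines()
    total_bytes = 0
    for url in urls:
        head = requests.head(url, allow_redirects=True)
        size = head.headers.get("Content-Length")
        if size is not None:
            total_bytes += int(size)

    print(f"{len(urls)} files, {total_bytes / 1e12:.2f} TB total")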

Here's a Datasette Lite URL that lets you explore the size metadata about those files: https://lite.datasette.io/?json=https://gist.github.com/simo...

And a SQL query that shows the breakdown across the different sources:

https://lite.datasette.io/?json=https://gist.github.com/simo...

Sizes here are in GB:

    common_crawl  1341.6166818914935
    c4  806.7667234372348
    github  212.1786002581939
    wikipedia  111.89125544670969
    book  100.43162744678557
    arxiv  87.35323827341199
    stackexchange  74.54870238155127
Common Crawl is in there a few times - they have the following folders:

    common_crawl/2020-05 198 files
    common_crawl/2021-04 176 files
    common_crawl/2023-06 175 files
    common_crawl/2022-05 157 files
    common_crawl/2019-30 153 files
And then C4 as well, which is "a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset": https://paperswithcode.com/dataset/c4



Hi! I'm the VP of Engineering at Together. Thanks for writing up these instructions! FYI, you can also download all the files with one wget command:

  wget -i https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
This is also mentioned on the dataset card for redpajama-data-1T on Huggingface [1].

[1]: https://huggingface.co/datasets/togethercomputer/RedPajama-D...


I made sure to include that in my blog post - along with a note that you need 2.67TB of disk space first!


> you need 2.67TB of disk space

The data looks like it should compress pretty well. If you use something like btrfs's transparent compression, I wouldn't be surprised if it all fit in less than 0.75TB of disk space while still being usable to any tool that expects uncompressed data.

Edit: It looks like some of this data is already compressed, so maybe not.


Note that you also need about 5TB of disk for the full decompressed dataset. However, only the Common Crawl files are compressed as jsonl.zst; everything else is uncompressed jsonl.
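If you want to peek inside one of the compressed shards without decompressing it to disk first, a sketch like this works (assuming the Python `zstandard` package; the shard filename is just an example):

    import io
    import json
    import zstandard

    # Stream-decode one Common Crawl shard and print the start of the first record.
    with open("common_crawl/2023-06/example_shard.jsonl.zst", "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            print(record.get("text", "")[:200])
            break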


I am a little concerned that they have only about 60% of the code tokens (GitHub and Stack Exchange). Given that so far the only concrete use case I have for LLMs is coding assistance, I wouldn't want this open source model to be any lower quality in that area.

In your opinion do you think this will hamper the model at all? Or is it still more than enough to get good coding assistance?


Nice catch! We sampled the GitHub dataset to match the total # of tokens seen by LLaMA during training: ~64B tokens (they only pass through 0.64 of their total GitHub dataset, according to the paper). We have a lot of GitHub data and will make it available soon. Note, we also have not built this for compute-optimal training. We are following LLaMA's lead and training on more data for longer to optimize for quality, not compute.
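For the curious, subsampling to a fixed token budget is conceptually simple; here's a rough sketch of the idea (assuming a Hugging Face tokenizer and jsonl records with a "text" field - not the actual RedPajama pipeline):

    import json
    import random

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    TOKEN_BUDGET = 64_000_000_000   # ~64B tokens, as above
    KEEP_PROB = 0.64                # roughly the GitHub sampling rate LLaMA reports

    kept_tokens = 0
    with open("github.jsonl") as src, open("github_sampled.jsonl", "w") as dst:
        for line in src:
            if random.random() > KEEP_PROB:
                continue
            doc = json.loads(line)
            kept_tokens += len(tokenizer.encode(doc["text"]))
            dst.write(line)
            if kept_tokens >= TOKEN_BUDGET:
                break

    print(f"kept ~{kept_tokens / 1e9:.1f}B tokens")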


Thank you for developing the pipeline and amassing considerable compute for gathering and preprocessing this dataset!

I'm not sure if this is the right place to ask about this, but could you consider training an LLM using a more advanced, sparse transformer architecture (specifically, "Terraformer" from this paper https://arxiv.org/abs/2111.12763 and this codebase https://github.com/google/trax/blob/master/trax/models/resea... by Google Brain and OpenAI)? I understand the pressure to focus on training a straightforward LLaMA replication, but of course you see that it's a legacy dense architecture which limits its inference performance. This new architecture is not just an academic curiosity but is already validated at scale by Google, providing 10x+ inference performance boost on the same hardware.

Frankly, the community's compute budget - for training and for inference - isn't infinite, and neither is the public's interest in models that do not have advantage (at least in convenience) over closed-source ones; and so we should utilize both those resources as efficiently as possible. It could be a big step forward if you trained at least LLaMA-Terraformer-7B and 13B foundation models on the whole dataset.


Very good to hear that you are optimizing for inference rather than training. I've tried LLaMA and its various instruction-tuned siblings and have yet to get equivalent performance to GPT-3.5 on coding tasks. Seeing how the base model performed relative to GPT-3 on the various benchmarks gives me hope that the difference is just in RLHF or other fine-tuning steps. I really hope the community is able to get there, especially if the resulting model is able to be quantized with minimal loss.


I wonder if it would make sense to create tokens for each emoji so they don't have to be multi-token. Especially considering people have experimented with using them for makeshift compression.
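It's easy to check how emoji tokenize today; a quick sketch (assuming the `tiktoken` package, with cl100k_base being the encoding used by GPT-3.5/GPT-4):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["hello", "🦙", "👩‍🚀"]:
        ids = enc.encode(s)
        print(f"{s!r}: {len(ids)} token(s) -> {ids}")

Plain words are usually a single token, while many emoji (especially ZWJ sequences) split into several byte-level tokens.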


As mentioned in the post, the smaller models are trained well past "compute-optimal" amounts of data and I would expect are well into diminishing returns. On the other hand, large models are good one-shot and few-shot learners, and might be able to pick up enough context from your prompt alone to be useable, even if it wasn't specifically trained on your use case.


In this context compute optimal isn't quite the same as diminishing returns. If you look at the loss graphs in the Llama paper, you can see that even the curves for the smaller models were still going down at the time they stopped training and weren't anywhere near plateauing yet. LLMs are notoriously data hungry and will take a long time to reach convergence.

Compute optimal here means the point at which it makes sense to move from a smaller to a larger model assuming that: (a) you have a fixed compute budget of FLOPs, and (b) you want to train the best model possible. The problem is that this applies only to training and assumes nothing about the cost of inference. If you actually need to deploy these trained models and support them long-term for hundreds, thousands, even millions of people to use, would you rather deploy a 13B model or a 30B model at the same level of quality, even if the 13B model would be more costly to train?

There is going to be a point at which these models plateau and further improvement will not be possible without moving to a larger model, but Llama doesn't get there quite yet.
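A back-of-envelope sketch of that trade-off, using the common approximations of ~6·params·tokens FLOPs for training and ~2·params FLOPs per generated token for inference (illustrative numbers, not figures from the LLaMA paper):

    def train_flops(params, tokens):
        return 6 * params * tokens

    def infer_flops_per_token(params):
        return 2 * params

    small, large = 13e9, 30e9  # 13B trained "too long" vs. a 30B model

    print(f"13B on 1.4T tokens: {train_flops(small, 1.4e12):.2e} training FLOPs")
    print(f"30B on 0.6T tokens: {train_flops(large, 0.6e12):.2e} training FLOPs")
    print(f"13B inference: {infer_flops_per_token(small):.2e} FLOPs/token")
    print(f"30B inference: {infer_flops_per_token(large):.2e} FLOPs/token")

The two training runs cost about the same, but if they reach similar quality the 13B model is roughly 2.3x cheaper to serve per token - which is exactly the inference-cost argument above.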



This tweet is misunderstanding the papers.


Smaller % of training data doesn't necessarily mean lower quality.


I agree with this as well. Code has been absolutely anemic outside of GPT-3/4. One trick they used was to train on code first and then also use a lot more code than we see even in LLaMA.


Same here. If you believe the following research (which I do), the ability to perform complex reasoning is likely to be from training on code:

https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...

I think it's essential to increase the quantity of code tokens.


No idea!

I wonder how hard it would be to fine-tune something built on RedPajama on further code examples to improve performance there.
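If someone wanted to try, a LoRA-style fine-tune on extra code is fairly cheap to sketch with Hugging Face transformers + peft. The checkpoint name below is hypothetical (no RedPajama weights exist yet), and the dataset path and field names are placeholders:

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "togethercomputer/RedPajama-Base-7B"  # hypothetical checkpoint name
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(base)
    model = get_peft_model(model, LoraConfig(
        r=8, lora_alpha=16, task_type="CAUSAL_LM",
        target_modules=["query_key_value"],  # adjust to the model's attention module names
    ))

    code = load_dataset("json", data_files="code_examples.jsonl")["train"]
    code = code.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                    remove_columns=code.column_names)

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="redpajama-code-lora",
                               per_device_train_batch_size=4,
                               num_train_epochs=1),
        train_dataset=code,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()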


I'm kind of surprised how small that dataset is


Nice. Thanks for the summary.

So ~4x the size of the Pile, any idea how it stacks up in terms of quality to other big datasets?


Interesting they're allowed to use stackexchange. I don't know much about the legalities of scraping. Was this an agreement between them, or is it simply ok to scrape and use the data in a model?


The entire purpose of stackexchange was to create a scrapeable index of questions and answers. The scraper they expected was googlebot, not an LLM trainer, and what they expected it to do was build an index of what questions and answers are located on each of their pages.


https://stackoverflow.com/help/licensing

Doesn't this imply the produced model has to be CC-BY-SA too?


No. It means that if they are doing something that is prohibited by copyright law (without a license) then it needs to be CC-BY-SA.

The only theory under which training this sort of model is remotely legal is that doing so is not prohibited by copyright law in the first place. If that theory is correct they don't need a license, and they don't need to abide by any terms of licenses that they were granted without asking.

If that theory is incorrect, they have to comply with the stackoverflow license, but they also have to not use any of the (massive amounts) of unlicensed training data they are using, and comply with the numerous incompatible licenses other sources of training data are licensed under. In other words it's impossible to do this.


If ‘reading something and using the content to update your Bayesian priors about the world’ is a breach of copyright, then reading things is a breach of copyright. The tricky new world that the LLM opens up is that it lets you distribute an exact copy of the result of having read the thing. That’s something you can’t do with a human mind (although it’s sort of the job description of a ‘teacher’).


That's not likely to be true. An AI can create a work that is infringing, a picture of a Marvel character, for example. But that doesn't make the AI or its weights or its training an infringement.


Humans and machines are distinct with respect to copyright law. A human memorizing a book is legal. A machine scanning a book is creating a new copy and is in (at least some cases) illegal. It is not obvious that just because humans are allowed to learn from things that machines also are.

I tend to favour the view that in this case it is legal (by way of the de minimis doctrine), but I don't think it's a trivial question.


A human memorizing a book is legal; a human reciting that book aloud for an audience is not (performances of plays require licenses to the performing rights of a work, for example).

Distribution is when the issue arises - not consumption and construction of a mental model.

I acknowledge the parallels are imperfect and this all needs to be worked out in court. But it’s possible that at the pace LLMs are developing, by the time courts start addressing these questions we’ll already be questioning whether the distinction between machines and people is as big as we thought.


Copyright law prohibits copying (some exceptions apply), amongst other things, not just distribution.


brb, acquiring the necessary license to read my son a bedtime story.


You (typically) need a license to publicly perform a work, not to read it to your son.


Welcome to this can of worms.

CC-BY-SA content needs attribution too, but I don’t see the(se) model(s) in the current state being able to do so.

I imagine we’re gonna see the IBM PC BIOS/Unix/ReactOS “tainted code” arguments again in court, except this time it is not a human who is more-or-less knowingly responsible for sneaking in copyrighted code.


By that line of reasoning, GitHub copilot would have to be GPL. Until somebody fights about this in court we don't really know. But even in the worst case the CC-BY-SA is one of the easier licenses to fulfill, not much worse than the MIT-licensed code contained in the dataset.


Even if the model doesn’t, where does code written with the aid of an llm end up after the various rulings about the output of Stable Diffusion etc. not being copyrightable at all?


Good that they disclosed it. In one of the places where I worked before, I had to sign a statement that I won't copy code from stackexchange, because of the unclear licensing. That is, the risk that the answer is quoted from or otherwise based on some open-source project, and because that could, in the worst case, force the company to disclose their code publicly.


No need to scrape, you can grab a dump from the Internet Archive: https://archive.org/details/stackexchange


An actually open source LLM would be a game changer. We might need a new license that covers model usage and training, something GPL-like whereby distributing a retrained model requires contributing data back or making it public, but not if you use it privately.

This will definitely accelerate progress in LLM research, productization and safety. Alpaca, Vicuna, GPT4All and others are sporadic representations of this that could become a continuous improvement process were the LLM and its license truly open source.

An interesting possible side effect of a GPL-like license is that AIs become unlikely to be trained on private data, the usual moat that big tech wouldn't want/just can't make public if it were to use those GPL-like licensed models.


Dolly 2.0 is fully open, Apache License and the tuning dataset is employee generated:

https://www.databricks.com/blog/2023/04/12/dolly-first-open-...


Huh? There's plenty of open source LLMs. Pythia, GPT-NeoX, GPT-J, GPT-2, BLOOM-176, are ones I can think of off the top of my head. Pythia is the best performing one IIRC.


Pythia was trained on only 300B tokens and is pretty dumb compared to LLaMA.

Pythia 13B is worse than LLaMA-7B and requires double the resources.


Not all use cases need GPT-4 level performance. I'd argue that even LLaMA-7B is quite limited. Also, new and improved models are being released all the time.


I started keeping a list btw, there are about 20 completely open (Apache 2.0, BSD, MIT, CC-BY) 1B+ parameter foundational LLMs at the moment: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...


Hmm. Would you be able to combine these LLMs? Or are they already supersets of each other?


The solution is simple. We need an updated GPL license that states that the code can't be used in training AIs unless the data and model are also open source. With a coordinated update of all major open source projects, the issue is sorted, as it will force the AI folks to open source their models. Or else they’ll have to stick with generating funny cat pictures.


The problem isn't the license; the problem is whether it's fair use or not. If it's fair use, the license is irrelevant.


That has an even easier fix: no content no problem.


As with original GPL, this would be almost useless in a commercial context.


There are commercial devices that ship with a Linux kernel?


Basically every Android device for starters.


Using a Linux kernel doesn't mean you have to make your whole project GPL, unless your project is specifically kernel code.


Neither would the proposed model license. Just like the kernel's GPL stops at the userspace boundary, the proposed license would only cover the model definition and weights.


I think they mean in terms of enforcement when there's a violation


But do they train the Linux kernel with their customers data?


with both this and https://Open-Assistant.io, I believe we have entered the Stable Diffusion era of large language models


Only if they actually start performing at the level of OpenAI’s models. I’m not a fan of StableDiffusion, but at least their models work at general parity with private offerings. All the LLama derivatives and OpenAssistant stuff performs far below GPT-3.5 for everything I’ve tested.


In my experience, the threshold to be useful is much lower than GPT-3.5. These smaller models can "easily" be finetuned to achieve comparable performance on a specific task. For example, I've achieved promising results for data summarisation and image captioning (BLIP2-based) using Alpaca.

Also, server/hardware costs are still a limiting factor for running and finetuning the larger 33/65B Llama models. Especially, if they can only be used for personal toy projects.


I don’t use LLMs for anything image related, so I can’t speak to their value there, but almost all simpler NLP tasks are IMO better handled using other techniques that predate them. I’ve yet to see an example where fine-tuning is cheaper/more efficient/better performing than older solutions to these problems.


If older techniques work for you, there is of course no reason to switch to LLMs besides general curiosity or to explore what's possible already. That said, in my case I was able to generate much more engaging text summaries of tabular data using a Llama derivative.


Llama itself performs comparably to GPT-3.5 (at least the 30/65B models), but the RLHF of ChatGPT is much better than what the community has produced thus far, and it's tuned to work well without tinkering. There will be open source models with that level of fine tuning in the near future, at which point GPT-4 will mainly be superior for stuff like code that needs the best possible cohesion and accuracy.


I don't think there is a ready made alternative to Midjourney.

Midjourney is way more versatile than SD. If you start getting some fine tuned models on civitai, trained to do well some specific tasks, you can get comparable quality but I haven't seen a single model which is able to replace Midjourney.

Llama is no different, it has ok performance on generic queries but still far away from GPT3.5: if you start fine-tuning you can get good perf on specific tasks.


SD with ControlNet and some other open source plugins is far more flexible than MidJourney. It just has all the typical hurdles of OSS vs. commercial offerings. Default image quality in Midjourney is better in terms of its pedestrian aesthetic biases, but it’s not very interesting as an actual artistic tool. And I say this as someone who doesn’t like either service and used to be a commercial illustrator before moving into Data Science.


Midjourney to me feels like bowling with bumpers

Sure, its very easy to get good results fast, but the tuning that avoids "uglier" images is the same that removes a lot of versatility compared to SD

Also controlnet is a killer feature


ControlNet 1.1 is pretty hectic.


Midjourney also doesn’t have ControlNet functionality like Stable Diffusion now does, which gives SD a huge edge for specifying the posing of a scene.

They’re very similar offerings if you’re willing to put in the work on SD.


You're 100 percent right. People will say control bla bla bla and that's certainly true. You can get a lot more control with Stable Diffusion, but something like 99% of digital comics created with AI art use Midjourney. It's one of the use cases most demanding of control and versatility, and Midjourney is still easily winning. There's a reason for that.


> and midjourney is still easily winning. There’s a reason for that.

Sure, Midjourney is a centralized commercial service with a clear statement that you (as a paid user) own the images you create. While that doesn’t resolve all potential copyright issues (as there are still at least theoretical issues with the underlying dataset), if you are doing something commercial with it like, say, a webcomic from which you derive income, it’s a lot simpler than dealing with the SD ecosystem, where the plethora of models also have different stated usage restrictions, different suppliers (many of which are hobbyists) to keep track of, and more potential avenues of indirect copyright risk as well. For some webcomics, even the base CreativeML Open RAIL-M license itself might be problematic.

This isn’t a technical or quality advantage, but it’s definitely an advantage that would very often tip the balance between two tools if both are minimally adequate to your task.


>I’m not a fan of StableDiffusion

For some technical reason?


No, technically it’s all very impressive. My displeasure with them was their doing a Napster-style maneuver to force artists into accepting AI art generation.


The training was legal, and artists don't have a say under the current law, so your analogy doesn't hold.


Neither of these claims have been truly tested in court and vary at the national level, so no, not really.


LAION is a German company and what StableDiffusion is doing seems to be covered under UrhG § 44b. If artists don't want their work inspected by bots they have the option to put a robots.txt on their site.

https://www.gesetze-im-internet.de/urhg/__44b.html

https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur...


While this may very well be covered, I think the general problem in meatspace is that there was no advance notice given to exercise the option to place the proper robots.txt directives to opt out of having one’s artwork collected for model training before it happened, while still preserving the ability to have one’s artwork findable by search engines and the like. I’m sure there are more than a handful of people who have no idea that a robots.txt file can be used to prevent AI data collection—and some may even be surprised to learn the file that’s been used for search engine crawlers is also going to double for AI crawlers.

To push a bit further, there’s something that just feels particularly off about assuming everyone’s content is up for grabs unless the producers do the work to opt out. I think there’s an especially palpable bit of irony looking at it from the EU’s perspective—where cookies must be opt-in, but grabbing all your copyrighted material so companies can do whatever they like with it places the burden on the owner to opt-out. It just feels backward. Perhaps one should have to expressly opt-in to allowing their work to be accessible as training data. At least then there will be a clear signal that the producer of the work can’t later complain, as they willingly granted permission.


I wonder if these authors would have complained so loudly if they had known that other artists were looking at their output to learn how to create their own work? Absolutely none of them sprung from the womb, tablet in hand, to create their work ex nihilo, based on nothing other than their own entirely original thoughts.


Their art wasn't collected for model training. The #1 artist supposedly being copied in SD1.5 is not in the training data. Artists just don't know how the model works and think you need to put in an "art" image to get an "art" image out, but of course that's not true.


> and think you need to put in an "art" image to get an "art" image out, but of course that's not

While you don't need that to generate an image, it's something SD can actually do extremely well with ControlNet, Textual Inversion, LoRA, img2img and so on.

That's an area where things are going to get interesting in the future, as you can take any image, feed it into SD and produce hundreds of AI images from it. Very easily, without much effort and within minutes. The delineating line between original work and derivative becomes extremely blurry here, as what you are copying is not "the image", but just concepts within the image, that can be a pose, camera angle, scene layout, art style or really anything. You can "copy" it with as much variation as you want, you can remix it with other images, text prompts and so on. Where does "looking at reference" stop and "doing a copyright violation" start?

The spooky part with AI art is that it stops images from being singular entities; with AI you can explode every piece into millions of possible variations. AI is so fast at generating art that a future where we could generate movies in real time might not be far away. It's already fast enough to produce images and text stories faster than you can consume them. There might be a fundamental shift in art consumption ahead of us.


The existence of a robots.txt file has no legal meaning. The lack of one certainly does not mean the content being served is free to use in any way.


It's right there in the law:

"A reservation of use in the case of works which are available online is effective only if it is made in a machine-readable format."


Hm. That to me seems to be quite a badly written law. Is a copyright notice written in plain German in the website footer 'machine-readable'? Is there some definition of 'machine-readable' somewhere?

It's also far from clear to me whether a court would find training an LLM to constitute text and data mining 'for the purpose of gathering information, in particular regarding patterns, trends and correlations'.


> That to me seems to be quite a badly written law.

It's pretty normal for a law to not be specific on the technicalities so they don't have to update the law whenever the software changes. The de facto standard to prevent bots from scraping your sites has been robots.txt for almost 30 years.

If artists didn't mind Google scraping their images, putting them on their site, adding ads and making billions, I really don't see them having much of a justification to call out StableDiffusion for "stealing" their stuff. In general artists would be in a lot of trouble if taking stuff from the Internet would be outlawed, as that's where they get all their reference images from too.

Either way, I am sure we'll see quite a few lawsuits going forward; laws are always open to interpretation, especially when new technology arrives. But long term I really see copyright in general being in a lot of trouble, since derivatives and remixes are becoming completely trivial with AI. Where the original work stops and the copyright violation starts is rather difficult to decide when you can just wander around latent space and create literally thousands of similar images in minutes, with as much or as little variation as you want.


My argument is that although robots.txt is a machine-readable way of asserting reservation of use, it's not the only machine-readable way, and the law does not seem to place a burden on the rights-holder to choose a particular 'machine-readable format'.

While a court would likely conclude that a watermark on an image is not 'machine-readable' (I say likely—OCR technology would however make it possible that a court could find that a watermark is machine readable), I would say that because the law does not require a specific method, I think it might be found that a copyright notice in the footer, or in an image caption, is indeed 'machine-readable'.

On balance, I agree that there's a lot of things we are woefully underprepared for coming up in the very near future on using tools in this way to generate art. The answer is not simply to try and lock up all the art away from the robots—but I don't know what the answer actually is.


None of this voids the terms of international copyright agreements and someone on Hacker News should know better than to claim that a robots.txt on a personal site would cover all instances of an image being scraped. I’m not saying that artists will necessarily come out on the winning end of this battle, but it’s also specious to claim that company says what they did is legal, therefore it is.


"Wait, wait, stop, -- I said stop! -- it turns out that, despite the lack of any legal basis for their opinion in any known jurisdiction, user 'bugglebeetle' on Hacker News disapproves of this activity. Better fold up our tents, boys. It was fun while it lasted."



Do you mean the uncredited use of artists' artwork without paying royalties for the training set, or AI art generation in general?


What I mean is releasing a free service out into the world that allows anyone to effectively pirate an artist’s work. Their intention was obviously to be rewarded by established players for doing this bit of dirty work, forcing artists to accept terms they wouldn’t have otherwise.


> not a fan of StableDiffusion, but at least their models work at general parity with private offerings

I think you're being a bit generous there. Either I'm using it seriously wrong or SD can only generate vague blobs while Midjourney can make some proper stuff. It's a larger difference than GPT 3.5 vs GPT 4.


> Either I’m using it seriously wrong or SD can only generate vague blobs

You are definitely using it wrong, if the alternative is “SD can only generate vague blobs”. Even the base SD models are much better than that (though the strength of the SD ecosystem is the availability of custom checkpoints, hypernetworks, LoRAs, embeddings, ControlNet, etc., not just the base models.)


Went back to do some more tests now, and funny enough I can actually get it to make decent stuff after realizing that it just completely sucks at below 512px (I was initially running it at 128x256 to speed up generation). I guess I should stop listening to advice from morons on reddit who said that lower res + upscaling works fine. Lol.

Not sure why there's even an option to go below 512.


Definitely using it wrong.


SD isn't comparable to Midjourney. 99% of comics created with AI art use Midjourney. It's one of the most glaring use cases needing control, and still nothing. There's a reason for that.


I have seen really convincing comics made with SD, much more convincing than any comics made with MJ, and the reason is really obvious. Models and LoRAs on CivitAI and Huggingface are really good, and the fact that MJ can generate slightly better images does not justify the total lack of control.


Never said you couldn't make impressive stuff with SD but feel free to share those comics.

Models on CivitAI are okay. Cool if you're looking for a certain style and/or want to create something that looks like the training images but style isn't everything.

Midjourney generates much better than "slightly better images" and the very fact you say this just tells me you've not even used the thing in any real capacity.


I am very familiar with MJ and know very well how SD can be used to generate images.

I am the author of submissions such as: https://news.ycombinator.com/item?id=35181433, and I am one of the people responsible for the enthusiasm behind the performance of MJ v5.

But no, MJ is not much better if you know how to use SD, although if what you did with SD was just put a prompt in a huggingface space, I can understand why you say that.

>I never said you can't do impressive things with SD, but feel free to share these comics.

I am arguing that they are better than any comics made with MJ, not that they are simply impressive, that's really the entire point. I know some on Pixiv, you can look them up if you want; I am not linking them for obvious reasons (to say they are NSFW is putting it mildly).


I saw a random perfectly SFW fanart on pixiv just now I was surprised to see was SD-based.

https://www.pixiv.net/en/artworks/107271972

Though, if they're training off official character art that's less cool than reinterpreting it themselves. Means you don't have a "house style".


>But no, MJ is not much better if you know how to use SD, although if what you did with SD was just put a prompt in a huggingface space, I can understand why you say that.

I'm the person behind these - https://huggingface.co/ogkalu I think it's safe to say I know something about SD's capabilities.

>I am arguing that they are better than any comics made with MJ, not that they are simply impressive, that's really the entire point.

Sure, that's why I'm asking you to link these comics that are supposedly better than anything Midjourney has ever produced. With a claim like that, I'm sure you understand wanting to see results.

>You can go look them up on Pixiv if you want, they host some; I am not linking them for obvious reasons (to say they are NSFW is putting it mildly).

So you can't link anything that isn't NSFW on Pixiv? Lol, that just solidifies my point. Frankly, if the best you can come up with is pseudo porn (or maybe not pseudo, lol) on Pixiv (I don't imagine any readers of that will care about the things I'm looking for), then that's not a very good look.


You seem surprised that porn brings innovation, but you shouldn't be: if there is anyone obsessed with creating the best possible illustration, it is indeed a Pixiv user, or more generally a user who wants to create porn of their favorite character. Moreover, I know these comics not because I have a weird obsession with reading comics created by an AI; I know them because they are good enough to have trended as NSFW comics, whereas the comics made with MJ are known not because they are good comics but because they are made with MJ (so it's cool, I guess). So I don't see how that solidifies your point of view, ahah. If you can't control the generation, every panel will look different, a collage of images; that's why the comics made with MJ seem to be known just because they are made with MJ and not because they interest other communities, like the NSFW comic readers on Pixiv. Also for this reason, I have not saved links to these posts; I found them randomly while browsing Pixiv, another reason why you should look for them yourself.


Didn't Open Assistant just announce that they weren't releasing their model weights due to safety concerns? Seems like another "Open" AI initiative.


That was a joke in the release video. The Pythia model is already released at [1] and the deltas for the LLaMa model should be up here [2] in the next few days.

[1] https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-...

[2] https://huggingface.co/OpenAssistant/oasst-llama-based-model...


Unfortunately [2] is just a placeholder for now, but it does look like the intent is to publish the weights.


It's also relatively cheap to make your own LLaMA-30B weights; the real value of OpenAssistant is in the training data, and all of that data has been made available.

The OpenAssistant effort gets an A+ for open source contributions.


The announcement video by Yannic contained a (lengthy) gag to that effect; has it been taken out of context, or did something actually happen now?

https://youtube.com/watch?v=ddG2fM9i4Kk&t=132

It's easy to miss but after the negative build-up he says: "and... I'm kidding!"


Dangerous gag, he said “I’m joking” so quickly it’s very easy to miss. I imagine the commenter is not alone in having that wrong impression.


Oh, ha, yeah this is exactly the gag I fell for. I just noped out of the video and wrote off the project as this was the first I ever heard of them, and their website just has a signup and no downloads I could see.

Too bad my original comment is too old to edit.


Unless something changed, I thought it was that they literally cannot legally release the weights that are based on LLaMA (except maybe with an xor thing) so they’re going to train it based on something else


Is any of the Open Assistant stuff based on LLaMA? I thought they release (at least some version) before LLaMA even dropped?


Yes, there’s also something based on Pythia but it’s a smaller model


IIRC, the video said they will train it on a properly open-source model as well.


There was a dumb joke along those lines in an announcement video, meant as a jab at OpenAI. It's easy to miss the "just kidding". (I did, initially.)


It might be a little late, but I hope datasets start incorporating patent texts as well:

1. It's a large corpus of technical knowledge; 2. The language is written by experts in a field and reviewed many times, and 3. They have technical drawings with labels and references in the text

The only downside I suppose is that sometimes patents are written with "just enough knowledge" to get it granted but not too much to give away the secret sauce. That's not really that different from many scholarly papers though.

To give a sense of scale, the granted patent texts of 2020 (without images) amount to about 160 GB of data, and we have digitized grants going back to at least 1970.


You wouldn't want chatbots to answer you with the kind of language used in patent texts.


LLMs are actually pretty good at translating info in one form into another form.


Now, I don't know if I would rely on it, but I've certainly thought about asking an LLM to write my patent text for me, provided with a technical description.


The Pile already does!

Part of its contents come from the "USPTO Backgrounds" dataset. From The Pile's paper:

> USPTO Backgrounds is a dataset of background sections from patents granted by the United States Patent and Trademark Office, derived from its published bulk archives. A typical patent background lays out the general context of the invention, gives an overview of the technical field, and sets up the framing of the problem space. We included USPTO Backgrounds because it contains a large volume of technical writing on applied subjects, aimed at a non-technical audience.

More details in the paper: https://arxiv.org/pdf/2101.00027.pdf

The Pile: https://pile.eleuther.ai/


I don't know how complete the digitization of old texts is, but if you go to worldwide.espacenet.com, search for "airship" and reverse sort by date you get documents from the 1880s.

In fact I'm downloading a whole batch of patent texts right now because I wanted to experiment with semantic search on patent texts.

Anyone here have any pointers on what the state-of-the-art method for semantic search through a large corpus would be? I've just started researching, and BERT and friends seem to have been popular about 2 years ago, but things move so fast I wouldn't know what I should do now.

What about a medium sized corpus of text, say 100,000 pages of text?


afaik sentence embeddings via sbert are still considered a pretty viable path. This may be what you were already looking at, but there's more info here: https://www.sbert.net/index.html
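A minimal semantic-search sketch along those lines, assuming the sentence-transformers package (the model name and documents are just illustrative):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = [
        "A dirigible airship with a rigid aluminium frame.",
        "A method for vulcanising rubber at low temperature.",
        "An apparatus for steering a balloon by means of propellers.",
    ]
    doc_emb = model.encode(docs, convert_to_tensor=True)

    query_emb = model.encode("lighter-than-air steering mechanism", convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_emb)[0]
    for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: -x[1]):
        print(f"{score:.3f}  {doc}")

For a 100,000-page corpus you'd normally precompute and store the document embeddings once (or put them in a vector index) rather than re-encoding per query.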


That's awesome! Are people thinking about training it for more than just 1 epoch? I believe Galactica showed that training for even 4 epochs is ok. Also, how amazing would it be if the next gen of open-source LLMs increased the context window, like adding 8k more tokens? That's probably expensive, but totally doable.


The issue with more context tokens is that they shoot up inference memory usage


Once this barrier is broken down we'll see a lot of cool things. 32k on GPT-4 is already pretty cool but once we get into hundreds of thousands/millions of tokens of context we'll be able to easily do things only currently achievable with fine tuning and "memory" tricks. Assistants that remember everything you've ever told them, asking detailed questions about large datasets, even complex systems that are bootstrapped from the context.


It's including Common Crawl data 4 or 5 times, does that count?


Has anyone investigated whether OpenCyc can be converted to natural language (presumably English) and then ingested into this? Cyc made an attempt years ago to "encode common sense" and a subset called OpenCyc was released. That might be a great way to kickstart information representation of the real world. The latest version of Cyc is proprietary, but I think OpenCyc is an open subset (though I'm having trouble confirming that, so the licensing may not be good).

Some links: https://github.com/bovlb/opencyc https://github.com/asanchez75/opencyc


My understanding is that LLM and Cyc are fundamentally different forms of AI. Even if you could turn OpenCyc into text rules, once ingested it would just dissolve into the ocean of training text data and would not significantly gain more apparent "common sense" than it already had. Maybe a more interesting combination could be to have both Cyc and LLM working side by side and comparing notes before agreeing on a result.


GPT-4 seems capable of creating CycL output from a text prompt. It might be an interesting guard against hallucination - much like a student being asked to show their work, you're forcing the LLM to go through the steps of framing the problem logically, in a way that's interpretable by the teacher independent from the student's thought processes.

That said, it certainly seems like there hasn't been recent work on hosting the OpenCyc knowledge graph in a reasonably modern way, much less the more recent closed-source work by Cycorp (https://cyc.com/). And it's likely GPT-4 doesn't know the full capabilities beyond whatever tutorials were on the web at the time of its training. If I were Cycorp I'd be seriously looking at developing this kind of hybrid model, with an agent model having access to recall their closed-source examples, as a paid cloud offering; there would likely be many who would desire this best-of-both-worlds.


I've been wondering this for a while now. Cyc has tons of knowledge in a white-box, formal system. If it just had a front-end that could convert from natural language to Cyc knowledge queries and back, we wouldn't have to worry so much about hallucinations, catastrophic forgetting, or trying to fit the entire database in VRAM.


If we were to dream it could also include libgen-text, which I think is as close as it gets to a detailed repository of world knowledge. It's only 1.2TB more. Torrent/Magnet from an older HN post: https://www.offlineos.com/


Love this - I'll happily accept a bit of a quality trade-off for a pure open model. It's a bit like how I'm willing to accept trade-offs to ensure my IoT gear is local only, even if that means loss of cloud convenience.


This is cool, now we just need to locate 1,000,000 A100-80GB equivalent GPU-hours. If we had a SETI@Home type project setup for this, it would be straightforward - only $50K worth of electricity for the 65B model.

Given the immense momentum behind LLaMA, I'm pretty disappointed that Meta won't just open-source it, but I guess reproducing it is better long-term.


I think the time for a folding@home or Berkeley BOINC-style project is now. It would also serve as a backend server farm for all the university-based research activities, thereby ensuring the research outcomes are not beholden to any one company or benefactor.

I remember setting up my PS3 & home desktop for the Folding@home project. It's fair game, especially if I can use the box to heat the room instead of the furnace.


I'm somewhat scared and somewhat amazed by the speed of this progress.


This is huge. I was just checking today on what it would take to get a model similar to LLaMA, since Meta did not share the training code or dataset. Looks like they have figured out how to make the dataset; the main problem there is pre-processing it. The second step is to write the code to train the model, and the final one is to do it cheaply.


Maybe they should use whatever Cerebras used. The whole point of their own LLM release was as a maximum compute/$ demonstration on their platform.

Surely there is a better alternative than a bunch of A100s on AWS...


Yeah. They will use Frontier at Oak Ridge, also known as the most powerful supercomputing system in the world. Maybe it can run some expensive LLM training for once rather than its typical diet of physics simulations and quadratic gene-gene interaction models :)


The name is obviously inspired by the Anna Dewdney children’s books.


My kids love that book, and my oldest had me read it to his preschool class earlier this year.

Here is a much more creative reading by Ludacris [0]

[0] https://www.youtube.com/watch?v=PFtHeo7oMSU


Great to see this, but the dataset is the trickiest part. There is no way to confirm whether this is a good dataset unless a model is actually trained on it. To reproduce LLaMA, you need $2M of compute.


Do you have a calculation that shows where that $2M number comes from, EXACTLY?


https://arxiv.org/pdf/2302.13971.pdf table 15. 1,770,394 A100-80GB hours to train the entire model suite; at the going rate for cloud 8xA100-80GB instances (~$12/hr if you could actually get capacity) that's ~$2.6M, under extremely optimistic assumptions. YMMV on bulk pricing ;) "the more you buy the more you save"
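Written out (GPU-hour figure from table 15 of the paper, hourly rate as quoted above):

    gpu_hours = 1_770_394        # A100-80GB hours for the whole model suite
    price_per_8gpu_hour = 12.0   # USD, ~$12/hr for an 8xA100-80GB instance

    cost = gpu_hours / 8 * price_per_8gpu_hour
    print(f"~${cost / 1e6:.2f}M")  # ~$2.66M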


Hmmm… the values in the 7B model seem feasible. An order of magnitude lower GPU hours, plus presumably the lower parameter count means it probably could fit on a 24GB Radeon RX 7900 XTX, which has higher single precision flops than the A100 and costs $1000 instead of $15,000.

An order of magnitude lower GPU-hour time, plus if you train it for 210 days instead of 21 days, means you could do a 7B model with 20 consumer GPUs which are $1000 apiece. $20k, not counting mainboard, etc. Really not bad. Might even be doable as a volunteer project.


I'm not aware of any efficient transformer training code for AMD cards.

Also most training is done using bfloat, not single precision (which is usually only used for accumulators)


Sure, you would need to rewrite the training code for AMD's ecosystem. If you're using mixed precision training, I suppose you're right about BF16. That puts the relative performance of A100 about 2.5x that of the Radeon RX 7900 XT. May be better to go with the Nvidia GeForce RTX 4090 with a $1600 retail.


It all works with PyTorch and Hugging Face's transformers library out of the box with ROCm.


You would need to compile a few components from source for Navi 31 if you were to try it today, so out-of-the-box is perhaps an overstatement, but it's certainly doable.


Page 4 https://arxiv.org/abs/2302.13971

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

At $4/GPU-hour per A100 80GB GPU, that's $4 * 2,048 * 21 * 24 = $4,128,768.


Hmmm… so a 7 billion parameter model could probably be trained on consumer GPUs for one or two orders of magnitude lower cost, particularly if you didn’t go well beyond Chinchilla-optimal training time.


The whole point of Llama is to go beyond Chinchilla optimal:

> The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference.


I hope they also build a variant dataset that includes high-quality translations and enough material in other languages. Even with the limited non-English content that slipped through LLaMA's training dataset, LLaMA still showed strong (albeit useless) evidence in my testing that it would have excelled at expressing knowledge coherently in other languages.

Obviously this would increase training costs substantially, so I understand why more languages are not included in the base dataset.


hmm, could increasing the amount of data in other languages potentially improve performance within the english portion of the model? from it building deeper understandings of concepts that are found within other languages.


Great initiative. Next, we need a lot of compute! Perhaps Tenstorrent wants to make a good impression?


> we are training a full suite of models, with the first becoming available in the coming weeks.

Sounds like they already have the compute and began training.


They missed the chance to call it OpenPajama


Calling next month's headline: "OpenPajama: RedPajama weights fine-tuned on Literotica and fanfiction.net"


Could be NoPajama as well!


So is the next step for someone to come in and fine-tune on top of it in order to make it a Vicuna? Or can the current Vicuna deltas be applied?


Hi! I lead Product at Together. We will be releasing a full suite of models trained on this data starting with the first models in the coming weeks. We will release RedPajama base models and RedPajama instruction-tuned models. All of the models will be released under the Apache 2.0 license, allowing commercial use.

Therefore, anyone will be able to fine-tune the RedPajama models using Vicuna or other datasets, given they will be fully open-source.

The RedPajama instruction-tuned models will be fine-tuned only with instruction labels from human labelers and OpenChatKit feedback (). We feel this will keep these models fully "clean" for use in commercial applications, without using the output of other commercial models like those used in Alpaca or Vicuna. However, we'll be excited to see all the great fine-tunes created by the open community and are eager to see how close open-source models can get to the quality of leading commercial models over time!!

(*) OpenChatKit: https://huggingface.co/spaces/togethercomputer/OpenChatKit


Yeah, it's pretty trivial to change the base model from LLaMA to this next one. You just have to fine-tune it with the same data previously used to train Vicuna.


There's no model yet, only a dataset.


My understanding is that LLaMA's architecture is open, so the most difficult parts are:

1. Getting data of equal or better quality

2. Securing the funding/hardware required for training

3. Learning/figuring out the training challenges needed to tune the process (the PhD part)

It seems #1 is relatively the lowest-hanging fruit and a prerequisite for the other two, and that's what the project is (rightfully) tackling at this stage. #2 could be solved in many ways, and doesn't require much innovation if the project and the team are solid. Which takes me to #3, which on the other hand seems to be the make-or-break part of the project.

I'm not one to doubt the technical prowess of the RedPajama team and their contributors; I rather see it economically. How can an AI open-source project compete with big tech in attracting the brilliant minds of our generation? It's enough to look at levels.fyi to see the battle is not ... level.

There's a serious economic challenge here to having any sort of sustainable open-source initiative in AI.


As I understand it they have the input data, but next up they are creating the model. I could make a joke about drawing an owl ... but that would be a bit mean. I am really glad people are working on this.

I wonder... who is paying? Will there be restrictions like ethics clauses and suchlike? Not necessarily a bad thing if there are. Will there be restrictions on commercial use?


Hi! I lead Product at Together. We will be releasing a full suite of models trained on this data starting with the first models in the coming weeks. We will release base models and instruction-tuned models. All of the models will be released under the Apache 2.0 license, allowing commercial use.


Thanks, that is exciting! Given these LLMs seem to cost millions of dollars (or hundreds of thousands) to train, how is this funded? Is it a government/research-funded thing, or VC, or philanthropy, for example?


If the name is a reference to Ogden Nash's poem then I am very tickled: https://www.madisonpubliclibrary.org/engagement/poetry/poem-...


I'd guess it's the book Llama Llama Red Pajama: https://openlibrary.org/books/OL24377652M/Llama_Llama_Red_Pa...


Ya seems like it’s this and that’s an awesome name!


I'm holding out for the MadAtMama model.


Someone on HN made a point that weights can't even have copyright; they lack two of the requirements for being copyrightable:

https://news.ycombinator.com/item?id=35508651


Definitely thought this was about the kid's book.


So how do I use this? As someone new to the domain.


You download the 2.76TB of data. Then you run it through LLaMA's training script for a couple of months on 40 NVIDIA A100s, and you should have yourself a pretty fine large language model you could use to host your own ChatGPT service. It'll be significantly worse than ChatGPT, for reasons that aren't yet fully clear, because OpenAI switched its mission from protecting the earth from nefarious AI developments to itself being the origin of possibly nefarious AI developments.


Renting 40 NVIDIA A100s is around $70k per month (on Vultr, I see). So this would only cost $420k for 6 months. Seems doable.

Is 40 A100s enough, though? I am interested in what this would cost.


LLaMA 65B used 2048 80GB A100s for 21 days:

> When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM.[1]

Note that you probably need to budget for double to triple that because things go wrong and it usually takes multiple starts to get a good training run.

Smaller models are cheaper though.

[1] https://arxiv.org/pdf/2302.13971.pdf
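To put the 40-GPU question above in perspective, here's a rough sketch that just scales the paper's numbers, assuming throughput scales linearly with GPU count (which is optimistic):

    # Rough scaling of the LLaMA 65B numbers down to 40 A100s.
    # Assumes throughput scales linearly with GPU count, which is optimistic.
    gpu_hours_65b = 2048 * 21 * 24             # ~1.03M GPU-hours for the 65B run
    days_on_40_gpus = gpu_hours_65b / (40 * 24)
    print(f"{days_on_40_gpus:.0f} days")       # ~1,075 days, i.e. roughly 3 years

    # A 7B model needs roughly a tenth of the FLOPs per token (and LLaMA 7B was
    # trained on fewer tokens), so on the order of months on 40 A100s -- still a
    # serious undertaking, but at least conceivable.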


It pains me to see AMD just sitting on their asses through this incredible development of AI and possibly AGI. If they still can't get their shit together, then they should spin off the discrete GPU division into something purely compute-focused. I believe there is now enough momentum in the AI/ML space to fully develop innovative ideas on the hardware front.


It would be great if this could be done on 3090s. A used 3090 usually costs $500-1000 to buy.


You don't, since they're not done yet. Someone will come up with a way to use it when they're done.


Why don't people compress plain text?

    2.5G filtered_08cdfa755e6d4d89b673d5bd1acee5f6.sampled.jsonl
    834M filtered_08cdfa755e6d4d89b673d5bd1acee5f6.sampled.jsonl.lz4
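For what it's worth, compressing a shard like that is a one-liner; a minimal sketch using the python-lz4 package's lz4.frame module (just one tooling assumption; the lz4 CLI works equally well):

    # Minimal sketch: LZ4-compress one jsonl shard.
    # Assumes the python-lz4 package (lz4.frame); the lz4 CLI works just as well.
    import shutil
    import lz4.frame

    src = "filtered_08cdfa755e6d4d89b673d5bd1acee5f6.sampled.jsonl"
    with open(src, "rb") as fin, lz4.frame.open(src + ".lz4", "wb") as fout:
        shutil.copyfileobj(fin, fout)   # streams the file, so memory use stays low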


Code generation: I wonder about the difference in output given the order of operations between training and fine-tuning. What if, as an example, the model was trained on the documentation and the code base for Python?

Then fine-tuning came from training on actual Python code on GitHub.

If the model understands the Python documentation and the standard library/interpreter implementation, is there then a reduction in the data needed for code generation, thereby reducing the size of the dataset used for code generation?


I saw this in my feed recently which was an interesting analysis on how code training was added as a fine tune (Codex) on a foundational model (GPT-3): https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...

I do wonder if anyone is considering mixing in larger and larger percentages of The Stack https://huggingface.co/datasets/bigcode/the-stack with this or the Pile to get more code and see what happens.

(Likely beyond mere mortals' budgets though.)


Love to see the inclusion of PubMed Central papers, all Pubmed ID abstracts, bioRxiv and medRxiv papers, and NCBI summaries of gene function in the base training set.


> GitHub: GitHub data, filtered by licenses and quality

Does anyone know which licenses are filtered into the dataset?


The description on the linked HuggingFace page[1] says MIT, BSD and Apache.

[1] https://huggingface.co/datasets/togethercomputer/RedPajama-D...


It's better than laundering GPL code, but it still breaks the terms of those licenses as well, namely attribution.


I guess that could potentially be fixed if citation ejection could somehow be implemented into it, which seems at least feasible?


Pyjama singular actually works, but I'm not sure Pajamas can be singular.


I think that at this point, LLM etymology is way more interesting than LLMs themselves.


Too many of these models are just using stuff like Wikipedia. If you want true language comprehension, you need something like Reddit, which contains actual conversation.


@dang The title should be changed from MILA to Mila/IQIA.


How does one just use tokens and train a model? I thought you needed to label each one with what it means?


You are thinking of supervised learning, in which a model is trained to generate labels for inputs.

In self-supervised learning, the training target is a modified version of the input.

Transformers are trained in a self supervised manner. The problem they solve is "given a sequence of N tokens, what is the next most likely token?"

No labels required :)
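A minimal sketch of what that objective looks like in practice (illustrative token IDs only; no particular tokenizer or model is implied):

    # Self-supervised next-token prediction: the "labels" are just the input
    # sequence shifted by one position. Illustrative PyTorch sketch only.
    import torch

    token_ids = torch.tensor([[101, 57, 892, 3, 440, 12]])  # one tokenized sequence
    inputs  = token_ids[:, :-1]   # the model sees tokens 0..N-1
    targets = token_ids[:, 1:]    # and is trained to predict tokens 1..N

    # logits = model(inputs)      # shape: (batch, seq_len, vocab_size)
    # loss = torch.nn.functional.cross_entropy(
    #     logits.reshape(-1, logits.size(-1)), targets.reshape(-1))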


What about the human reinforcement part of it?


There's no traditional human reinforcement.

Models like GPT-3 get turned into models like ChatGPT through RLHF (reinforcement learning from human feedback), by fine-tuning the model further on prompts in the style we'd like them to respond in, typically:

User: question

Bot: Response

This is done by handcrafting data or modifying data from places like Stack Exchange.
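Concretely, each instruction/response pair just becomes more text in that template before tokenization; a toy sketch (the template and names here are illustrative assumptions, not OpenChatKit's actual format):

    # Toy sketch: turn an instruction/response pair into a fine-tuning example.
    # The template and field names are illustrative assumptions only.
    def format_example(question: str, response: str) -> str:
        return f"User: {question}\nBot: {response}"

    example = format_example(
        "Which licenses are included in the GitHub subset?",
        "MIT, BSD and Apache, according to the dataset card.",
    )
    # The formatted string is tokenized and trained on with the same next-token
    # objective as pretraining (often with the loss masked on the prompt tokens).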


It should probably be compressed as 2k-odd separate files and then unzipped as needed; that would likely get you down to around 650GB.


Is there a train/test/eval split available somewhere on this dataset too?


This guy has kids, so we all know he.. nevermind. I love the name being a parent myself.


1.8T tokens is very high...


When will the first code-writing-specific model arrive?




