
An actually open source LLM would be a game changer. We might need a new license that covers model usage and training, something GPL-like whereby distributing a retrained model requires contributing the data back or making it public, while private use carries no such obligation.

This would definitely accelerate progress in LLM research, productization and safety. Alpaca, Vicuna, GPT4All and others are sporadic examples of this that could become a continuous improvement process were the LLM and its license truly open source.

An interesting possible side effect of a GPL-like license is that models become less likely to be trained on private data. That data is the usual moat, and big tech wouldn't want to, or simply couldn't, make it public if it were to use those GPL-like licensed models.



Dolly 2.0 is fully open, Apache licensed, and the tuning dataset is employee-generated:

https://www.databricks.com/blog/2023/04/12/dolly-first-open-...


Huh? There are plenty of open source LLMs. Pythia, GPT-NeoX, GPT-J, GPT-2, and BLOOM-176B are the ones I can think of off the top of my head. Pythia is the best performing one IIRC.


Pythia was trained on only 300B tokens and is pretty dumb compared to LLaMA.

Pythia 13B is worse than LLaMA-7B and requires double the resources.


Not all use cases need GPT-4 level performance. I'd argue that even LLaMA-7B is quite limited. Also, new and improved models are being released all the time.


I started keeping a list, btw: there are about 20 completely open (Apache 2.0, BSD, MIT, CC-BY) 1B+ parameter foundational LLMs at the moment: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...


Hmm. Would you be able to combine these LLMs? Or are they already supersets of each other?


The solution is simple. We need an updated GPL license that states that the code can't be used in training AIs unless the resulting model is also open source. With a coordinated update across all major open source projects, the issue is sorted, as it will force the AI folks to open source their models. Or else they’ll have to stick with generating funny cat pictures.


The problem isn't the license; the problem is whether it's fair use or not. If it's fair use, the license is irrelevant.


That has an even easier fix: no content, no problem.


As with the original GPL, this would be almost useless in a commercial context.


There are commercial devices that ship with a Linux kernel?


Basically every Android device for starters.


Using a Linux kernel doesn't mean you have to make your whole project GPL, unless your project is specifically kernel code.


Neither would the proposed model license. Just like the kernel's GPL stops at the userspace boundary, the proposed license would only cover the model definition and weights.


I think they mean in terms of enforcement when there's a violation.


But do they train the Linux kernel with their customers' data?



