DreamGen's comments | Hacker News

From what I have heard, getting a license from them is also far from guaranteed. They are selective about who they want to do business with -- understandable, but something to keep in mind.


That would be misleading. They aren't open weight (the 3B is not available). They aren't compared to Qwen 2.5, which beats them on many of the benchmarks presented while having a more permissive license. The closed 3B is not competitive with other API-only models, like Gemini Flash 8B, which costs less and has better performance.


Also, the 3B model, which is API-only (so the only things that matter are price, quality, and speed), should be compared to something like Gemini Flash 1.5 8B, which is cheaper than this 3B API and also has higher benchmark performance, much longer context support, etc.


Why I use Llama:

- Ability to self-host. This unlocks a few things: (1) a customized serving stack with various logit processors, etc. (see the sketch below); (2) more cost-efficient inference.

- Ability to fine-tune. Most stock instruct models are quite lame at AI story-writing and role-play, and produce slop.

There aren't really any pain points specific to Llama, but if we are creating a wish list:

- Keep the pre-training data diverse. There is a worrying trend where some companies apply heavy-handed filtering to the pre-training data that's based not just on quality, but also on content. Quality-based filtering is understandable and desirable, but please, keep the pre-training dataset diverse :)

- Efficient inference. Open source is way behind closed source here. TensorRT-LLM is probably the most efficient of what's out there, but it's mostly closed source. Maybe Meta could contribute to some of the open-source projects like vLLM (or maybe something lower level...).

- A lot of the improvements we saw recently came from post-training, post-SFT improvements. And it's not just the datasets (which you clearly can't just release), but also the algorithms -- and most labs are quite secretive about the details here. The open-source community relies heavily on DPO (and more recently, KTO), since it's easy, but empirically it's not that great.
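To illustrate the serving-stack customization mentioned above (the logit processors in point 1), here is a minimal sketch using Hugging Face transformers; the processor class and the banned token IDs are made up for illustration, not part of any particular serving stack:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class BanTokensProcessor(LogitsProcessor):
    """Illustrative custom processor: mask out a fixed set of token IDs at every decoding step."""

    def __init__(self, banned_token_ids):
        self.banned_token_ids = list(banned_token_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Setting the logits of banned tokens to -inf means they can never be sampled.
        scores[:, self.banned_token_ids] = float("-inf")
        return scores

# Hypothetical usage with a self-hosted model:
# outputs = model.generate(
#     **inputs,
#     logits_processor=LogitsProcessorList([BanTokensProcessor([42, 1337])]),
#     max_new_tokens=256,
# )
```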


This could have a grave impact on AI development. Here are some responses:

- EFF: https://www.context.fund/policy/2024-03-26SB1047EFFSIA.pdf

- Answer AI: https://www.answer.ai/posts/2024-04-29-sb1047.html


They were released under Apache 2.0, and there are backups in case they decide not to release them, or only release them after further alignment:

https://huggingface.co/dreamgen/WizardLM-2-7B

https://huggingface.co/dreamgen/WizardLM-2-8x22B


Engaging. But starting over from stage 0 gets old pretty fast.


What's your source on this? They only very recently reached 100K downloads on Android, and according to various SEO tools, they get maybe ~4M visits per month (and these tools tend to overestimate; plus, that's monthly visits, not DAU).


A big distinction is that you can build on top of (fine-tune) the released models just as well as if they had released the pre-training data.


You can also build on top of binaries if you use gotos and machine code.


This seems intentionally obtuse. What you say is true, but it is very obvious that this is much more of a pain than if you had the source code. On the other hand, fine-tuning is just as easy regardless of whether you have the original training data.


If you don't know the statistical distribution of the original training data, then catastrophic forgetting is guaranteed with any extra training.


I was going to ask for a reference to support this (although please provide one if handy), but the search term “catastrophic forgetting” is a great entry into the literature. Thanks.


One could also disassemble an executable and build on top of it. Not for the faint of heart, and probably illegal, but possible unless it was deliberately obfuscated. Compared to that, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM. Fragments, yes, but not all of it.


You can do better: generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained on 300B synthetic tokens generated with ChatGPT, and it showed a 5x bump in efficiency, punching well above its weight.

Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden-variety Mistral or LLaMA and only collect knowledge and skills they don't already have. I call this approach "Machine Study", where the AI makes its own training data by studying its corpus and learning from other models.
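A minimal sketch of the seeded-concept idea, assuming an OpenAI-compatible chat completions endpoint; the model name, seed list, and prompt are placeholders, not the actual "Machine Study" pipeline:

```python
import random
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder seed concepts; in practice these would be mined from the corpus you study.
SEED_CONCEPTS = ["binary search trees", "fermentation", "orbital mechanics", "supply chains"]

def generate_synthetic_example(reference_text: str) -> str:
    """Generate one synthetic training example grounded in retrieved reference material."""
    concept = random.choice(SEED_CONCEPTS)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You write diverse, detailed teaching material."},
            {
                "role": "user",
                "content": (
                    f"Using only the reference below, write a short lesson about {concept}.\n\n"
                    f"Reference:\n{reference_text}"
                ),
            },
        ],
    )
    return response.choices[0].message.content
```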


Or shell scripts


You can fine-tune without the pre-training data too.

Mistral models are one example: they never released the pre-training data, and there are many fine-tunes.
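For what it's worth, here is a minimal sketch of such a fine-tune using LoRA via Hugging Face peft; the base checkpoint, target modules, and hyperparameters are illustrative choices, not a specific recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # pre-training data was never released
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Train small low-rank adapter matrices instead of the full weights.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# ...then train on your own instruction / story-writing dataset as usual.
```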


We are in agreement -- that's exactly what I am saying :)


Mistral Instruct v0.2 is 32K.


Mixtral (8x7B) is 32k.

Mistral 7B Instruct v0.2 is just a fine-tune of Mistral 7B.


The original Mistral or the GGUF one?

