I lead AI teams at my company. I've advised leadership against doing any kind of training or fine-tuning of models ourselves.
We're not in the business of training models. We will never be as good as OpenAI / Anthropic etc.
The real value in applications lies in smarter prompting techniques and RAG. There is a lot of room at the bottom for doing "dumb" things and simply feeding models the right context to deliver customer value.
"I've finetuned llama/mistral models that greatly outperform GPT4 with just a prompt"
If you write about your experiments with that in detail I guarantee you'll get a lot of interest. The community is crying out for good, well documented, replicable examples of this kind of thing.
I'm so behind in this area. I had finetuned a model that was SOTA and worth publishing about in October, but procrastinated. I'm scared to check if somebody else already published on this topic.
Do you always assume other people are incompetent? That's not very nice of you.
I mostly work on AI, so I know whether I'm overfitting or not. It performs provably better in its domain (a niche programming language). GPT-4 can barely write a hello world for it.
I'm not creating a "better GPT4" general chatbot. I'm finetuning for a specific task.
This. I work in a startup and told upper management we need to keep focusing on ML models that bring tangible benefits to our customers, and then try to integrate LLMs into their current flow instead of pivoting completely to LLMs. It seems they valued the input, and now we're going for a hybrid approach.
Hear hear. I know a 3 person startup that has a "lead AI researcher" who is trying to train and fine-tune models. That's not their startup's purpose though... they have an actual product. So wtf are they doing? The lead AI guy thinks he's going to compete with these big companies and it's total fantasy.
That does indeed sound crazy. But finetuning is also a commodity these days. You can train a good Mistral LoRA in under 24 hours on a single consumer GPU. We’re talking about $10 of compute.
You can run a dozen of these LoRAs atop the same base model on the same infrastructure for a dozen specific use cases.
The inference quality, performance and cost can all be substantially better than GPT4 with prompting.
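For anyone curious, the "$10 of compute" part looks roughly like this with Hugging Face transformers + peft (a minimal sketch; the base model, dataset file, and hyperparameters are illustrative placeholders, not a tested recipe):

    import torch
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    base = "mistralai/Mistral-7B-v0.1"                  # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

    # LoRA: train a few million adapter weights instead of all 7B parameters.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

    # Hypothetical file of task-specific examples, one {"text": ...} per line.
    data = load_dataset("json", data_files="my_task_examples.jsonl")["train"]
    data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024))

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                               gradient_accumulation_steps=8, num_train_epochs=2,
                               learning_rate=2e-4, bf16=True, logging_steps=10),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("lora-out")  # only the small adapter is saved; many such
                                       # adapters can be served atop one base model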
A remarkable comment in that it is clear, confident and wrong.
Fine-tunes lead to catastrophic forgetting.
RAG is only irrelevant if you’re completely disinterested in cost and latency.
We also don’t have enough data to gauge the performance of models with >200k context windows when reasoning over inputs of that size, much of which will be irrelevant to any particular user. Multiple-random-needles-in-a-haystack tests work flawlessly, but that rarely reflects real-world activity.
> RAG is already irrelevant with large context windows
Just last Friday I took the contents of the 2024 folder of one of the teams at the company I work for, for which we use RAG at the moment. I dumped the text index, concatenated it and used Google’s API to return the token count, to see if it would fit in Gemini’s 1M context window; turned out it was 5.7M tokens. And that’s less than 3 months worth of documents for that team.
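The token-counting step itself is only a few lines against Google's API; a sketch with the google-generativeai package (the folder path and API-key handling are placeholders):

    import pathlib
    import google.generativeai as genai

    genai.configure(api_key="...")                      # your API key
    model = genai.GenerativeModel("gemini-1.5-pro")

    # Concatenate every dumped text file from the team's 2024 folder.
    docs = sorted(pathlib.Path("team_2024_dump").glob("*.txt"))
    blob = "\n\n".join(p.read_text(errors="ignore") for p in docs)

    # count_tokens tells you whether it fits in the context window
    # without paying for an actual generation call.
    print(model.count_tokens(blob).total_tokens)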
So yeah RAG is not dead yet, although I do question its usefulness, but that’s a separate topic.
I did. But this is under an enterprise deal with them that warrants privacy, not the generally available stuff. OpenAI has similar arrangements (Enterprise ChatGPT) and MS Azure before them.
There is no citation needed. It is indeed trivial to fine-tune. Doing a good job is another matter, but the claim is correct. Google around and find a blog post showing how.
For "citation needed", read "please link me to a blog post showing how, don't just tell me to Google for one".
The internet is full of blog posts about this. That doesn't mean they're actually good - I'd love to be pointed at one that has proven itself useful for someone (and definitely isn't just LLM blog-spam).
I don't care if it's trivial to fine-tune and get crap results - I care about fine-tuning where the result was worth the effort.
It's an internet forum, not an academic journal. Watertight arguments are not needed. If one wants to call BS, they can just do it; there's no need to dance around the topic by asking for a citation.
OK, I call BS. Fine-tuning an LLM is not "trivial" - especially if you want to get useful results, as opposed to just being able to say "look, I fine-tuned an LLM".
Exactly, it's a forum not Twitter/reddit. Without references and citations this is no better than a bunch of random words, and it's hard to make any argument of substance.
The person asked for citations; leave it be. Stop dudexplaining how the Internet works, please.
They didn't ask for citations. They pointed out a citation was needed. It was a clever sounding way of calling bs. They admit as much.
Even when people sincerely ask for a citation on a debatable topic on an internet forum, it's effectively saying "I won't hear any opinion that doesn't match my own unless it's as watertight as a law of physics". Another form of this is "show me the data".
Finetuning LLMs is currently the most promising path for next-gen robotics. One such work (PaLM-E), among other things, measured the impact of finetuning on general-purpose tasks: https://twitter.com/DannyDriess/status/1632904698108821505
In short, an 8B model could degrade almost 10x after being finetuned on robotics tasks, while a 500B model experiences only a very minor degradation (~4%), and there's hope that with an even larger model it could become zero or even negative (adding more experience improves general-purpose reasoning).
What I am saying is that while GPT-4 could beat a finetuned GPT-3.5 class model, I predict good things about finetuned GPT-4 class models, when they become practical outside of OpenAI/Google.
> there's hope that with an even larger model it could become zero or even negative (adding more experience improves general-purpose reasoning)
The best single work of fiction ever created about LLMs' capabilities (and, perhaps, dangers) is Colossus by Jones. Although I think the film is even better than the book, only the latter mentions how, despite being created specifically for US national defense, Colossus is also fed unrelated data, including Shakespeare's sonnets, because its creators did not know whether it might be important.
Interesting! I would like to learn more about how AI is being applied to robotics. Do you have any suggestions for how to keep up with developments/ideas in this field?
..and plan to do an updated version soon for much of what's been released since. I've also done work related to LLM and robotics integration, also on that site.
Working my way through your blog post and it is so refreshing. Unfortunately, my feed's algorithm is currently showing me takes that are extreme on either end (like the ones discussed in your blog post).
> Technology’s largest leaps occur when new tools are provided to those that want to make things.
I love this sentence, and the general attitude of curiosity in your post.
Thanks! Appreciate the kind words. In the next month or so (interviewing and finishing my Master's, so there have been delays) I should have a follow-up that covers more advancements in the router-style VLA, sensorimotor VLMs, and embedding-enriched vision models in general.
If you want a great overview of what a modern robotics stack looks like with all this, https://ok-robot.github.io/ was really good and will likely make it into the article. It's a VLA combined with existing RL methods to demonstrate multi-tasking robots, and it serves as a great glimpse into what a lot of researchers are working on. You won't see these techniques in robots in industrial or commercial settings yet - we're still too new at this to be reliable or capable enough to deploy them on real tasks.
Extremely hot LLM take: You will often get better results with few-shot prompting (with good examples) on a modern LLM than with a finetuned LLM.
Finetuning was the best option for weaker LLMs with smaller context windows (e.g. the original GPT-3): both problems have been solved nowadays.
The cost economics are much better with few-shot prompting to modern LLMs too: input tokens are super cheap (especially with the recently-released Claude Haiku), so giving a lot of examples per call will still end up cheaper than finetuning.
Meanwhile, a finetuned ChatGPT costs 4-6x of normal ChatGPT usage.
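To make that concrete, few-shot prompting is just stuffing labelled examples into the message list; a sketch against Claude Haiku (the task, labels, and examples are made up):

    import anthropic

    client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

    EXAMPLES = [
        ("Order #1234 never arrived and support won't reply.", "shipping_issue"),
        ("I was charged twice for my March invoice.", "billing_error"),
        ("How do I rotate my API keys?", "how_to_question"),
    ]

    def classify(ticket):
        shots = []
        for text, label in EXAMPLES:                 # the "few shots"
            shots.append({"role": "user", "content": text})
            shots.append({"role": "assistant", "content": label})
        resp = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            system="Reply with exactly one label: shipping_issue, billing_error, or how_to_question.",
            messages=shots + [{"role": "user", "content": ticket}],
        )
        return resp.content[0].text.strip()

    print(classify("My parcel has been stuck in customs for two weeks."))

In practice you'd use dozens of examples rather than three; at Haiku's input-token prices that's still fractions of a cent per call.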
For those who were oblivious to it, like myself: The Bitter Lesson was written by Richard S. Sutton, who pioneered reinforcement learning a long, long time ago.
As far as I know, it’s usually called an uncountable noun.
…but ‘computation’ is also uncountable, and your second sentence seems to be perfectly fine to me.
Your examples do not constitute an argument. You haven’t articulated the (purported) difference between the two words; you’ve just decided arbitrarily that some sentences don’t work, and not elaborated or explained at all.
I can make up words too, and provide example sentences: “karrotz are delicious” works. “carrots are delicious” doesn’t. “inside the karrotz” doesn’t work. “inside the carrots” does.
I don’t actually think there is any difference. The above comment about ‘brospeak’ was snarky but I do think it’s more of a cultural phenomenon than a semantic one — unless someone is willing to kindly explain the difference rather than just rolling their eyes!
What exactly is wrong with the sentence ‘this would require huge amounts of computation’? Saying ‘compute’ seems more to be a synonym of ‘computation’ that’s caught on recently than a useful gap-filling addition to the language. Again: reasoned arguments please. Or just ‘we think it sounds cool so we use it’ — that’s fine, too.
EDIT: pondering briefly, perhaps one could argue the difference is something like ‘you can own compute, but you can’t own computation.’ ‘Compute’ is the capacity to carry out computation. …although ‘compute’ seems to be used to refer to the ‘abstract’ computation being done as well as the computational resources, so I don’t know.
I’m stretching it. To be honest I’m not sure it’s a useful (or even real) distinction. I think it’s a matter of fashion, and that’s fine and normal.
Several MSFT AI/ML friends actively dissuaded me and my team from fine-tuning. They said that it's pretty clear in all their internal tests that it "lobotomizes" the general reasoning capabilities of the model, unless you're really careful.
Yes, that's what I've seen in a lot of my experiments with fine-tuning. One has to be really careful not to "lobotomize" an already capable model while still getting better results in the end. It's trickier than it seems from the many tutorials out there.
But I believe that most of the data stored in foundation models is just useless for a particular domain, so it's better to forget some of it and gain really useful information instead.
That might be true for finetuning ChatGPT 3.5, but if you can finetune a small model (7B or less) to perform on par with GPT-4, while being faster and private, that’s a different story.
By definition you wouldn’t fine-tune a 7B model to be generally as good as GPT-4. You would just be trying to overfit some small amount of functionality in a narrow domain.
Yes but from the context of this discussion, we’re trying to figure out the “sweet spot” model size where it’s worth attempting fine tuning. My guess is it’s only worthwhile for matching simple tasks with small models, and any sufficiently complicated task it’s better to do few/zero shot instead.
A low-tech example to create a good blog post title for submission to Hacker News would be a system prompt like:
You are an expert copywriter. Write five distinct blog post titles optimized for high clickthrough for Hacker News for the article the user provides.
Your response must follow the style of these titles:
- The ü/ü Conundrum
- Why isn't preprint review being adopted?
- Majority of web apps could just run on a single server
- Weather Planning for Eclipse Day
- PSChess – A chess engine in PostScript
Then provide the blog post as the user message input.
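For example, wired up with the OpenAI Python client (any chat API with a system/user split works the same way; the file path and model are placeholders):

    from openai import OpenAI

    client = OpenAI()   # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = """You are an expert copywriter. Write five distinct blog post titles optimized for high clickthrough for Hacker News for the article the user provides.

    Your response must follow the style of these titles:
    - The ü/ü Conundrum
    - Why isn't preprint review being adopted?
    - Majority of web apps could just run on a single server
    - Weather Planning for Eclipse Day
    - PSChess – A chess engine in PostScript"""

    article_text = open("blog_post.md").read()   # the post you want titles for

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": article_text},
        ],
    )
    print(resp.choices[0].message.content)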
Here are five distinct blog post titles optimized for high clickthrough on Hacker News for the article provided:
1. Tipping ChatGPT: Does Offering Monetary Incentives Improve AI Text Generation?
2. Quantifying the Impact of Incentives on Large Language Model Performance
3. Carrot or Stick? Exploring the Effects of Positive and Negative Prompts on ChatGPT
4. Gamifying AI: Using "Generation Golf" to Test ChatGPT's Ability to Follow Length Constraints
5. The Curious Case of ChatGPT's Motivations: Can an AI Be Incentivized Like Humans?
Not bad titles, although more verbose than the 5 input examples I gave. I only gave 5 for simplicity: my main point is that you can give it a lot more than five and/or be more aggressive with constraints, as the linked blog post does, incidentally.
Intuitively, prompting like this to get an answer seems basically like the first part of a fine-tuning process (more exemplars).
What is your thinking here on why reinforcing good output via loss optimization is worse than the one-shot example? Does the model start to overfit at some point toward some local minimum, and is this avoided in this scenario?
Use Gemini 1.5 Pro, which has a 1M-token context window. Prompt it with a logical question and observe it struggling to answer. Then upload a book on logical thinking in PDF format and ask the same question again. Notice how it can now answer the question effectively.
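If you want to reproduce that, it's only a few lines with the google-generativeai package (the PDF file and the question are placeholders):

    import google.generativeai as genai

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-1.5-pro")

    question = "A says B lies, B says C lies, C says A and B both lie. Who tells the truth?"

    # 1. Ask cold.
    print(model.generate_content(question).text)

    # 2. Upload a book on logical reasoning, then ask again with it in context.
    book = genai.upload_file(path="logical_thinking.pdf")
    print(model.generate_content([book, question]).text)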
This discussion is so dumb - finetuning a base model costs ~$1 with LoRA/QLoRA and can yield the same performance as GPT-4, but at 1/100th of the cost per token.
I had opened a new tab back when this comment was just a few minutes old, in hopes that when I came back there would be some really great blog post linked with the details of the sorcery.
IMO it's possible to over-generalize from this datapoint (lol). While it's true that creating a general "finance" model that's stronger than GPT-4 is hard, training a task-specific model is much easier. Eg. "a model that's better than GPT-4 at answering finance-related questions": very hard. "A model that's better than GPT-4 at extracting forward-looking financial projections in a standard format": very easy.
And in practice, most tasks people are using GPT-4 for in production are more like the latter than the former.
(Disclaimer: building https://openpipe.ai, which makes it super easy to productize this workflow).
I saw some comments here saying to check out Claude. From what I can tell, Claude hasn't figured out how to do the whole "generate Python code and run it in a Jupyter notebook" thing for math yet.
Isn't that something you get from the infrastructure surrounding the LLM? I thought the "running code" feature didn't need specific support from the LLM, besides being able to output conforming JSON or code when asked to.
That's because the Claude web UI doesn't yet have the equivalent of the ChatGPT Code Interpreter tool (though they say they're working on it). That's not about the quality of the Claude 3 Opus model, which is the model which people think compares to or beats GPT-4. It's about the tooling that has been built for ChatGPT.
Code interpreter is pretty neat, because you can tell ChatGPT to write some code and to make sure the code works, and then it'll write you some bad code, realize it's bad, and then iterate on it until it gets to a place that it's happy with. (Maybe I should say passes its test rather than anthropomorphize ChatGPT as being "happy".)
Right, like the other commenter suggested, that's an infrastructure-level thing, not a model-level thing. Given that you're talking about ChatGPT, I assume you aren't accessing GPT-3.5 or GPT-4 directly through the API but using the app or the interface provided at chat.openai.com. The magic that makes the kinds of interactions you're describing possible amounts to a bit of clever prompting sprinkled on top of some rather impressive frontend design and engineering.
Correctly prompted, even Mistral-7B can write and run code in response to questions, and it's a model that can run on laptops from half a decade ago, with two or three orders of magnitude fewer parameters than GPT-4.
> Right, like the other commenter suggested, that's an infrastructure-level thing, not a model-level thing.
By default, the ChatGPT "model" knows not to try to do math itself and instead writes code to do the math and then runs it. I get that it's set up, infrastructure-wise, to be able to run that code, but why doesn't Claude's main chat UI instead respond with "hey, do this calculation on your own since I can't", or something of that nature, rather than answering the math incorrectly?
Because ChatGPT ships with a system prompt which instructs the underlying model to do exactly that. A similar web application could be developed for Claude, and it would perform similarly with the right prompt, as it's quite good at tool use.
For example, I'm able to get Claude-3-Opus to write Python and call a Python interpreter of its own accord when questioned about time series data in some of my data analysis workflows, though I haven't glued together a pretty GUI for it yet (e.g., plots are simply saved to disk). While I haven't run into any problems around calculations yet, I'm sure it wouldn't be too hard to further refine the system prompt and ensure that all calculations are performed or checked using Python.
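A minimal sketch of that loop with the Anthropic SDK, in case it's useful; the tool name and system prompt are mine, and exec() is for illustration only - you'd want a sandboxed interpreter for anything real:

    import contextlib, io
    import anthropic

    client = anthropic.Anthropic()

    RUN_PYTHON = {
        "name": "run_python",
        "description": "Execute Python code and return whatever it prints.",
        "input_schema": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    }

    def run_python(code):
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})                    # sandbox this in real life
        return buf.getvalue()

    messages = [{"role": "user", "content": "What is 173.5 * 0.0825, to the cent?"}]
    while True:
        resp = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            system="Never do arithmetic yourself; always use the run_python tool.",
            tools=[RUN_PYTHON],
            messages=messages,
        )
        if resp.stop_reason != "tool_use":
            print(resp.content[0].text)       # final answer
            break
        tool_use = next(b for b in resp.content if b.type == "tool_use")
        result = run_python(tool_use.input["code"])
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": [{
            "type": "tool_result", "tool_use_id": tool_use.id, "content": result}]})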
Doesn't seem like you are very informed on how LLMs work, but just so you know, there are many different versions of Claude, just like how ChatGPT can use different versions of GPT.
Based on other tasks, GPT-4 overall seems to be the best, but by a very small margin, so I cancelled my subscription. The native mobile app is really great, though.
Is there a way to use Mistral-large with TTS and STT engines so you can converse with it like you can with ChatGPT in the mobile app? It's really great on long drives for learning/talking about stuff, like a customized personal podcast.
Exactly, I absolutely love this feature. And many times the conversation is quite natural and fluid (with good internet connection). I think I'll build something like that myself (:
Not sure. Most of the time GPT-4 is better. Since I'm using Vercel AI playground[1], on almost every query I get a response from all models so it's easy to compare.
I don't know what system and user prompt you are testing with, but as one anecdote, Claude 3 Opus (and only Opus) consistently gives me better coding answers than GPT-4. Maybe it's the type of stuff I am doing or how I phrase things, who knows. I was using GPT-4 since the day it came out but haven't felt like going back so far.
According to Chatbot Arena, where people vote on responses blindly and an Elo rating is determined for each LLM, GPT-4 is on top, slightly ahead of Claude 3 Opus.
Claude is excellent for brainstorming, being a thought partner, general knowledge acquisition tasks, and creative writing.
The one mixed-bag weak spot I've found is in coding -- It tends to make more "d'oh" mistakes while coding, but comes up with more creative solutions at the same time ¯\_(ツ)_/¯
One major advantage of FinGPT or Bloomberg's LLM is that the embeddings produced by the model can be used for downstream prediction tasks. GPT-4 does not expose its embeddings so it cannot be used for this.
sorry, noob here trying to make sense of this: you mean you can extract embeddings from the model file or that the embeddings are available in the repo and you can just use those files?
Kind of. You feed the LLM the input text for your prediction, extract the activations of the final layer (roughly, that layer's weights applied to the previous layers' outputs), then use that activation vector, or embedding, as the input to a separate model. That separate model can be any classifier or regression. A common use case for this is document classification.
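Rough sketch of that pattern with transformers + scikit-learn, if it helps; the open model and the toy labels are placeholders:

    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModel, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"     # any open model whose weights you can load
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
    model.eval()

    def embed(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
        with torch.no_grad():
            out = model(**inputs)
        # Final-layer activations, mean-pooled over tokens -> one fixed-size vector.
        return out.last_hidden_state.mean(dim=1).squeeze().float().cpu().numpy()

    texts = ["Q3 revenue guidance raised to $2.1B", "CFO resigns amid audit dispute"]
    labels = [1, 0]                        # e.g. positive / negative outlook
    clf = LogisticRegression().fit([embed(t) for t in texts], labels)
    print(clf.predict([embed("Company expects record earnings next quarter")]))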
Fine-tuning will disappear; there's no reason to invest so heavily in it. Also, most LLM code being written right now (prompts, RAG, etc.) will be obsolete once the next model comes out. Anyone starting an LLM application startup is arguably wasting their time; wait until the next iteration is out, then you will know what's possible.
> Also, most LLM code being written right now (prompts, RAG, etc) will be obsolete once the next model comes out.
Not true. Most prompt techniques that work on current LLMs will work on different or future models, although it will require a QA pass for any regressions.
Yeah they will work, but they will also be unnecessary. You will also bake all sorts of logic into your application that will be solved natively on the stronger model. If you believe in the scaling theory, then writing LLM applications is nonsensical.
> If you believe in the scaling theory, then writing LLM applications is nonsensical.
But not doing it is an opportunity cost. You don’t build skills, tooling, and experience, and you don’t get feedback on what works and where you should go.
It’s like computers in the 1990s: there’s always a better one 6 months away, so if you wait for it to stabilise, then you don’t do anything for a decade. Just enjoy the ride, bearing in mind that things change very fast and some things will be obsolete next year.
> Yeah they will work, but they will also be unnecessary. You will also bake all sorts of logic into your application that will be solved natively on the stronger model.
I think that a whole lot of what I do in prompt engineering is what's necessary to fully specify the output that I want.
A newer model may be less finicky, so I have a higher chance of getting it to work on the first try (and it's more reliable afterwards), but it's hard for me to imagine it needing a whole lot less prompt.
One idea is to just stuff all the documents into the prompt, which still keeps them private but avoids having to faff around with chunking, embedding, and vector stores. That’s not really the end of RAG as a concept, but it would change all the current tooling and infrastructure we’ve built for it.
I don’t think RAG is going away, at least not because of this. But I expect new techniques to become available fairly regularly.
I just think that the capability of the model could radically change, such that however you structured your RAG pipeline, it might need to be rewritten. More general problems could be solved by the model itself that you were previously solving with some complicated contraption of prompts.
Don't buy a computer today with a six month delivery lead time, because there's a company that releases computers with a same-day lead time with several improved models coming out next week.
Note that the benchmarks used for comparison are basically measuring the model’s ability to understand financial content. In other words, reading comprehension for English, just in a specific domain. It shouldn’t really be surprising that a strong generalist model performs well here.
On the other hand, GPT-4 actually did worse on the NER task - labelling and tagging terms used in the text - vs their finetuned model. I assume the finetuned model was better at using the specific labels they were targeting.
Meaning, they used the same generalized foundation model that all of us have access to, with no special fine-tuning, no retrieval-augmented generation, etc.
I don't understand your point. To me GPT-4 is not a foundation model, it's been highly tuned for the chat task. Nobody outside of OpenAI knows what that fine-tuning really involved. So it's impossible to say how much finance specific data it was trained on (in pre-training or fine-tuning) or whether finance specific tasks were involved in fine-tuning.
What I was meaning was that ChatGPT is not a foundation model since it's been fine-tuned. Although the definition in the link is sufficiently broad you could choose to include it.
I can't tell from the OpenAI docs whether it's possible to access GPT-4 without the ChatGPT fine-tuning. If so, that'd make this result more meaningful. Otherwise, I just don't think you can draw any great conclusions from this.
A lot of organizations still think they should have their own [finetuned] model to provide a custom experience to their users, so that may come as a surprise for them.
Scaling laws basically guarantee that a sufficiently larger general model will usually beat a smaller specialist model. The misunderstanding is perhaps acceptable but the headline here is essentially restating a well known property of deep learning.
Contrarian view: how these models actually operate at runtime is not understood - the formal research papers repeat that over and over again. Therefore, there will be new twists and turns as these models evolve. With current technology stacks, the "bitter lesson" is looking good, yes. Will it always be so? There's no way to know.
Yeah, the equivalent is: would it be better for a quant firm to spend 200 thousand dollars giving a first-class specialist education to a guy with an IQ of 95, or just hiring a guy with an IQ of 150 straight out of college.