I lead AI teams at my company. I've advised leadership against doing any kind of training or fine-tuning of models ourselves.
We're not in the business of training models. We will never be as good as OpenAI / Anthropic etc.
The real value in applications lies in smarter prompting techniques and RAG. There is a lot of room at the bottom for doing "dumb" things and simply feeding models the right context to deliver customer value.
"I've finetuned llama/mistral models that greatly outperform GPT4 with just a prompt"
If you write about your experiments with that in detail I guarantee you'll get a lot of interest. The community is crying out for good, well documented, replicable examples of this kind of thing.
I'm so behind in this area. I had finetuned a model that was SOTA and worth publishing about in October, but procrastinated. I'm scared to check if somebody else already published on this topic.
Do you always assume other people are incompetent? That's not very nice of you.
I mostly work on AI, so I know whether I'm overfitting or not. It performs provably better in its domain (a niche programming language). GPT-4 can barely write a hello world for it.
I'm not creating a "better GPT4" general chatbot. I'm finetuning for a specific task.
This. I work in a startup and told upper management we need to keep focusing on ML models that bring tangible benefits to our customers, and then try to integrate LLMs into their current flow instead of pivoting completely to LLMs. It seems they valued the input, and now we're going for a hybrid approach.
Hear hear. I know a 3 person startup that has a "lead AI researcher" who is trying to train and fine-tune models. That's not their startup's purpose though... they have an actual product. So wtf are they doing? The lead AI guy thinks he's going to compete with these big companies and it's total fantasy.
That does indeed sound crazy. But finetuning is also a commodity these days. You can train a good Mistral LoRA in under 24 hours on a single consumer GPU. We’re talking about $10 of compute.
You can run a dozen of these LoRAs atop the same base model on the same infrastructure for a dozen specific use cases.
The inference quality, performance and cost can all be substantially better than GPT4 with prompting.
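For anyone curious, the "$10 of compute" part looks roughly like this with Hugging Face transformers + peft (a minimal sketch; the base model, dataset file, and hyperparameters are illustrative placeholders, not a tested recipe):

    import torch
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    base = "mistralai/Mistral-7B-v0.1"                  # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

    # LoRA: train a few million adapter weights instead of all 7B parameters.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

    # Hypothetical file of task-specific examples, one {"text": ...} per line.
    data = load_dataset("json", data_files="my_task_examples.jsonl")["train"]
    data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024))

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                               gradient_accumulation_steps=8, num_train_epochs=2,
                               learning_rate=2e-4, bf16=True, logging_steps=10),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("lora-out")  # only the small adapter is saved; many such
                                       # adapters can be served atop one base model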
A remarkable comment in that it is clear, confident and wrong.
Fine-tunes lead to catastrophic forgetting.
RAG is only irrelevant if you’re completely disinterested in cost and latency.
We also don’t have enough data to gauge the performance of models with >200k context windows when reasoning over inputs of that size, much of which will be irrelevant to any particular user. Multiple-random-needles-in-a-haystack tests work flawlessly, but that rarely reflects real-world activity.
> RAG is already irrelevant with large context windows
Just last Friday I took the contents of the 2024 folder of one of the teams at the company I work for, for which we use RAG at the moment. I dumped the text index, concatenated it and used Google’s API to return the token count, to see if it would fit in Gemini’s 1M context window; turned out it was 5.7M tokens. And that’s less than 3 months worth of documents for that team.
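The token-counting step itself is only a few lines against Google's API; a sketch with the google-generativeai package (the folder path and API-key handling are placeholders):

    import pathlib
    import google.generativeai as genai

    genai.configure(api_key="...")                      # your API key
    model = genai.GenerativeModel("gemini-1.5-pro")

    # Concatenate every dumped text file from the team's 2024 folder.
    docs = sorted(pathlib.Path("team_2024_dump").glob("*.txt"))
    blob = "\n\n".join(p.read_text(errors="ignore") for p in docs)

    # count_tokens tells you whether it fits in the context window
    # without paying for an actual generation call.
    print(model.count_tokens(blob).total_tokens)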
So yeah RAG is not dead yet, although I do question its usefulness, but that’s a separate topic.
I did. But this is under an enterprise deal with them that warrants privacy, not the generally available stuff. OpenAI has similar arrangements (Enterprise ChatGPT) and MS Azure before them.
There is no citation needed. It is indeed trivial to fine-tune. Doing a good job is another matter, but the claim is correct. Google around and find a blog post showing how.
For "citation needed", read "please link me to a blog post showing how, don't just tell me to Google for one".
The internet is full of blog posts about this. That doesn't mean they're actually good - I'd love to be pointed at one that has proven itself useful for someone (and definitely isn't just LLM blog-spam).
I don't care if it's trivial to fine-tune and get crap results - I care about fine-tuning where the result was worth the effort.
It's an internet forum, not an academic journal. Watertight arguments are not needed. If one wants to call BS, they can just do it; there's no need to dance around the topic by asking for a citation.
OK, I call BS. Fine-tuning an LLM is not "trivial" - especially if you want to get useful results, as opposed to just being able to say "look, I fine-tuned an LLM".
Exactly, it's a forum not Twitter/reddit. Without references and citations this is no better than a bunch of random words, and it's hard to make any argument of substance.
The person asked for citations; leave it be. Stop dudexplaining how the Internet works, please.
They didn't ask for citations. They pointed out a citation was needed. It was a clever sounding way of calling bs. They admit as much.
Even when people sincerely ask for a citation on a debatable topic on an internet forum, it's effectively saying "I won't hear any opinion that doesn't match my own unless it's as watertight as a law of physics". Another form of this is "show me the data".
Finetuning LLMs is currently the most promising path for next-gen robotics. One such work (PaLM-E), among other things, measured the impact of finetuning on general-purpose tasks: https://twitter.com/DannyDriess/status/1632904698108821505
In short, an 8B model could degrade almost 10x after being finetuned on robotics tasks, while a 500B model experiences only a very minor degradation (~4%), and there's hope that with an even larger model it could become zero or even negative (adding more experience improves general-purpose reasoning).
What I am saying is that while GPT-4 could beat a finetuned GPT-3.5 class model, I predict good things about finetuned GPT-4 class models, when they become practical outside of OpenAI/Google.
> there's hope that with an even larger model it could become zero or even negative (adding more experience improves general-purpose reasoning)
The best single work of fiction ever created about LLMs' capabilities (and, perhaps, dangers) is Colossus by Jones. Although I think the film is even better than the book, only the latter mentions how, despite being created specifically for US national defense, Colossus is also fed unrelated data, including Shakespeare's sonnets, because its creators did not know whether it might be important.
Interesting! I would like to learn more about how AI is being applied to robotics. Do you have any suggestions for how to keep up with developments/ideas in this field?
..and plan to do an updated version soon for much of what's been released since. I've also done work related to LLM and robotics integration, also on that site.
Working my way through your blog post and it is so refreshing. Unfortunately, my feed's algorithm is currently showing me takes that are extreme on either end (like the ones discussed in your blog post).
> Technology’s largest leaps occur when new tools are provided to those that want to make things.
I love this sentence, and the general attitude of curiosity in your post.
Thanks! Appreciate the kind words. In the next month or so (interviewing and finishing my Master's, so there have been delays) I should have a follow-up that covers more advancements in the router-style VLA, sensorimotor VLMs, and embedding-enriched vision models in general.
If you want a great overview of what a modern robotics stack looks like with all this, https://ok-robot.github.io/ was really good and will likely make it into the article. It's a VLA combined with existing RL methods to demonstrate multi-tasking robots, and it serves as a great glimpse into what a lot of researchers are working on. You won't see these techniques in robots in industrial or commercial settings yet - we're still too new at this to be reliable or capable enough to deploy them on real tasks.
Extremely hot LLM take: You will often get better results with few-shot prompting (with good examples) on a modern LLM than with a finetuned LLM.
Finetuning was the best option for weaker LLMs with smaller context windows (e.g. the original GPT-3): both problems have been solved nowadays.
The cost economics are much better with few-shot prompting to modern LLMs too: input tokens are super cheap (especially with the recently-released Claude Haiku), so giving a lot of examples per call will still end up cheaper than finetuning.
Meanwhile, a finetuned ChatGPT costs 4-6x of normal ChatGPT usage.
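To make that concrete, few-shot prompting is just stuffing labelled examples into the message list; a sketch against Claude Haiku (the task, labels, and examples are made up):

    import anthropic

    client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

    EXAMPLES = [
        ("Order #1234 never arrived and support won't reply.", "shipping_issue"),
        ("I was charged twice for my March invoice.", "billing_error"),
        ("How do I rotate my API keys?", "how_to_question"),
    ]

    def classify(ticket):
        shots = []
        for text, label in EXAMPLES:                 # the "few shots"
            shots.append({"role": "user", "content": text})
            shots.append({"role": "assistant", "content": label})
        resp = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            system="Reply with exactly one label: shipping_issue, billing_error, or how_to_question.",
            messages=shots + [{"role": "user", "content": ticket}],
        )
        return resp.content[0].text.strip()

    print(classify("My parcel has been stuck in customs for two weeks."))

In practice you'd use dozens of examples rather than three; at Haiku's input-token prices that's still fractions of a cent per call.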
For those who were oblivious to it, like myself: The Bitter Lesson was written by Richard S. Sutton, who pioneered reinforcement learning a long, long time ago.
As far as I know, it’s usually called an uncountable noun.
…but ‘computation’ is also uncountable, and your second sentence seems to be perfectly fine to me.
Your examples do not constitute an argument. You haven’t articulated the (purported) difference between the two words; you’ve just decided arbitrarily that some sentences don’t work, and not elaborated or explained at all.
I can make up words too, and provide example sentences: “karrotz are delicious” works. “carrots are delicious” doesn’t. “inside the karrotz” doesn’t work. “inside the carrots” does.
I don’t actually think there is any difference. The above comment about ‘brospeak’ was snarky but I do think it’s more of a cultural phenomenon than a semantic one — unless someone is willing to kindly explain the difference rather than just rolling their eyes!
What exactly is wrong with the sentence ‘this would require huge amounts of computation’? Saying ‘compute’ seems more to be a synonym of ‘computation’ that’s caught on recently than a useful gap-filling addition to the language. Again: reasoned arguments please. Or just ‘we think it sounds cool so we use it’ — that’s fine, too.
EDIT: pondering briefly, perhaps one could argue the difference is something like ‘you can own compute, but you can’t own computation.’ ‘Compute’ is the capacity to carry out computation. …although ‘compute’ seems to be used to refer to the ‘abstract’ computation being done as well as the computational resources, so I don’t know.
I’m stretching it. To be honest I’m not sure it’s a useful (or even real) distinction. I think it’s a matter of fashion, and that’s fine and normal.
Several MSFT AI/ML friends actively dissuaded me and my team from fine-tuning. They said that it's pretty clear in all their internal tests that it "lobotomizes" the general reasoning capabilities of the model, unless you're really careful.
Yes, that's what I've seen in a lot of my experiments with fine-tuning. One has to be really careful not to "lobotomize" an already capable model while still getting better results in the end. It's trickier than it seems from the many tutorials out there.
But I believe that most of the data stored in foundation models is just useless for a particular domain, so it's better to forget some of it and gain really useful information instead.
That might be true for finetuning ChatGPT 3.5, but if you can finetune a small model (7B or less) to perform on par with GPT-4, while being faster and private, that’s a different story.
By definition you wouldn’t fine-tune a 7B model to be generally as good as GPT-4. You would just be trying to overfit some small amount of functionality in a narrow domain.
Yes but from the context of this discussion, we’re trying to figure out the “sweet spot” model size where it’s worth attempting fine tuning. My guess is it’s only worthwhile for matching simple tasks with small models, and any sufficiently complicated task it’s better to do few/zero shot instead.
A low-tech example to create a good blog post title for submission to Hacker News would be a system prompt like:
You are an expert copywriter. Write five distinct blog post titles optimized for high clickthrough for Hacker News for the article the user provides.
Your response must follow the style of these titles:
- The ü/ü Conundrum
- Why isn't preprint review being adopted?
- Majority of web apps could just run on a single server
- Weather Planning for Eclipse Day
- PSChess – A chess engine in PostScript
Then provide the blog post as the user message input.
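For example, wired up with the OpenAI Python client (any chat API with a system/user split works the same way; the file path and model are placeholders):

    from openai import OpenAI

    client = OpenAI()   # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = """You are an expert copywriter. Write five distinct blog post titles optimized for high clickthrough for Hacker News for the article the user provides.

    Your response must follow the style of these titles:
    - The ü/ü Conundrum
    - Why isn't preprint review being adopted?
    - Majority of web apps could just run on a single server
    - Weather Planning for Eclipse Day
    - PSChess – A chess engine in PostScript"""

    article_text = open("blog_post.md").read()   # the post you want titles for

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": article_text},
        ],
    )
    print(resp.choices[0].message.content)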
Here are five distinct blog post titles optimized for high clickthrough on Hacker News for the article provided:
1. Tipping ChatGPT: Does Offering Monetary Incentives Improve AI Text Generation?
2. Quantifying the Impact of Incentives on Large Language Model Performance
3. Carrot or Stick? Exploring the Effects of Positive and Negative Prompts on ChatGPT
4. Gamifying AI: Using "Generation Golf" to Test ChatGPT's Ability to Follow Length Constraints
5. The Curious Case of ChatGPT's Motivations: Can an AI Be Incentivized Like Humans?
Not bad titles, although more verbose than the 5 input examples I gave. I only gave 5 for simplicity: my main point is that you can give it a lot more than five and/or be more aggressive with constraints, as the linked blog post does, incidentally.
Intuitively, prompting like this to get an answer seems basically like the first part of a fine-tuning process (more exemplars).
What is your thinking here on why reinforcing good output via loss optimization is worse than the one-shot example? Does the model start to overfit at some point toward some local minimum, and is this avoided in this scenario?
Use Gemini 1.5 Pro, which has a 1M-token context window. Prompt it with a logical question and observe it struggling to answer. Then upload a book on logical thinking in PDF format and ask the same question again. Notice how it can now answer the question effectively.
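If you want to reproduce that, it's only a few lines with the google-generativeai package (the PDF file and the question are placeholders):

    import google.generativeai as genai

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-1.5-pro")

    question = "A says B lies, B says C lies, C says A and B both lie. Who tells the truth?"

    # 1. Ask cold.
    print(model.generate_content(question).text)

    # 2. Upload a book on logical reasoning, then ask again with it in context.
    book = genai.upload_file(path="logical_thinking.pdf")
    print(model.generate_content([book, question]).text)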
This discussion is so dumb - finetuning a base model costs ~$1 with LoRA/QLoRA and can yield the same performance as GPT-4, but at 1/100th of the cost per token.
I had opened a new tab back when this comment was just a few minutes old, in hopes that when I came back there would be some really great blog post linked with the details of the sorcery.
IMO it's possible to over-generalize from this datapoint (lol). While it's true that creating a general "finance" model that's stronger than GPT-4 is hard, training a task-specific model is much easier. Eg. "a model that's better than GPT-4 at answering finance-related questions": very hard. "A model that's better than GPT-4 at extracting forward-looking financial projections in a standard format": very easy.
And in practice, most tasks people are using GPT-4 for in production are more like the latter than the former.
(Disclaimer: building https://openpipe.ai, which makes it super easy to productize this workflow).
I saw some comments here saying to check out Claude. From what I can tell, Claude hasn't figured out how to do the whole "generate Python code and run it in a Jupyter notebook" thing for math yet.
Isn't that something you get from the infrastructure surrounding the LLM? I thought the "running code" feature didn't need specific support from the LLM, besides being able to output conforming JSON or code when asked to.
That's because the Claude web UI doesn't yet have the equivalent of the ChatGPT Code Interpreter tool (though they say they're working on it). That's not about the quality of the Claude 3 Opus model, which is the model which people think compares to or beats GPT-4. It's about the tooling that has been built for ChatGPT.
Code interpreter is pretty neat, because you can tell ChatGPT to write some code and to make sure the code works, and then it'll write you some bad code, realize it's bad, and then iterate on it until it gets to a place that it's happy with. (Maybe I should say passes its test rather than anthropomorphize ChatGPT as being "happy".)
Right, like the other commenter suggested, that's an infrastructure-level thing, not a model-level thing. Given that you're talking about ChatGPT, I assume you aren't accessing GPT-3.5 or GPT-4 directly through the API but using the app or the interface provided at chat.openai.com. The magic that makes the kinds of interactions you're describing possible amounts to a bit of clever prompting sprinkled on top of some rather impressive frontend design and engineering.
Correctly prompted, even Mistral-7B can write and run code in response to questions, and it's a model that can run on laptops from half a decade ago, with two or three orders of magnitude fewer parameters than GPT-4.
> Right, like the other commenter suggested, that's an infrastructure-level thing, not a model-level thing.
By default, the ChatGPT "model" knows not to try to do math itself and instead writes code to do the math and then runs it. I get that it's set up, infrastructure-wise, to be able to run that code, but why doesn't Claude's main chat UI instead respond with "hey, do this calculation on your own since I can't", or something of that nature, rather than answering the math incorrectly?
Because ChatGPT ships with a system prompt which instructs the underlying model to do exactly that. A similar web application could be developed for Claude, and it would perform similarly with the right prompt, as it's quite good at tool use.
For example, I'm able to get Claude-3-Opus to write Python and call a Python interpreter of its own accord when questioned about time series data in some of my data analysis workflows, though I haven't glued together a pretty GUI for it yet (e.g., plots are simply saved to disk). While I haven't run into any problems around calculations yet, I'm sure it wouldn't be too hard to further refine the system prompt and ensure that all calculations are performed or checked using Python.
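A minimal sketch of that loop with the Anthropic SDK, in case it's useful; the tool name and system prompt are mine, and exec() is for illustration only - you'd want a sandboxed interpreter for anything real:

    import contextlib, io
    import anthropic

    client = anthropic.Anthropic()

    RUN_PYTHON = {
        "name": "run_python",
        "description": "Execute Python code and return whatever it prints.",
        "input_schema": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    }

    def run_python(code):
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})                    # sandbox this in real life
        return buf.getvalue()

    messages = [{"role": "user", "content": "What is 173.5 * 0.0825, to the cent?"}]
    while True:
        resp = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            system="Never do arithmetic yourself; always use the run_python tool.",
            tools=[RUN_PYTHON],
            messages=messages,
        )
        if resp.stop_reason != "tool_use":
            print(resp.content[0].text)       # final answer
            break
        tool_use = next(b for b in resp.content if b.type == "tool_use")
        result = run_python(tool_use.input["code"])
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": [{
            "type": "tool_result", "tool_use_id": tool_use.id, "content": result}]})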
Doesn't seem like you are very informed on how LLMs work, but just so you know, there are many different versions of Claude, just like how ChatGPT can use different versions of GPT.
Based on other tasks, GPT-4 overall seems to be the best, but by a very small margin, so I cancelled my subscription. The native mobile app is really great, though.
Is there a way to use Mistral-large with TTS and STT engines so you can converse with it like you can with ChatGPT in the mobile app? It's really great on long drives for learning/talking about stuff, like a customized personal podcast.
Exactly, I absolutely love this feature. And many times the conversation is quite natural and fluid (with good internet connection). I think I'll build something like that myself (:
Not sure. Most of the time GPT-4 is better. Since I'm using Vercel AI playground[1], on almost every query I get a response from all models so it's easy to compare.
I don't know what system and user prompt you are testing with, but as one anecdote, Claude 3 Opus (and only Opus) consistently gives me better coding answers than GPT-4. Maybe it's the type of stuff I am doing or how I phrase things, who knows. I was using GPT-4 since the day it came out but haven't felt like going back so far.
According to Chatbot Arena, where people vote on responses blindly and an Elo rating is determined for each LLM, GPT-4 is on top, slightly ahead of Claude 3 Opus.
Claude is excellent for brainstorming, being a thought partner, general knowledge acquisition tasks, and creative writing.
The one mixed-bag weak spot I've found is in coding -- It tends to make more "d'oh" mistakes while coding, but comes up with more creative solutions at the same time ¯\_(ツ)_/¯
One major advantage of FinGPT or Bloomberg's LLM is that the embeddings produced by the model can be used for downstream prediction tasks. GPT-4 does not expose its embeddings so it cannot be used for this.
sorry, noob here trying to make sense of this: you mean you can extract embeddings from the model file or that the embeddings are available in the repo and you can just use those files?
Kind of. You feed the LLM the input text for your prediction, extract the activations of the final layer (roughly, that layer's weights applied to the previous layers' outputs), then use that activation vector, or embedding, as the input to a separate model. That separate model can be any classifier or regression. A common use case for this is document classification.
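Rough sketch of that pattern with transformers + scikit-learn, if it helps; the open model and the toy labels are placeholders:

    import torch
    from sklearn.linear_model import LogisticRegression
    from transformers import AutoModel, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"     # any open model whose weights you can load
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
    model.eval()

    def embed(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
        with torch.no_grad():
            out = model(**inputs)
        # Final-layer activations, mean-pooled over tokens -> one fixed-size vector.
        return out.last_hidden_state.mean(dim=1).squeeze().float().cpu().numpy()

    texts = ["Q3 revenue guidance raised to $2.1B", "CFO resigns amid audit dispute"]
    labels = [1, 0]                        # e.g. positive / negative outlook
    clf = LogisticRegression().fit([embed(t) for t in texts], labels)
    print(clf.predict([embed("Company expects record earnings next quarter")]))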
Fine-tuning will disappear; there's no reason to invest so heavily in it. Also, most LLM code being written right now (prompts, RAG, etc.) will be obsolete once the next model comes out. Anyone starting an LLM application startup is arguably wasting their time; wait until the next iteration is out, then you will know what's possible.
> Also, most LLM code being written right now (prompts, RAG, etc) will be obsolete once the next model comes out.
Not true. Most prompt techniques that work on current LLMs will work on different or future models, although it will require a QA pass for any regressions.
Yeah they will work, but they will also be unnecessary. You will also bake all sorts of logic into your application that will be solved natively on the stronger model. If you believe in the scaling theory, then writing LLM applications is nonsensical.
> If you believe in the scaling theory, then writing LLM applications is nonsensical.
But not doing it is an opportunity cost. You don’t build skills, tooling, and experience, and you don’t get feedback on what works and where you should go.
It’s like computers in the 1990s: there’s always a better one 6 months away, so if you wait for it to stabilise, then you don’t do anything for a decade. Just enjoy the ride, bearing in mind that things change very fast and some things will be obsolete next year.
> Yeah they will work, but they will also be unnecessary. You will also bake all sorts of logic into your application that will be solved natively on the stronger model.
I think that a whole lot of what I do in prompt engineering is what's necessary to fully specify the output that I want.
A newer model may be less finicky, so I have a higher chance of getting it to work on the first try (and it's more reliable afterwards), but it's hard for me to imagine it needing a whole lot less prompt.
One idea is to just stuff all the documents into the prompt, which still keeps them private but avoids having to faff around with chunking, embedding, and vector stores. That’s not really the end of RAG as a concept, but it would change all the current tooling and infrastructure we’ve built for it.
I don’t think RAG is going away, at least not because of this. But I expect new techniques to become available fairly regularly.
I just think that the capability of the model could radically change, such that however you structured your RAG pipeline, it might need to be rewritten. More general problems could be solved by the model itself that you were previously solving with some complicated contraption of prompts.
Don't buy a computer today with a six month delivery lead time, because there's a company that releases computers with a same-day lead time with several improved models coming out next week.
Note that the benchmarks used for comparison are basically measuring the model’s ability to understand financial content. In other words, reading comprehension for English, just in a specific domain. It shouldn’t really be surprising that a strong generalist model performs well here.
On the other hand, GPT-4 actually did worse on the NER task - labelling and tagging terms used in the text - vs their finetuned model. I assume the finetuned model was better at using the specific labels they were targeting.
Meaning, they used the same generalized foundation model that all of us have access to, with no special fine-tuning, no retrieval-augmented generation, etc.
I don't understand your point. To me GPT-4 is not a foundation model, it's been highly tuned for the chat task. Nobody outside of OpenAI knows what that fine-tuning really involved. So it's impossible to say how much finance specific data it was trained on (in pre-training or fine-tuning) or whether finance specific tasks were involved in fine-tuning.
What I was meaning was that ChatGPT is not a foundation model since it's been fine-tuned. Although the definition in the link is sufficiently broad you could choose to include it.
I can't tell from the OpenAI docs whether it's possible to access GPT-4 without the ChatGPT fine-tuning. If so, that'd make this result more meaningful. Otherwise, I just don't think you can draw any great conclusions from this.
A lot of organizations still think they should have their own [finetuned] model to provide a custom experience to their users, so that may come as a surprise for them.
Scaling laws basically guarantee that a sufficiently larger general model will usually beat a smaller specialist model. The misunderstanding is perhaps acceptable but the headline here is essentially restating a well known property of deep learning.
Contrarian view: how these models actually operate at runtime is not understood - the formal research papers repeat that over and over again. Therefore, there will be new twists and turns as these models evolve. With current technology stacks, the "bitter lesson" is looking good, yes. Will it always be so? There's no way to know.
Yeah, the equivalent is: would it be better for a quant firm to spend 200 thousand dollars giving a first-class specialist education to a guy with an IQ of 95, or just hiring a guy with an IQ of 150 straight out of college.