The arguments for not limiting yourself to prompt engineering are compelling, but I wouldn't call the whole practice snake oil. Fundamentally, it's about learning to communicate your desires and requirements effectively.
There's definitely a fair number of simple tricks with existing models which drastically improve output quality. It's usually simple things like making requirements explicit, e.g. "generate a few foo" vs "generate 5 foo". The single most effective phrase I've picked up has been to tell the model that whatever is being output was written by an expert. But a lot of my experimentation has been dealing with generating and exploring fantasy ideas.
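A toy sketch of the kind of thing I mean, assuming the pre-1.0 openai Python SDK (the prompt wording and the fantasy-item example are just illustrations, not a recipe):

    import openai  # pip install "openai<1.0"; assumes OPENAI_API_KEY is set

    vague = "Generate a few magic items for my campaign."
    explicit = (
        "The following list was written by an expert fantasy game designer. "
        "Generate exactly 5 magic items, each with a name, a one-sentence "
        "description, and a drawback."
    )

    # Same model, same settings; only the prompt changes.
    for label, prompt in [("vague", vague), ("explicit", explicit)]:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        print(label, "->", resp["choices"][0]["message"]["content"][:200])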
Dismissing prompting is dismissing one of the biggest value propositions of RLHF in that you can affect future output in a steerable way using natural language.
Meanwhile the SOTA model in question didn't support fine tuning until last month, still only supports SFT, and costs significantly more than the base model.
The fact that the article concludes with "hire us" says it all. Great marketing strategy with the title, though.
>In just five days, through thorough data collection and fine-tuning, we not only reduced the system message to under 200 tokens but also transformed the LLM into a robust reasoning tool using a custom JSON format.
That sure sounds like it involved prompt engineering.
Well yeah, for eons, snake oils were known for their efficacious healing properties, wherever those snakes lived. It only took a few Wild West colonists to ruin it for everyone, and most of us forgot how snake oil is the real deal. So now re-read the headline with this in mind...
> We've all heard complaints about GPT-3.5 Turbo, particularly when compared to its successor, GPT-4, seemingly struggling to follow instructions. Guess what? In our experience, this is a non-issue with a properly fine-tuned GPT-3.5 Turbo model. In fact, GPT-4 can serve as the "prompt engineer" that assists in generating the training data.
This omits the very, very important detail that a finetuned gpt-3.5-turbo costs 8x as much as a normal gpt-3.5-turbo, and the output is not 8x better, especially once you apply, you guessed it, prompt engineering (such as gpt-3.5-turbo's function calling/structured data support, which is prompt engineering at its core).
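For what it's worth, here is roughly what that function calling / structured output path looks like - a minimal sketch, assuming the pre-1.0 openai Python SDK and a made-up schema. The function definitions get serialized into the model's context, which is why I'd still call it prompt engineering under the hood:

    import openai  # pip install "openai<1.0"; assumes OPENAI_API_KEY is set

    # Hypothetical schema, just for illustration.
    functions = [{
        "name": "extract_item",
        "description": "Pull a structured item out of free-form text",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
            },
            "required": ["name", "price"],
        },
    }]

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": "The widget costs $4.99."}],
        functions=functions,
        function_call={"name": "extract_item"},  # force the structured path
    )
    # The reply carries JSON arguments matching the schema above.
    print(resp["choices"][0]["message"]["function_call"]["arguments"])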
It's also missing the detail that properly finetuning a model is very hard to do well.
This article is snake oil. Its entire semantic value nets out to: "Prompt engineering is bad! You should hire us to do something else instead." It's a long-form advertisement.
I would find this article a lot more convincing if it went into more detail about the fine-tuning process they used and the results they got.
As it stands, it reads more like lead generation content for their agency as opposed to being a genuinely convincing explanation of what kind of problems can be better solved with a fine-tuned GPT 3.5 model.
I'm disappointed, because I'm desperately hungry for useful, detailed real-world success stories that use the new OpenAI fine-tuning mechanisms. This isn't that - it lacks the detail.
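For reference, the mechanism itself is mechanically simple - a rough sketch, assuming the pre-1.0 openai Python SDK, with a made-up file name. The hard part, and the part the article skips, is what goes into train.jsonl and how you evaluate the resulting model:

    import openai  # pip install "openai<1.0"; assumes OPENAI_API_KEY is set

    # train.jsonl: one chat-formatted example per line, e.g.
    # {"messages": [{"role": "system", "content": "..."},
    #               {"role": "user", "content": "..."},
    #               {"role": "assistant", "content": "..."}]}
    f = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
    job = openai.FineTuningJob.create(training_file=f["id"], model="gpt-3.5-turbo")
    print(job["id"])  # poll until done; you get back a ft:gpt-3.5-turbo-... model name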
Fine tuning for gpt-3.5 came out like a month ago. Saying that anyone who doesn't default to that doesn't know what they are doing is disingenuous. I don't think he really believes that. The fact that they had 700+ tokens in the prompt is not a reason to discount the previous attempt without fine tuning.
I feel like I should be trying out fine tuning on more problems. But this kind of article is just insulting in an inaccurate way.
I think it's fair enough if you are getting better results with fine tuning because not a lot of people do that with OpenAI models. But I would like to see some details and proof in terms of the performance benefits.
I would love to see an article like this with details that was reasonable, fair and truthful. But I think this is so far off the mark, I am flagging it.
This is an extremely bad article, written with fancy words, filled with bad advice, and delivered with a holier-than-thou tone. Are people just upvoting this submission based on the headline?
Have fun fine tuning a new model every 6 months as ClosedAI deprecates models. Have fun paying 8x the cost for inference.
It's absolutely the case that LLMs give output that's inconsistent in quality, and the determining factor is your prompt. This is true for base models, fine tuned models, and everything else.
Obviously other things matter too, but the lowest hanging fruit is usually "prompt better."
I think the author of this post probably meant to caveat that what we call "prompt engineering" TODAY might tend towards snake oil, but prompt engineering _doesn't have to_ be snake oil, and it _doesn't have to_ promote black-box mentalities[1]. What's more, fine-tuning is certainly not a panacea - it's not particularly great at injecting net-new context into these foundation models. It's great when you want to "close the aperture" a bit on model outputs. Even suggesting that fine-tuning is somehow a replacement for crafting prompts is just incorrect.
A prompt is everything to an LLM; there is no other interaction. It would be like saying that knowing the details of bash is snake oil for a sysadmin. If you are not doing some sort of prompt engineering, you are not getting the most out of your LLM.
And sure, prompt engineering doesn't replace fine-tuning. To continue with the bash analogy, you can't do everything effectively with bash, sometimes you need to write code in C or other languages. But that you can write C code doesn't make using bash effectively snake oil.
Fine tuning for GPT came out a whole month ago. Before that, the choice was mostly between prompt engineering with GPT or fine-tuning dramatically inferior "open source" models, right?
... Fine tuning for OpenAI's GPT LLMs has been available for years now, at least since the GPT-3 private beta if not earlier (and obviously you could train the open models yourself).
That's true, but it was expensive and until recently you could only tune older versions of GPT-3 lacking both instruction tuning and the code pre-training of the Codex models (from which GPT-3.5 is thought to descend). You had to want tuning so badly you were willing to 6x the token cost and go back in time 2 years.
So... the article says that bad prompt engineering is bad and they engineered the prompt to be better and therefore prompt engineering is snake oil? I'm confused.
Ehhh this article basically says “we are better at prompt engineering than some random other guy who sucked. Hire us.”
It’s a clever ad, but an ad nonetheless.
The author tries to turn the idea of prompt engineering into something so banal as to be meaningless, but then describes their follow up work in a way which I would consider to also be classified under “prompt engineering.”
Prompt engineering directly correlates with good communication and knowing how to steer a conversation. There are no tricks to the game, but the number of people I talk to who don't know how to get what they want from an LLM clearly shows that it's a skill set to be acquired.
The issue is with the name. Perhaps prompt tinkering would have been a better choice.
Yeah in general I'm a little repulsed by things with 'engineering' tacked on the end. Tech seems to love borrowing credibility by trading on existing names.
I am sorry, but being $5,000 deep into LangChain means you had already entered a failure mode.
You fixed it and went further, etc., but most teams do not. So the conclusions about what Prompt Engineering is or isn't may seem definitive, but what you've described barely scratches the surface of Prompt Engineering.