The case for the return of fine-tuning

deepsquirrelnet · 2025-10-19T18:37:06 1760899026

I go back and forth on this. A year ago, I was optimistic and I have had 1 case where RL fine tuning a model made sense. But while there are pockets of that, there is a clash with existing industry skills. I work with a lot of machine learning engineers and data scientists and here’s what I observe.

- many, if not most MLEs that got started after LLMs do not generally know anything about machine learning. For lack of clearer industry titles, they are really AI developers or AI devops

- machine learning as a trade is moving toward the same fate as data engineering and analytics. Big companies only want people using platform tools. Some ai products, even in cloud platforms like azure, don’t even give you the evaluation metrics that would be required to properly build ml solutions. Few people seem to have an issue with it.

- fine tuning, especially RL, is packed with nuance and details… lots to monitor, a lot of training signals that need interpretation and data refinement. It’s a much bigger gap than training simpler ML models, which people are also not doing/learning very often.

- The limited number of good use cases means people are not learning those skills from more senior engineers.

- companies have gotten stingy with sme-time and labeling

What confidence do companies have in supporting these solutions in the future? How long will you be around and who will take up the mantle after you leave?

AutoML never really panned out, so I’m less confident that platforming RL will go any better. The unfortunate reality is that companies are almost always willing to pay more for inferior products because it scales. Industry “skills” are mostly experience with proprietary platform products. Sure they might list “pytorch” as a required skill, but 99% of the time, there isn’t hardly anyone at the company that has spent any meaningful time with it. Worse, you can’t use it, because it would be too hard to support.

daemonologist · 2025-10-19T21:10:26 1760908226

Labels are so essential - even if you're not training anything, being able to quickly and objectively test your system is hugely beneficial - but it's a constant struggle to get them. In the unlikely event you can get budget and priority for an SME to do the work, communicating your requirements to them (the need to apply very consistent rules and make few errors) is difficult and the resulting labels tend to be messy.

More than once I've just done labeling "on my own time" - I don't know the subject as well but I have some idea what makes the neurons happy, and it saves a lot of waiting around.

I've found tuning large models to be consistently difficult to justify. The last few years it seems like you're better off waiting six months for a better foundation model. However, we have a lot of cases where big models are just too expensive and there it can definitely be worthwhile to purpose-train something small.

hommes-r · 2025-10-20T06:59:30 1760943570

My personal opinion is that true engineering, which revolves around turning complex theory into working practice, has seen a decline in grace. Why spend a lot of time trying to master the art of engineering if you can ride the wave of engineering services and get away with it?

In true hacker spirit, I don't think trying to train a model on a wonky GPU is something that needs an ROI for the individual engineer. It's something they do because they yearn to acquire knowledge.

sdenton4 · 2025-10-19T19:01:43 1760900503

Eventually someone will make a killing on doing actual outcome measurements instead of just trusting the LLMs, Michael Lewis will write a popular book about it, and the cycle will begin anew...

XenophileJKO · 2025-10-19T19:35:49 1760902549

I'm also seeing teams who expected big gains from fine tuning get incremental or moderate gains. Then they put it in production and regret the action as SOTA marches quickly.

I have avoided fine tuning because the models are currently improving at a rate that exceeds big corporate product development velocity.

deepsquirrelnet · 2025-10-19T19:48:45 1760903325

Absolutely the first thing you should try is a prompt optimizer. The GEPA optimizer (implemented in DSPy) often outperforms GRPO training[1]. But I think people are usually building with frameworks that aren't machine learning frameworks.

[1] https://arxiv.org/abs/2507.19457

simonw · 2025-10-19T16:20:11 1760890811

I ran a survey on Twitter over the past few days asking for successful case studies that produced economically valuable results from fine-tuning LLMs.

I ask a version of this every six months or so, and usually the results are quite disappointing.

This time I had more credible replies than I have had in the past.

Here's my thread with highlights: https://twitter.com/simonw/status/1979254349235925084

And in a thread viewer for people who aren't signed into Twitter: https://twitter-thread.com/t/1979254349235925084

Some of the most impressive:

Datadog got <500ms latency for their language natural querying feature, https://twitter.com/_brimtown/status/1979669362232463704 and https://docs.datadoghq.com/logs/explorer/search/

Vercel run custom fine-tuned models on v0 for Next.js generation: https://vercel.com/blog/v0-composite-model-family

Shopify have a fine-tuned vision LLM for analyzing product photos: https://shopify.engineering/leveraging-multimodal-llms

donkeyboy · 2025-10-19T17:45:07 1760895907

Finetuning is pretty much necessary for regression tasks. Also useful for classification since you can get the direct probabilities in case you want to do some thresholding.

daxfohl · 2025-10-19T22:23:38 1760912618

I imagine it's pretty bad risk to reward ratio for most companies. Especially when just tossing some stuff into your system prompt is an option.

simonw · 2025-10-19T23:18:37 1760915917

Yeah, that's my assumption too. Fine-tuning is really expensive in terms of skills and time needed to attempt it, and there's a very real chance that your attempts will fail to make a meaningful improvement over being smarter with your prompts.

Even worse, even if you DO get an improvement you are likely to find that it was a waste of time in a month or two when then next upgraded version of the underlying models are released.

The places it makes sense from what I can tell are mainly when you are running so many prompts that the cost saving by running a smaller, cheaper model can outweigh the labor and infrastructure costs involved in getting it to work. If your token spend isn't tens (probably hundreds) of thousands of dollars you're unlikely to save money like this.

If it's not about cost saving, the other reasons are latency and being able to achieve something that the root model just couldn't do.

Datadog reported a latency improvement, because fine-tuning let them run a much smaller (and hence faster) model. That's a credible reason if you are building high value features that a human being is waiting on, like live-typing features.

The most likely cases I've heard of for getting the model to do something it just couldn't do before mainly involve vision LLMs, which makes sense to me - training a model to be able to classify images that weren't in the training set might make more sense than stuffing more example images into the prompt (though models like Gemini will accept dozens of not hundreds of comparable images in the prompt, which can then benefit from prompt caching).

The last category is actually teaching it a new skill. The best example here are low-resource programming languages - Jane Street and OCaml or Morgan Stanley and Q for example.

Jane Street OCaml: https://www.youtube.com/watch?v=0ML7ZLMdcl4

Morgan Stanley Q: https://huggingface.co/morganstanley/qqWen-1.5B-SFT

daxfohl · 2025-10-20T01:59:50 1760925590

Have you heard of any attempts to bake MCP definitions into LoRA adapters? I've been wondering if that's a viable approach, so you don't have to put them all in context, and toggling them on and off would just be a matter of applying or unapplying the weights. That seems like it'd be more robust than putting "enable FooMCP" "disable FooMCP" etc in the context, which I'd think would trip up the LLM eventually. And it would avoid full rebuild of the KV cache that'd be required if you fully removed FooMCP from the context prefix.

Depending on use case you could either insert the LoRA weights as their own layers at runtime (no time to create, but extra layer to compute each token), merge them with existing layers (initial delay to merge layers, but no runtime penalty after), or have pre-merged models for common cases (no perf penalty but have to reserve more storage).

simonw · 2025-10-20T04:18:49 1760933929

I've not heard of anyone trying that, but I don't think I've been looking in the right kinds of places.

My current mental model of LoRA is that this would be unlikely to Work, but I've never used them so I don't really know what I'm talking about. Would be a very interesting experiment!

CaptainOfCoit · 2025-10-19T16:24:10 1760891050

If people have ideas for use cases where fine-tuning can make a big difference, but don't have the time/resources to try it out yourself yet want to see if it'll work, feel free to share your ideas as I'm currently creating a bunch of examples of this and could use some inspiration, I only have 3 real/confirmed use cases as of right now.

coredog64 · 2025-10-19T18:44:06 1760899446

Something that's in my personal backlog is fine-tuning of TrOCR for purse seine observer workbooks. The default TrOCR is expecting English words, and so the FAO species codes used in the workbook result in terrible accuracy. LLMs do poorly in this space because you'll commonly see repeats (e.g. 100 out of 120 samples all have the same species code) which then leads to hallucination.

CaptainOfCoit · 2025-10-20T10:44:04 1760957044

You might enjoy this :) https://news.ycombinator.com/item?id=45640594 (DeepSeek OCR)

leobg · 2025-10-19T18:44:40 1760899480

Many ppl think to fine tune an LLM on domain knowledge means to feed it chunked text of, say, psychology books. That is, of course, a wrong application if your goal is for the model to become an expert psychologist. You want the behavior of applying psychology, but you are training the behavior to write about it. TL;DR, many fine tuning fails are due to wrong dataset curation. On the orher hand, if yiu get the dataset right, you can get a 7B model outperform a 180B one.

ACCount37 · 2025-10-19T19:14:10 1760901250

Transfer learning is a thing. But the issue with the gap is that the datasets for "applying X" aren't easy to come by.

ijk · 2025-10-19T19:36:42 1760902602

There is an awful lot of "looking for my keys under the street light" going around these days. I've seen a bunch of projects proposed that are either based on existing data (but have no useful application of that data) or have a specific application (but lack the data and evaluation required to perform that task). It doesn't matter how good your data is if no one has any use for things like it, and it doesn't matter how neat your application would be if the data doesn't match.

I'm including things like RL metrics as data here, for lack of a better umbrella term, though the number of proposed projects that I've seen that decided that ongoing evaluation of actual effectiveness was a distraction from the more important task of having expensive engineers make expensive servers into expensive heatsinks is maddening.

ACCount37 · 2025-10-20T00:16:01 1760919361

The importance of having good metrics cannot be overstated.

On the "applying X" problem - this almost feels to me like another argument against fine tuning? Because it seems like Applying can be a surprisingly broad skill, and frontier lab AIs are getting good at Applying in a broad fashion.

meander_water · 2025-10-19T12:16:39 1760876199

A couple of examples I have seen recently which makes me agree with OP:

- PaddleOCR, a 0.9B model that reaches SOTA accuracy across text, tables, formulas, charts & handwriting. [0]

- A 3B and 8B model which performs HTML to json extraction at GPT-5 level accuracy at 40-80x less cost, and faster inference. [1]

I think it makes sense to fine tune when you're optimizing for a specific task.

[0] https://huggingface.co/papers/2510.14528

[1] https://www.reddit.com/r/LocalLLaMA/comments/1o8m0ti/we_buil...

alansaber · 2025-10-19T13:55:10 1760882110

This comes back to the SLM vs LLM debate (sizes in relative terms), where an SLM can be optimised for a specific task, and out-perform an LLM. But it's not worth it (time, effort) for most tasks unless 1. they are very sensitive to precision or 2. it is ultra-high volume.

soVeryTired · 2025-10-19T13:01:49 1760878909

Have you used PaddleOCR? I'm surprised they're claiming SOTA without comparing against Amazon Textract or Azure doc intelligence (LayoutLM v3 under the hood, as far as I know).

I've played around with doc recognition quite a bit, and as far as I can tell those two are best-in-class.

mejutoco · 2025-10-19T18:14:20 1760897660

Amazon textract is not great at multi colum layouts in my experience. Docupanda or some azure models beat it. Just my 2 cents.

gdiamos · 2025-10-19T14:16:19 1760883379

Just coming out of founding one of the first LLM fine tuning startups - Lamini - I disagree

Our thesis was that fine tuning would be easier than deep learning for users to adopt because it was starting from a very capable base LLM rather than starting from scratch

However, our main finding with over 20 deployments was that LLM fine tuning is no easier to use than deep learning

The current market situation is that ML engineers who are good enough at deep learning to master fine tuning can found their own AI startup or join Anthropic/OpenAI. They are underpaid building LLM solutions. Expert teams building Claude, GPT, and Qwen will out compete most users who try fine tuning on their own.

RAG, prompt engineering, inference time compute, agents, memory, and SLMs are much easier to use and go very far for most new solutions

bjornsing · 2025-10-19T14:25:56 1760883956

Will Anthropic/OpenAI really hire anyone who can fine-tune an LLM?

gdiamos · 2025-10-19T14:32:26 1760884346

They will hire anyone who can produce a model better than GPT5, which is the bar for fine tuning

Otherwise, you should just use gpt5

Preparing a few thousands training examples and pressing fine tune can improve the base LLM in a few situations, but it also can make the LLM worse at other tasks in hard to understand ways that only show up in production because you didn’t build evals that are good enough to catch them. It also has all of the failure modes of deep learning. There is a reason why deep learning training never took off like LLMs did despite many attempts at building startups around it.

Andrej karpathy has a rant about it that captures some of the failure modes of fine tuning - https://karpathy.github.io/2019/04/25/recipe/

criemen · 2025-10-19T17:31:53 1760895113

> They will hire anyone who can produce a model better than GPT5, which is the bar for fine tuning

Depends on what you want to achieve, of course, but I see fine-tuning at the current point in time primarily as a cost-saving measure: Transfer GPT5-levels of skill onto a smaller model, where inference is then faster/cheaper to run. This of course slows down your innovation cycle, which is why generally this is imo not advisable.

gdiamos · 2025-10-19T19:41:59 1760902919

I agree this is the main case where it makes sense.

But a recent trend that cut into the cost savings is that foundation model companies have started releasing small models. So you can build a use case with qwen 235B, then shrink down to 30B, or even all the way down to 0.6B if you really want to.

The smaller models lose some accuracy, but some use cases are solvable even by these smaller and much more efficient models.

kgwgk · 2025-10-19T19:34:14 1760902454

> but it also can make the LLM worse at other tasks

The problem is easily avoided by not using it for other tasks.

gdiamos · 2025-10-19T19:39:00 1760902740

Users often found it hard to know exactly where the boundaries are.

This is a reason why general purpose models shine. You don’t have to carefully characterize a task and put guard rails around it.

kgwgk · 2025-10-19T19:51:44 1760903504

There is also a reason why you don’t have general purpose applications. Most users understand that Excel is for data tables and Paint is for images even though some people have fun playing with the boundary and creating Excel paintings.

gdiamos · 2025-10-19T20:06:49 1760904409

This is exactly the intuition that leads to excitement about fine tuning.

However, I personally think that this intuition applies to products and interfaces, not to AI.

Intelligence and learning is general. Intelligence without generalization is memorization, which seems to be less useful in practice.

kgwgk · 2025-10-19T21:40:20 1760910020

What people use are products and interfaces, not "AI".

fragmede · 2025-10-20T06:32:59 1760941979

Interesting you bring up Excel. ChatGPT's chat interface is going to be Excel for the AI era. Everyone knows there's a better interface to be had, but it just works.

kgwgk · 2025-10-20T11:26:54 1760959614

In the pre-AI era Excel was not « the » interface. Most people didn’t use Excel at all!

yunwal · 2025-10-19T14:47:09 1760885229

It’s quite easy to produce a model that’s better than GPT-5 at arbitrarily small tasks. As of right now, GPT-5 can’t classify a dog by breed based on good photos for all but the most common breeds, which is like an AI-101 project.

gdiamos · 2025-10-19T14:53:09 1760885589

Try doing a head to head comparison using all LLM tricks available including prompt engineering, rag, reasoning, inference time compute, multiple agents, tools, etc

Then try the same thing using fine tuning. See which one wins. In ML class we have labeled datasets with breeds of dogs hand labeled by experts like Andrej, in real life users don’t have specific, clearly defined, and high quality labeled data like that.

I’d be interested to be proven wrong

I think it is easy for strong ML teams to fall into this trap because they themselves can get fine tuning to work well. Trying to scale it to a broader market is where it fell apart for us.

This is not to say that no one can do it. There were users who produced good models. The problem we had was where to consistently find these users who were willing to pay for infrastructure.

I’m glad we tried it, but I personally think it is beating a dead horse/llama to try it today

mountainriver · 2025-10-19T20:19:45 1760905185

There are tons of problems this simply doesn’t apply to. In the limited API world this may be true but agents are far from reliable

yunwal · 2025-10-19T15:02:02 1760886122

I mean, at the point where you’re writing tools to assist it, we are no longer comparing the performance of 2 LLMs. You’re taking a solution that requires a small amount of expertise, and replacing it with another solution that requires more expertise, and costs more. The question is not “can fine tuning alone do better than every other trick in the book plus a SOTA LLM plus infinite time and money?” The question is: “is fine tuning useful?”

gdiamos · 2025-10-19T15:28:29 1760887709

Fair didn’t seem to matter to users who just wanted to build solutions with reasonable time and budget

echelon · 2025-10-19T14:59:53 1760885993

If your customers can't fine tune, do it for them instead.

gdiamos · 2025-10-19T15:13:01 1760886781

How can you hire enough people to scale that while making the economics work?

Why would they join you rather than founding their own company?

CaptainOfCoit · 2025-10-19T16:28:20 1760891300

> How can you hire enough people to scale that while making the economics work?

Once you (as in you the person) have the expertise, what you need all the people for exactly? To fine-tuning you need to figure out the architecture, how to train, how to infer, pick together the dataset and then run the training (optionally setup a pipeline so the customer can run the "add more data -> train" process themselves). What in this process you need to hire so many people for?

> Why would they join you rather than founding their own company?

Same as always, in any industry, not everyone wants to lead and not everyone wants to follow.

gdiamos · 2025-10-19T17:59:32 1760896772

llm.finetune(data) is a leaky abstraction

Read Andrej’s blog that I linked earlier in the thread if you want to understand why.

CaptainOfCoit · 2025-10-19T18:52:53 1760899973

If it works it works? :shrug:

gdiamos · 2025-10-19T19:44:24 1760903064

The problem is that it doesn’t always work and when it does fail it fails silently.

Debugging requires knowing some small detail about your data distribution or how you did gradient clipping which take time and painstakingly detailed experiments to uncover.

CaptainOfCoit · 2025-10-20T09:54:57 1760954097

> The problem is that it doesn’t always work and when it does fail it fails silently.

Right, but why does that mean you need more employees? You need to figure out how to surface failures, rather than just adding more meat to the problem.

echelon · 2025-10-19T15:24:36 1760887476

> How can you hire enough people to scale that while making the economics work?

Pick the right customers.

> Why would they join you rather than founding their own company?

The network effects of having enough resources in one place. For having other teams deal with the training data, infrastructure, deployment, etc.

gdiamos · 2025-10-19T15:30:39 1760887839

I think you are saying to go after the very high end of the market.

That’s fair, one market segment of this is sometimes called sovereign compute.

Another common model that I have seen is to become the deepmind for one very large and important customer.

I think this works.

Der_Einzige · 2025-10-20T17:22:01 1760980921

Yup! That's why civit.ai doesn't exist right?

They'll pay for anyone that can personalize models to be meaningfully diverse.

danielmarkbruce · 2025-10-20T02:36:08 1760927768

I think you misunderstand what they are saying - doing a good job of fine tuning is difficult.

Training an LLM from scratch is trivial - training a good one is difficult. Fine tuning is trivial - doing a good job is difficult. Hitting a golf ball is trivial - hitting a 300 yard drive down the middle of the fairway is difficult.

echelon · 2025-10-19T14:58:41 1760885921

What models did you try to find tune? Were the models at the time even good enough to fine tune? Did they suffer from catastrophic forgetting?

We have a lot of more capable open source models now. And my guess is that if you designed models specifically for being fine tuned, they could escape many of the last generation pitfalls.

Companies would love to own their own models instead of renting from a company that seeks to replace them.

gdiamos · 2025-10-19T15:25:04 1760887504

We used the best models available and went from the Pythia/gpt2 to Deepseek generations.

One annoying part was switching to new and better models that came out literally every week.

I don’t think it substantially changes anything. If anything I think the release of more advanced models like qwen-next makes things like fp4, moe, and reasoning tokens an even higher barrier of entry.

empiko · 2025-10-19T11:56:06 1760874966

Fine-tuning is a good technique to have in a toolbox, but in reality, it is feasible only in some use cases. On one hand, many NLP tasks are already easy enough for LLMs to have near perfect accuracy and fine tuning is not needed. On the other hand, really complex tasks are really difficult to fine-tune and clevem data collection might be pretty expensive. Fine-tuning can help with the use cases somewhere in the middle, not too simple, not too complex, feasible for data collection, etc.

coldtea · 2025-10-19T13:56:02 1760882162

>Fine-tuning is a good technique to have in a toolbox, but in reality, it is feasible only in some use cases.

Yes, 100s of housands of them

brulard · 2025-10-19T21:39:19 1760909959

Care to elaborate what are some of those use cases?

coldtea · 2025-10-19T23:40:28 1760917228

Almost everything, isn't it?

From fine-tuning for coding assist, to medical applications, customer support, legal and financial use cases, various classification tasks, for government work, statistics, language learning, music, education, even for role playing game character AI...

I'd rather have a fine-tuned model specialized to any of those tasks and countless others when I'm doing one of those tasks, than a jack of all trades...

libraryofbabel · 2025-10-19T13:08:46 1760879326

What would you say is an example of one of those “middle” tasks it can help with?

CaptainOfCoit · 2025-10-19T13:16:53 1760879813

An example I just found worked very well with fine-tuning: I wanted to extract any frame that contained a full-screen presentation slide from a various videos I've archived, only when it's full-screen, and also not capture videos, and some other constraints.

Naturally I reached for CLIP+ViT which got me a ~60% success rate out of the box. Then based on that, I created a tiny training script that read `dataset/{slide,no_slide}` and trained a new head based on that. After adding ~100 samples of each, the success rate landed at 95% which was good enough to call it done, and circle back to iterate once I have more data.

I ended up with a 2.2K large "head_weights.safetensors" that increased the accuracy by ~35% which felt really nice.

melpomene · 2025-10-19T11:07:17 1760872037

This website loads at impressive speeds (from Europe)! Rarely seen anything more snappy. Dynamic loading of content as you scroll, small compressed images without looking like it (webp). Well crafted!

hshdhdhehd · 2025-10-19T11:17:30 1760872650

Magic of a CDN? Plus avoiding JS probably. Haven't checked source though.

stefanwebb · 2025-10-19T20:34:50 1760906090

Here's a blog post I wrote last week on the same topic: https://blog.oumi.ai/p/small-fine-tuned-models-are-all-you

I discuss a large-scale empirical study of fine-tuning 7B models to outperform GPT-4 called "LoRA Land", and give some arguments in the discussion section making the case for the return of fine-tuning, i.e. what has changed in the past 6 months

willybraun · 2025-10-19T20:51:25 1760907085

insightful, thanks

daxfohl · 2025-10-19T20:26:21 1760905581

Could you use LoRA adapters to free up your context with all the stuff that normally has to go into it? Coding standards and fuzzy preferences like "prefer short names" or "prefer functional style", reference materials, MCP definitions, etc.?

For training data, I was thinking you could just put all the stuff into context, then give it some prompts, and see how the responses differ over the baseline context. You could feed that into the fine tuner either as raw prompt and the output from the full-context model, or as like input="refactor {output from base model}", output="{output from full-context model}".

My understanding is that LoRA are composable, so in theory MCPs could be deployed as LoRA adapters. Then toggling on and off would not require any context changes. You just enable or disable the LoRA adapter in the model itself. Seems like this would help with context poisoning too.

funfunfunction · 2025-10-19T13:48:01 1760881681

Creator of inference.net / schematron here.

There is growing emphasis on efficiency as more companies adopt and scale with LLMs in their products.

Developers might be fine paying GPT-5-Super-AGI-Thinking-Max prices to use the very best models in Cursors, but (despite what some may think about Silicon Valley), businesses do care about efficiency.

And if you can fine-tune an 8b-parameter Llama model on GPT-5 data in < 48 hours and save $100k/mo, you're going to take that opportunity.

qrios · 2025-10-19T14:14:22 1760883262

> Finally, companies may have reached the ceiling of what can be achieved with prompting alone. Some want models that know their vocabulary, their tone, their taxonomy, and their compliance rules.

Together with speed and const, this is from my point of view this is the only "case" for the return of fine-tuning here. And this can be managed by context management.

With growing context sizes, first RAG replaced fine-tuning and later even RAG was replaced by just a good-enough prompt preparation for more and more usage pattern.

Sure, speed and costs are important drivers. But like with FPGAs vs. CPUs or GPUs, the development costs and delivery time for high-performance solutions, eliminate the benefit most the time.

oli5679 · 2025-10-19T11:01:05 1760871665

The OpenAI fine-tuning api is pretty good - you need to label an evaluation benchmark anyway to systematically iterate on prompts and context, and it’s often creates good results if you give it a 50-100 examples, either beating frontier models or allowing a far cheaper and faster model to catch up.

It requires no local gpus, just creating a json and posting to OpenAI

https://platform.openai.com/docs/guides/model-optimization

deaux · 2025-10-19T12:05:09 1760875509

They don't offer it for GPT-5 series, as a result much of the time fine-tuning Gemini 2.5-Flash is a better deal.

aininja · 2025-10-19T21:46:28 1760910388

2026 will be the year of specialized SLMs...enterprises care about more IP ownership/control, lower costs, and higher quality than the slow and expensive generic models that were not optimized for their use cases.

madiator · 2025-10-19T14:19:57 1760883597

I wrote about this recently as well: https://madiator.substack.com/p/finetuning-is-so-back

lorenzohess · 2025-10-19T15:00:34 1760886034

And here I am thinking we'd be discussing the teleological argument.

leblancfg · 2025-10-19T12:59:30 1760878770

Fine tuning was never really hard to do locally if you had the hardware. What I’d like to read in an article like this is more details into why they’re making a comeback.

Curious to hear others’ thoughts on this

AYBABTME · 2025-10-19T15:13:52 1760886832

Which minimum hardware spec would qualify as making this not really hard to do locally?

psadri · 2025-10-19T15:32:06 1760887926

Lots of caveats here in the following statement: if your application is not fully leaning in to frontier model capabilities, you are probably building a previous generation product.

jsight · 2025-10-19T14:07:34 1760882854

Return? Did it run away?

I don't think anyone thought fine tuning was dead.

marcosdumay · 2025-10-19T18:42:05 1760899325

There were many comments claiming that from around the end of 2023 to shortly before ChatGPT 5 was launched.

The main claim was that new models were much better than anything you could get your hands on to fine tune.

IMO, intuitively that never made sense. But I never tested it either.

spacecadet · 2025-10-19T15:16:35 1760886995

For some of us fine-tuning is a constant activity...

CuriouslyC · 2025-10-19T11:08:39 1760872119

Fine tuning by pretraining over a RL tuned model is dumb AF. RL task tuning works quite well.

HarHarVeryFunny · 2025-10-19T12:00:20 1760875220

You may have no choice in how the model you are fine tuning was trained, and may have no interest in verticals it was RL tuned for.

In any case, platforms like tinker.ai support both SFT and RL.

CuriouslyC · 2025-10-19T12:31:06 1760877066

Why would you choose a model where the trained in priors don't match your use case? Also, keep in mind that RL'd in behavior includes things like reasoning and how to answer questions correctly, so you're literally taking smart models and making them dumber by doing SFT. To top it off, SFT only produces really good results when you have traces that closely model the actual behavior you're trying to get the model to display. If you're just trying to fine tune in a knowledge base, a well tuned RAG setup + better prompts win every time.

imcritic · 2025-10-19T12:36:53 1760877413

Because you need a solution for your problem and the available tools are what they are and nothing else and you don't have enough resources to train your own model.