
It is not unfeasible. It is absolutely realistic to do distributed finetuning of an 8B text model on previous generation hardware. You can add finetuning to your set of options for about the cost of one FTE - up to you whether that tradeoff is worth it, but in many places it is. The expertise to pull it off is expensive, but to get a mid-level AI SME capable of helping a company adopt finetuning, you are only going to pay about the equivalent of 1-3 senior engineers.

Expensive? Sure, all of AI is crazy expensive. Unfeasible? No.
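For what it's worth, here is a minimal sketch of what that looks like in practice, assuming parameter-efficient finetuning (LoRA) via Hugging Face transformers + peft; the model name, hyperparameters, and launch command are placeholders, not a recommendation:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    # Any ~8B base model works here; this name is just an example.
    model_name = "meta-llama/Meta-Llama-3-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_name)  # for preparing your dataset
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    # LoRA trains a few million adapter weights instead of all 8B parameters,
    # which is what makes previous-generation GPUs (and modest clusters) workable.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # typically well under 1% of total weights

    # From here, plug `model` into your usual training loop or Trainer, and launch
    # across machines with e.g. `torchrun --nnodes=2 --nproc_per_node=8 train.py`.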



I don't consider a small 8B model to be worth fine-tuning. Fine-tuning is worthwhile when you have a larger model with capacity to add data, perhaps one that can even grow its layers with the data. In contrast, fine-tuning a small saturated model will easily cause it to forget older information.

All things considered, as much as I think fine-tuning would be nice, it will remain significantly more expensive than just making RAG or search calls. I say this as a fan of fine-tuning.


> I don't consider a small 8B model to be worth fine-tuning.

Going to have to disagree with you on that one. A modern 8B model that has been trained on enough tokens is ridiculously powerful.


A well-trained 8B model will already be over-saturated with information from the start. It will therefore easily forget a lot of old information when you fine-tune it on new material. It just doesn't have the capacity to absorb much more.

Don't get me wrong. I think a 70B or larger model would be worth fine-tuning, especially if it can be grown further with more layers.


> A well-trained 8B model will already be over-saturated with information from the start

Any evidence of that I could look at? It doesn't match what I've seen, nor is it something I've heard from the world-class researchers I've worked with. Would be interested to learn more.


Upon further thought, if fine-tuning involves adding layers, then the initial saturation shouldn't matter. Say an 8B model adds 0.8 * 2 = 1.6B parameters in new layers for fine-tuning; then, under some assumptions, a ballpark is that this could absorb roughly 16 million articles.
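To make the arithmetic explicit (the parameters-per-article figure is purely an assumption on my part, not something measured):

    base_params = 8e9            # 8B base model
    added_params = 0.8e9 * 2     # 1.6B parameters in the added layers
    params_per_article = 100     # assumed effective storage cost per article
    print(added_params / params_per_article)  # 16,000,000 articles, give or take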


The reason to fine-tune is to get a model that performs well on a specific task. It could lose 90 percent of its knowledge and still beat the untuned model at the narrow task at hand. That's the point, no?


It is not really possible to lose 90% of one's brain and do well on certain narrow tasks. If the tasks truly were so narrow, you would be better off training a small model just for them from scratch.



