
It is not unfeasible. It is absolutely realistic to do distributed finetuning of an 8B text model on previous generation hardware. You can add finetuning to your set of options for about the cost of one FTE - up to you whether that tradeoff is worth it, but in many places it is. The expertise to pull it off is expensive, but to get a mid-level AI SME capable of helping a company adopt finetuning, you are only going to pay about the equivalent of 1-3 senior engineers.

Expensive? Sure, all of AI is crazy expensive. Unfeasible? No.
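For what it's worth, here is a minimal sketch of what that looks like in practice, assuming parameter-efficient finetuning (LoRA) via Hugging Face transformers + peft; the model name, hyperparameters, and launch command are placeholders, not a recommendation:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    # Any ~8B base model works here; this name is just an example.
    model_name = "meta-llama/Meta-Llama-3-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_name)  # for preparing your dataset
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    # LoRA trains a few million adapter weights instead of all 8B parameters,
    # which is what makes previous-generation GPUs (and modest clusters) workable.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # typically well under 1% of total weights

    # From here, plug `model` into your usual training loop or Trainer, and launch
    # across machines with e.g. `torchrun --nnodes=2 --nproc_per_node=8 train.py`.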



I don't consider a small 8B model to be worth fine-tuning. Fine-tuning is worthwhile when you have a larger model with capacity to add data, perhaps one that can even grow its layers with the data. In contrast, fine-tuning a small saturated model will easily cause it to forget older information.

All things considered, as much as I think fine-tuning would be nice, it will remain significantly more expensive than just making RAG or search calls. I say this as a fan of fine-tuning.


> I don't consider a small 8B model to be worth fine-tuning.

Going to have to disagree with you on that one. A modern 8B model that has been trained on enough tokens is ridiculously powerful.


A well-trained 8B model will already be over-saturated with information from the start. It will therefore easily forget a lot of old information when you fine-tune it on new material. It just doesn't have the capacity to absorb much more.

Don't get me wrong. I think a 70B or larger model would be worth fine-tuning, especially if it can be grown further with more layers.


> A well-trained 8B model will already be over-saturated with information from the start

Any evidence of that I could look at? It doesn't match what I've seen, nor is it something I've heard from the world-class researchers I've worked with. Would be interested to learn more.


Upon further thought, if fine-tuning involves adding layers, then the initial saturation shouldn't matter. Say an 8B model adds 0.8 * 2 = 1.6B parameters in new layers for fine-tuning; then, under some assumptions, a ballpark is that this could absorb roughly 16 million articles.
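To make the arithmetic explicit (the parameters-per-article figure is purely an assumption on my part, not something measured):

    base_params = 8e9            # 8B base model
    added_params = 0.8e9 * 2     # 1.6B parameters in the added layers
    params_per_article = 100     # assumed effective storage cost per article
    print(added_params / params_per_article)  # 16,000,000 articles, give or take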


The reason to fine-tune is to get a model that performs well on a specific task. It could lose 90 percent of its knowledge and still beat the untuned model at the narrow task at hand. That's the point, no?


It is not really possible to lose 90% of one's brain and do well on certain narrow tasks. If the tasks truly were so narrow, you would be better off training a small model just for them from scratch.



