A well-trained 8B model will already be over-saturated with information from the start. It will therefore readily forget old information when fine-tuned on new material. It simply doesn't have the spare capacity to take in much more.
Don't get me wrong. I think a 70B or larger model would be worth fine-tuning, especially if it can be grown further with more layers.
> A well-trained 8B model will already be over-saturated with information from the start
Any evidence of that I can look at? This doesn't match what I've seen, nor have I heard it from the world-class researchers I've worked with. Would be interested to learn more.
Upon further thought, if fine-tuning involves adding layers, then the initial saturation shouldn't matter. Say an 8B model adds 0.8B * 2 = 1.6B parameters' worth of new layers for fine-tuning; under some assumptions, a ballpark is that this could absorb around 16 million articles, i.e. on the order of 100 new parameters per article.
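To make the "adding layers" idea concrete, here's a rough sketch of block expansion (in the spirit of LLaMA Pro-style depth growth, which is my choice, not something stated in this thread) using HuggingFace transformers with a Llama-style checkpoint. The model name, the `expand_every` interval, and the zero-init trick are assumptions for illustration: new blocks are interleaved between the frozen original blocks and zero-initialized so the expanded model starts out behaving exactly like the base model, and only the new parameters get trained.

```python
import copy
import torch
from torch import nn
from transformers import AutoModelForCausalLM

# Hypothetical base checkpoint; any Llama-style model exposing model.model.layers
# works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

expand_every = 4  # assumption: insert one new block after every 4 original blocks

expanded = nn.ModuleList()
for i, layer in enumerate(model.model.layers):
    expanded.append(layer)
    if (i + 1) % expand_every == 0:
        block = copy.deepcopy(layer)
        # Zero the two output projections so the new block's residual contribution
        # starts at zero, i.e. the expanded model initially matches the base model.
        nn.init.zeros_(block.self_attn.o_proj.weight)
        nn.init.zeros_(block.mlp.down_proj.weight)
        expanded.append(block)

model.model.layers = expanded
model.config.num_hidden_layers = len(expanded)
# (For cached generation, each block's self_attn.layer_idx would also need renumbering.)

# Freeze everything, then unfreeze only the inserted blocks: the original 8B of
# weights is never touched, and only the new parameters (for a 32-layer Llama-3-8B,
# 8 blocks x ~218M, roughly 1.7B, in the ballpark of the 1.6B above) get trained.
for p in model.parameters():
    p.requires_grad = False
for i, layer in enumerate(model.model.layers):
    if (i + 1) % (expand_every + 1) == 0:
        for p in layer.parameters():
            p.requires_grad = True
```

Whether ~1.6B fresh parameters can really hold 16 million articles depends on how much each article compresses, but at least the forgetting concern largely goes away, since the base weights stay frozen.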