The improvements over ChatGPT are counted in (very) few percentage points. Does this mean they have entered a diminishing-returns phase, or is it that each percentage point is much harder to earn than the previous ones?
> We’re already starting to experiment with Gemini in Search, where it's making our Search Generative Experience (SGE) faster for users, with a 40% reduction in latency in English in the U.S., alongside improvements in quality.
This feels like Google achieved more efficient inference. Probably a leaner model compared to GPT.
Not sure, but you could also look at the inverse: e.g. an improvement from 90% to 95% can also be interpreted as going from a 10% failure rate to a 5% failure rate, i.e. half as many failures, which is a very big improvement. It depends on a lot of things, but it's possible this could feel like a very big improvement in practice.
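That inversion is just arithmetic; a quick sketch (the numbers are illustrative, not from any actual benchmark):

```python
# Same benchmark jump, two framings: small as an accuracy gain,
# large as a reduction in failures. Scores are made up.
old_acc, new_acc = 0.90, 0.95

accuracy_gain = new_acc - old_acc                  # looks like "only 5 points"
failure_ratio = (1 - new_acc) / (1 - old_acc)      # failures cut in half

print(f"accuracy gain: {accuracy_gain:.0%}")       # 5%
print(f"remaining failures: {failure_ratio:.0%} of before")  # 50%
```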
Training large language models is characterised by diminishing returns: the first billion training inputs reduce the loss more than the second billion, the second billion reduce it more than the third, and so on. The same goes for increases in model size; the improvement is less than linear.
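The shape of that curve can be sketched with a toy power-law loss, which is how scaling behaviour is often modelled (the coefficients here are invented for illustration, not fitted to any real model):

```python
# Toy power-law loss curve L(n) = a * n**(-alpha) + c, with made-up
# coefficients. Each additional billion tokens reduces the loss less
# than the previous billion did.
a, alpha, c = 10.0, 0.05, 1.5

def loss(tokens_billions):
    return a * tokens_billions ** -alpha + c

first_gain = loss(1) - loss(2)    # loss drop from the second billion
second_gain = loss(2) - loss(3)   # loss drop from the third billion
print(first_gain, second_gain)    # first_gain > second_gain
```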
It may mean that the evaluations' useful range for distinguishing improvements is limited. If it's a 0-100 score on a defined set of tasks, chosen because they were hard enough to distinguish model quality a while back, the rapid rate of improvement may mean they are no longer useful for distinguishing the quality of current models. And that's even aside from the problem that it is increasingly hard to stop the actual test tasks from being reflected in training data in some form.
Probably just reflects that they are playing catch-up with OpenAI, and it would not look good if they announced that their latest, greatest (to be available soon) was worse than what OpenAI has been shipping for a while. So I assume that being able to claim superiority (by even the smallest amount) over GPT-4 was the gating factor for this announcement.
I doubt LLMs are close to plateauing in terms of performance unless there's already an awful lot more to GPT-4's training than is understood. It seems like even simple stuff like planning ahead (e.g. to fix "hallucinations", aka bullshitting) is still to come.
They want to release immediately to please shareholders, but only if they're beating SOTA on benchmarks. Therefore we will usually get something that beats SOTA by a little bit, because the alternative (short of a huge breakthrough) would be to delay the release longer, which serves no business purpose.