There were two step changes: ChatGPT/GPT-3.5, and GPT-4. Everything after feels incremental. But that's perhaps understandable. GPT-4 established just how many tasks could be done by such models: approximately anything that involves, or could be adjusted to involve, text. That was the categorical milestone GPT-4 crossed. Everything since has been about slowly increasing model capabilities, which translates into which tasks can actually be done in practice, reliably, to acceptable standards. Gradual improvement is all that's left now.
Basically what progress on everything ever looks like.
The next huge jump will have to be another qualitative change, such as enabling AI to handle a new class of tasks - tasks that fundamentally cannot be represented in text form in any sensible fashion.
But they are already multimodal. Google's model can do live streaming video understanding with conversational voice input and output. You can literally walk around with your camera and just chat about the world. No text to be seen (though perhaps under the covers it is translating everything to text; the point is the user sees no text).
Fair, but OpenAI was doing that half a year ago (though with limited access; I myself only got it maybe a month ago), and I haven't seen it translate into anything in practice yet, so I feel like it (and multimodality in general) must be at a GPT-3 level of ability at this point.
But I do expect the next qualitative change to come from this area. It feels exactly like what is needed, but it somehow isn't there just yet.