But they are already multi-modal. The Google one can do live streaming video understanding with a conversational voice prompt in and out. You can literally walk around with your camera and just chat about the world. No text to be seen (perhaps under the covers it is translating everything to text, but the point is that the user sees no text).
Fair, but OpenAI was doing that half a year ago (though with limited access; I myself only got it maybe a month ago), and I haven't yet seen it translate into anything in practice, so I feel like it (and multimodality in general) must be at a GPT-3 level of ability at this point.
But I do expect the next qualitative change to come from this area. It feels like exactly what is needed, but somehow it isn't quite there yet.