It's already multimodal, as entropy is... entropy. In sound, vision, touch, and more, the essence of universal symmetry and physical law gets through, so the AI can generalize across information patterns, not just text. Think of it all as input.
Try prompts like: https://news.ycombinator.com/item?id=35510705
Encode sounds, images, etc. in a low-resolution text form, and the LLM will be able to describe directions in an image, points in time in a song, and so on. A rough sketch of the image case is below.
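A minimal sketch of that idea, assuming Pillow is installed: downsample an image to a small ASCII grid you can paste straight into a prompt. The character ramp and grid width are arbitrary choices, not anything the model requires; the same trick works for audio if you reduce it to, say, a string of amplitude levels.

    # sketch: image -> low-res ASCII grid for pasting into an LLM prompt
    from PIL import Image

    CHARS = " .:-=+*#%@"  # 10-step luminance-to-character ramp (arbitrary)

    def image_to_ascii(path: str, width: int = 40) -> str:
        img = Image.open(path).convert("L")  # grayscale
        # halve the height to roughly compensate for tall text glyphs
        height = max(1, int(img.height / img.width * width * 0.5))
        img = img.resize((width, height))
        rows = []
        for y in range(height):
            row = ""
            for x in range(width):
                lum = img.getpixel((x, y))  # 0..255
                row += CHARS[lum * (len(CHARS) - 1) // 255]
            rows.append(row)
        return "\n".join(rows)

    print(image_to_ascii("photo.png"))  # paste the output into a prompt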
These LLMs can spit out an ASCII image of text, or a different language, or code, etc. They understand the difference between a representation and the object it represents.