"We have X11 for this" just means "we can't do this, use a different environment".
The point is, we should not HAVE to use a different environment. Images and media should be first-class citizens in a CLI just as much as they are in a GUI. There is nothing about a CLI that says it has to only handle text.
> we should not HAVE to use a different environment
Breaking things down into meaningfully separated subsystems is part of why Unix has been successful. What you're suggesting would hugely increase the complexity of SSH. It makes far more sense to use SSH as the transport, with a separate system handling drawing, windows, graphics acceleration, and user input for the GUI.
Not all servers support a graphical environment, and neither do all clients. This allows for lightweight servers and lightweight clients.
> There is nothing about a CLI that says it has to only handle text.
The command line itself should be a relatively simple canvas, not a complex rendering subsystem. It's already rich enough to support TUIs, including mouse-click support. If you want more than that, use a proper GUI.
Do you have a specific gripe against X11? Again, it would be very much against the Unix philosophy to roll that highly complex functionality into the core SSH protocol. The problem would still be just as complex and challenging. Implementing a network-transparent GUI always is. You'd lose the separation of concerns, and you'd end up running two GUI systems rather than one.
If you want a very basic interface over SSH without a full-blown GUI system like X11, you can already use a TUI such as Midnight Commander. You can also preview images on the command line with a tool like imcat. [0]
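To make the image-preview point concrete, here is a minimal sketch of how a terminal can show pixels using nothing but text: each character cell holds an upper-half-block glyph whose foreground color is one pixel and whose background color is the pixel below it. This is only an illustration of the general technique, not a description of how imcat is implemented. The function name `render_rows` and the test gradient are made up for the example, and it assumes a terminal emulator that understands 24-bit color escape sequences.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255

RESET = "\x1b[0m"            # SGR reset: restore default colors
UPPER_HALF_BLOCK = "\u2580"  # glyph filling the top half of a character cell


def render_rows(pixels: List[List[Pixel]]) -> str:
    """Render a grid of RGB pixels as text, two pixel rows per terminal line.

    Hypothetical helper for illustration. The top pixel becomes the
    foreground color and the bottom pixel the background color of a
    half-block glyph.
    """
    lines = []
    for y in range(0, len(pixels), 2):
        top = pixels[y]
        # If the image has an odd number of rows, pad the last one with black.
        bottom = pixels[y + 1] if y + 1 < len(pixels) else [(0, 0, 0)] * len(top)
        cells = []
        for (tr, tg, tb), (br, bg, bb) in zip(top, bottom):
            cells.append(
                f"\x1b[38;2;{tr};{tg};{tb}m"  # 24-bit foreground (top pixel)
                f"\x1b[48;2;{br};{bg};{bb}m"  # 24-bit background (bottom pixel)
                f"{UPPER_HALF_BLOCK}"
            )
        lines.append("".join(cells) + RESET)
    return "\n".join(lines)


if __name__ == "__main__":
    # A small synthetic gradient, so the example needs no image library.
    gradient = [[(x * 16, y * 32, 128) for x in range(16)] for y in range(8)]
    print(render_rows(gradient))
```

Because the output is ordinary bytes on stdout, it travels over SSH like any other text, with no change to the transport. The resolution is limited to two pixels per character cell, which is why this suits previews rather than real graphics work.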