I'm not sure whether the videos are representative of real-life performance or just a marketing stunt, but it sure looks impressive. Reminds me of the robot arm in Iron Man 1.
It's an impressive demo, but perhaps you're misremembering Jarvis from Iron Man, which is not only far faster but effectively a full AGI system even at that point in the story.
Sorry if this feels pedantic; perhaps it is. But it's the kind of analogy that invites pedantry from fans of that movie.
The robot arms in the movie are implied to have their own AIs driving them; Tony speaks to the malfunctioning one directly several times throughout the movie.
Jarvis is AGI, yes, but is not what's being referred to here.
Not specifically trained on it, but the vision models have most likely seen it. Vision models like Gemini Flash/Pro are already good at vision tasks on phones[1], like clicking on UI elements and scrolling to find things. The planning of which steps to perform is also quite good with the Pro model (slightly worse than GPT-4o, in my opinion).
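For what it's worth, the "clicking on UI elements" part of these agents usually boils down to parsing a coordinate out of the model's text reply and mapping it onto the screen. Here's a minimal sketch of that glue step; the `click(x, y)` reply format and the `parse_click` helper are assumptions for illustration, not any specific vendor's API:

```python
import re

def parse_click(model_reply: str, screen_w: int, screen_h: int):
    """Parse a hypothetical vision-model reply like 'click(0.50, 0.25)'
    (normalized 0-1 coordinates) into absolute screen pixels."""
    m = re.search(r"click\(\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\)", model_reply)
    if not m:
        return None  # model didn't produce a click action
    x, y = float(m.group(1)), float(m.group(2))
    return round(x * screen_w), round(y * screen_h)

# e.g. on a 1080x2400 phone screen:
print(parse_click("click(0.5, 0.25)", 1080, 2400))  # -> (540, 600)
```

The coordinates would then be fed to something like `adb shell input tap`, with a fresh screenshot sent back to the model for the next step of the loop.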