They've only shown it working with static content - they'll need to do it with video (multiple synchronised cameras) and in real time for any robotics application.
It'd be interesting to see what happens if they encode an additional time parameter on each 'view' (input image pixel). Surely someone is already trying to extend the technique that way.
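
To make the time-parameter idea concrete, here is a rough sketch of what I mean, not anything taken from their method. It assumes each input pixel is described by a row of numeric view parameters (the layout is hypothetical), appends a capture timestamp as one more column, and then applies a sinusoidal encoding, which is a common choice that I'm assuming here rather than something they describe. The names `add_time` and `sinusoidal_encode` are made up for illustration.

```python
import numpy as np

def add_time(view_params, t):
    """Append a timestamp column to per-pixel view parameters.

    view_params: (N, D) array, one row per input pixel (hypothetical layout).
    t: scalar or (N,) array of capture times.
    """
    view_params = np.asarray(view_params, dtype=float)
    # Broadcast a single timestamp to every pixel, or use per-pixel times as given.
    t = np.broadcast_to(np.asarray(t, dtype=float), (view_params.shape[0],))
    return np.concatenate([view_params, t[:, None]], axis=1)

def sinusoidal_encode(x, num_freqs=4):
    """Encode every column with sin/cos at doubling frequencies (assumed encoding)."""
    freqs = 2.0 ** np.arange(num_freqs)              # shape (F,)
    scaled = x[..., None] * freqs                    # shape (N, D, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (N, D, 2F)
    return enc.reshape(x.shape[0], -1)               # (N, D * 2F)

# Three pixels from one frame, five view parameters each, captured at t = 0.5.
pixels = np.zeros((3, 5))
timed = add_time(pixels, 0.5)
encoded = sinusoidal_encode(timed)
```

With synchronised cameras, every pixel from the same frame would share one `t`, so the model would see time as just another input dimension alongside the view parameters.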