Given enough advances in hardware and software optimization, isn't it reasonable to think that if we connect a language model of this caliber to speech-to-text and image-to-text models on the input side and a robotic control system on the output side, and set up an online, end-to-end reinforcement learning loop, the result would be a convincingly sentient robot, at least on the surface? Or am I underestimating the difficulty of wiring these different models together? Would like to hear from the experts on this.
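To make the plumbing I have in mind concrete, here's a rough sketch of the loop I'm imagining. Every class here is a hypothetical stand-in, not a real API; the point is just the wiring of perception → language model → actuation, with a reward signal fed back for online learning.

```python
# Hypothetical sketch of the proposed pipeline. All components are
# placeholder stubs standing in for real models and robot hardware.

class SpeechToText:
    def transcribe(self, audio):
        return "hello robot"          # would run an ASR model

class ImageToText:
    def caption(self, frame):
        return "a person waving"      # would run a captioning model

class LanguageModel:
    def respond(self, prompt):
        return "wave back"            # would run the LLM

class RobotController:
    def act(self, command):
        return 1.0                    # would actuate and return a reward

def control_step(audio, frame):
    stt, itt, llm, robot = SpeechToText(), ImageToText(), LanguageModel(), RobotController()
    # Fuse the two input modalities into a single text prompt.
    prompt = f"heard: {stt.transcribe(audio)}; seen: {itt.caption(frame)}"
    action = llm.respond(prompt)
    reward = robot.act(action)
    # An online RL update would adjust the language model from `reward` here.
    return action, reward
```

Even in this toy form, the open questions are visible: how the modalities are fused into the prompt, how a text action maps to motor commands, and where the RL gradient actually flows.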