Given enough advances in hardware and software optimization, isn't it reasonable to think that if we connect a language model of this caliber to speech-to-text and image-to-text models on the input side and a robotic control system on the output side, and set up an online, end-to-end reinforcement learning loop, the result would be a convincingly sentient robot, at least on the surface? Or am I underestimating the difficulty of wiring these different models together? Would like to hear from the experts on this.
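To make the plumbing I have in mind concrete, here's a rough sketch of the loop I'm imagining. Every class here is a hypothetical stand-in, not a real API; the point is just the wiring of perception → language model → actuation, with a reward signal fed back for online learning.

```python
# Hypothetical sketch of the proposed pipeline. All components are
# placeholder stubs standing in for real models and robot hardware.

class SpeechToText:
    def transcribe(self, audio):
        return "hello robot"          # would run an ASR model

class ImageToText:
    def caption(self, frame):
        return "a person waving"      # would run a captioning model

class LanguageModel:
    def respond(self, prompt):
        return "wave back"            # would run the LLM

class RobotController:
    def act(self, command):
        return 1.0                    # would actuate and return a reward

def control_step(audio, frame):
    stt, itt, llm, robot = SpeechToText(), ImageToText(), LanguageModel(), RobotController()
    # Fuse the two input modalities into a single text prompt.
    prompt = f"heard: {stt.transcribe(audio)}; seen: {itt.caption(frame)}"
    action = llm.respond(prompt)
    reward = robot.act(action)
    # An online RL update would adjust the language model from `reward` here.
    return action, reward
```

Even in this toy form, the open questions are visible: how the modalities are fused into the prompt, how a text action maps to motor commands, and where the RL gradient actually flows.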