I imagine he means that when you reason in latent space, the final answer is a smooth (differentiable) function of the parameters, so you can use gradient descent to directly optimize the model to produce a desired final output without knowing the correct reasoning steps to get there.
When you reason in token space (like everyone is doing now), sampling a discrete token at each step is a non-differentiable operation, so you have to use some kind of reinforcement learning algorithm to learn the weights.
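To make that concrete, here's a minimal PyTorch sketch (the `step`/`readout` modules, shapes, and target are all made up for illustration): in the latent loop the loss on the final answer backpropagates through every reasoning step, while sampling a token cuts the gradient path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, vocab_size = 16, 100
step = nn.Linear(hidden_dim, hidden_dim)     # one latent "reasoning" step
readout = nn.Linear(hidden_dim, vocab_size)  # hidden state -> token logits

# Latent-space reasoning: feed the hidden state straight back in.
h = torch.randn(1, hidden_dim)
for _ in range(5):
    h = torch.tanh(step(h))                  # every op is differentiable

target = torch.tensor([3])                   # arbitrary "desired final output"
loss = F.cross_entropy(readout(h), target)
loss.backward()                              # gradients flow through all 5 steps

# Token-space reasoning: sample a discrete token after each step.
logits = readout(torch.randn(1, hidden_dim))
token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
# `token` is an integer tensor with no grad_fn: the gradient chain stops
# here, which is why RL-style estimators are used instead.
```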
I think there's a subtlety here about what makes (e.g. English) tokens different from points in latent space. Everything is still differentiable (at least in the ML sense) until you do random sampling. Even then you can exclude the sampling when calculating the gradient (or is this equivalent to the "manifold"?).
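For what it's worth, the "exclude the sampling when calculating the gradient" move sounds like the straight-through estimator. A minimal PyTorch sketch (made-up shapes, placeholder loss): the forward pass uses the hard sample, the backward pass pretends it was the soft distribution.

```python
import torch

logits = torch.randn(1, 100, requires_grad=True)
probs = torch.softmax(logits, dim=-1)

# Sample a one-hot token as usual; this step alone has no gradient.
index = torch.multinomial(probs, num_samples=1)
one_hot = torch.zeros_like(probs).scatter_(1, index, 1.0)

# Straight-through trick: the detach() routes the gradient around the
# sampling, so downstream gradients land on `probs` (and hence `logits`)
# even though the forward value is exactly the hard sample.
st_token = (one_hot - probs).detach() + probs

loss = st_token.sum()           # placeholder downstream loss
loss.backward()
assert logits.grad is not None  # gradient path survives the sampling
```

The resulting gradient is biased, though, which is part of why it's debatable whether this counts as genuinely differentiable reasoning.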
I don't see a priori why it would be better or worse to reason with the "superposition" of arguments in the pre-sampling phase rather than with the concrete realizations that exist only once a token has been chosen. It may well be a contingent rather than a necessary fact.
Could you explain this a bit please?