Language is much more expressive than your fingers for certain tasks (and vice-versa). This design advocates for the use of _both_, not one or the other.
I think it's very unlikely that any significant number of people can type faster than they can speak.
Depending on the complexity of the task, you might very well have to press "hundreds of buttons" to achieve the same result as a single subvocalized voice command. Not to mention that speech can be far more intuitive than keyboard shortcuts or nested sub-menus for certain tasks.
Again, though, this isn't about replacing keyboard input; it's about supplementing it.
Yeah, but I remember spending 20 minutes driving on the highway trying to get Google Assistant to play the next episode of my podcast. (Not the most recent! The NEXT ONE. Gaah.)
It would take 4 taps to do by hand, and it is impossible to do via voice.