It's highly unlikely that a generative model could reason about language at this level based on audio features alone. Gemini may use audio cues, but text tokens would have to be fed into the very early layers of the transformer for complex reasoning to be possible. And since the Gemini paper only mentions a transformer architecture, I don't see how they could implement speech-to-text inside such an architecture while also allowing direct text input. Maybe "native" here just means that such a stack of models was trained together.
The transformer architecture isn't limited to text-based token input, but again I'm not an expert on how this new LLM, namely Gemini, is implemented, or whether text tokens are actually necessary. If Google has truly cracked native multimodal input without the limitation of text-based input, then it really is as novel and revolutionary as they claim.
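To make the "not limited to text tokens" point concrete, here is a minimal sketch (PyTorch; all dimensions and module choices are my own assumptions, not anything from the Gemini paper): audio frames are projected into the same embedding space as the text vocabulary and simply concatenated with text embeddings before the transformer layers, so the model never needs an explicit speech-to-text step internally.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only
D_MODEL = 512      # shared embedding width
VOCAB = 32_000     # text vocabulary size
N_MEL = 80         # mel-spectrogram bins per audio frame

text_embed = nn.Embedding(VOCAB, D_MODEL)   # text token IDs -> embeddings
audio_proj = nn.Linear(N_MEL, D_MODEL)      # audio frames -> same embedding space

# Stand-in for the transformer stack (a real decoder would add causal masking, etc.)
transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
    num_layers=6,
)

text_ids = torch.randint(0, VOCAB, (1, 16))   # dummy text prompt
audio_frames = torch.randn(1, 100, N_MEL)     # dummy audio features

# The transformer only ever sees a sequence of d_model vectors; whether a
# position came from audio or text is invisible to the attention layers.
inputs = torch.cat([audio_proj(audio_frames), text_embed(text_ids)], dim=1)
hidden = transformer(inputs)                  # shape: (1, 116, 512)
```

Under this view, "native multimodal" would just mean the audio projection and the transformer are trained jointly end to end, rather than chaining a separate ASR model in front of a text-only LLM.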