
What makes it native?



Good question.

Perhaps for audio and video it means directly ingesting the spoken sound (audio -> LLM) rather than transcribing the sound to text and feeding that text to the LLM (audio -> text -> LLM), as sketched in the example below.

But to be honest I'm guessing here; perhaps LLM experts (or the LLM itself, since they claim capability comparable to human experts) can verify whether this is truly what they mean by a native multi-modal LLM.
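
A minimal sketch of that distinction, assuming the "native" path quantizes audio into its own discrete tokens that share an embedding space with text. Every constant and function here is a toy stand-in for illustration, not Gemini's actual pipeline:

  # Contrast a cascaded speech pipeline (audio -> text -> LLM) with a
  # "native" one (audio tokens -> LLM). All values are toy stand-ins.
  import numpy as np

  VOCAB_SIZE = 32_000        # assumed text vocabulary size
  AUDIO_CODEBOOK = 4_096     # assumed size of a discrete audio codebook
  D_MODEL = 16               # tiny embedding width for illustration

  rng = np.random.default_rng(0)
  text_embed = rng.normal(size=(VOCAB_SIZE, D_MODEL))
  audio_embed = rng.normal(size=(AUDIO_CODEBOOK, D_MODEL))

  def cascaded_pipeline(waveform):
      """audio -> text -> LLM: an ASR step discards prosody, tone, speaker."""
      transcript_ids = [101, 2054, 2003]      # pretend ASR output token ids
      return text_embed[transcript_ids]       # only text embeddings reach the LLM

  def native_pipeline(waveform):
      """audio -> LLM: audio frames are quantized into their own tokens
      and embedded into the same sequence the transformer consumes."""
      audio_token_ids = np.abs(waveform[:3] * 1000).astype(int) % AUDIO_CODEBOOK
      return audio_embed[audio_token_ids]     # acoustic detail survives

  wave = rng.normal(size=16_000)              # 1 s of fake 16 kHz audio
  print(cascaded_pipeline(wave).shape, native_pipeline(wave).shape)

The cascaded path keeps only what survives transcription, while the native path can in principle preserve acoustic detail such as tone or speaker identity.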


It's highly unlikely that a generative model can reason about language at this level from audio features alone. Gemini may use audio cues, but text tokens would have to be fed into the very early layers of the transformer for complex reasoning to be possible. And because the Gemini paper only mentions a transformer architecture, I don't see a way for them to implement speech-to-text inside such an architecture (while also allowing direct text input). Maybe "native" here means that such a stack of models was trained together.
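
A minimal sketch of that "trained together" reading, assuming a speech encoder whose continuous features are mapped into the LLM's embedding width by a learned adapter, so no intermediate transcript is ever produced. The shapes and the adapter are assumptions for illustration, not anything from the Gemini paper:

  # A speech encoder's continuous features are projected into the LLM's
  # embedding space by an adapter that would be learned jointly with the LLM.
  import numpy as np

  rng = np.random.default_rng(1)
  D_AUDIO, D_MODEL = 80, 16            # assumed feature / embedding widths

  def speech_encoder(waveform, frame=400):
      """Stand-in encoder: one 80-dim feature vector per 400-sample frame."""
      n = len(waveform) // frame
      return rng.normal(size=(n, D_AUDIO))      # (frames, D_AUDIO)

  W_adapter = rng.normal(size=(D_AUDIO, D_MODEL))   # trained jointly with the LLM

  def audio_to_soft_tokens(waveform):
      feats = speech_encoder(waveform)
      return feats @ W_adapter                  # (frames, D_MODEL)

  soft_tokens = audio_to_soft_tokens(rng.normal(size=16_000))
  print(soft_tokens.shape)   # these vectors enter the transformer like any token embedding

If the adapter and the LLM are optimized end-to-end, the stack behaves as one model even though it is built from several components, which is one plausible reading of "native".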


The transformer architecture is not limited to text-based token input, but again I'm not an expert on how this new LLM, namely Gemini, is implemented, or whether text tokens are necessary. If Google has truly cracked native multi-modal input without the limitation of going through text, then it's as novel and revolutionary as they claim it to be.
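
A minimal sketch of that point: self-attention operates on a sequence of embedding vectors and never needs to know which positions came from text and which from audio. The interleaving scheme here is purely illustrative, not Gemini's:

  # Text-token embeddings and audio embeddings interleaved into one sequence;
  # a single self-attention step treats them identically.
  import numpy as np

  rng = np.random.default_rng(2)
  D_MODEL = 16
  text_part = rng.normal(size=(5, D_MODEL))    # e.g. embeddings of "Describe this clip:"
  audio_part = rng.normal(size=(40, D_MODEL))  # projected audio frames

  sequence = np.concatenate([text_part, audio_part], axis=0)  # (45, D_MODEL)

  scores = sequence @ sequence.T / np.sqrt(D_MODEL)
  weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
  out = weights @ sequence
  print(out.shape)   # (45, 16): the layer never needed to know which tokens were text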



