
I’m impressed that it’s multimodal and includes audio. GPT-4V doesn’t include audio afaik.

Also I guess I don’t see it as critical that it’s a big leap. It’s more like “That’s a nice model you came up with, you must have worked real hard on it. Oh look, my team can do that too.”

Good for recruiting too. You can work on world class AI at an org that is stable and reliable.





That’s different. It essentially uses the Whisper model for audio-to-text, and that text is then fed into ChatGPT.

Multimodal would be watching YouTube without captions and asking “How did a certain character know it was raining outside?” and getting an answer based on the sound of rain, with no image of rain.


I don't know if it's related to Gemini, but Bard seems to be able to do this by answering questions like "how many cups of sugar are called for in this video". Not sure if it relies on subtitles or not.

From https://bard.google.com/updates:

> Expanding Bard’s understanding of YouTube videos

> What: We're taking the first steps in Bard's ability to understand YouTube videos. For example, if you’re looking for videos on how to make olive oil cake, you can now also ask how many eggs the recipe in the first video requires.

> Why: We’ve heard you want deeper engagement with YouTube videos. So we’re expanding the YouTube Extension to understand some video content so you can have a richer conversation with Bard about it.


Interesting. Will take Bard for a spin.


Ah, that’s right. I guess my question is: is it a true multimodal model (able to produce arbitrary audio), or is it a speech-to-text system (OpenAI has a model called Whisper for this) feeding text to the model and then using text-to-speech to read the answer aloud?

Though now that I am reading the Gemini technical report, it seems it can only receive audio as input; it can’t produce audio as output.

Still, based on a quick glance at their technical report, it seems Gemini might have superior audio input capabilities. I am not sure of this, though, now that I think about it.


One of the demo videos explicitly addresses this point: https://youtu.be/D64QD7Swr3s?si=_bBa9aPmqGbo-Iej


Oh, that’s actually pretty good then. It also seems it does output audio, despite the PDF from Google I was reading saying otherwise. Hmm.


Google is stable and reliable?


They can certainly pretend they are for hiring purposes. Compared to a company that fired its CEO, nearly had the whole company walk out, then saw the board ousted and the CEO restored, Google does look more reliable.

Just don’t speak to xooglers about it. ;)


> Compared to a company that [...]

Time to press some keys on my keyboar-

> Just don’t speak to xooglers about it. ;)

Oh shit, never mind, you get it.


:)



