I find this video really freaky. It’s like Gemini is a baby or very young child ...

kromem · on Dec 6, 2023

No, there's an odd disconnect between the impressiveness of the multimodal capabilities and the juvenile tone and insights, compared to something like GPT-4, that's very bizarre in application.

It's a great example of something I've found to be a growing concern as we double down on Goodhart's Law with claims like "beats 30 out of 32 tests compared to existing models."

My guess is those tests are very specific to evaluations of what we've historically imagined AI to be good at vs comprehensive tests of human ability and competencies.

So a broad general pretrained model might actually be great at sounding 'human' but not as good at logic puzzles. You then hit it with extensive fine-tuning aimed at improving test scores on logic, no longer targeting "sounding human," and you end up with a model that is extremely good at what you targeted as measurements but sounds like a creepy toddler.

We really need to stop being so afraid of anthropomorphic evaluation of LLMs. Even if the underlying processes shouldn't be anthropomorphized, the expressed results really should be, given that the whole point was modeling and predicting anthropomorphic training data.

"Don't sound like a creepy soulless toddler and sound more like a fellow human" is a perfectly appropriate goal for an enterprise scale LLM, and we shouldn't be afraid of openly setting that as a goal.

nolist_policy · on Dec 6, 2023

https://www.youtube.com/watch?v=PJgo3BBgWDA