
If you understand how LLMs work, you should disregard tests such as:

- How many 'r's are in Strawberry?

- Finding the fourth word of the response

These tests run counter to how tokenization and next-token prediction work, so they do not accurately represent an LLM's capabilities. It's akin to asking a blind person to identify colors.
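To make the tokenizer point concrete, here's a minimal sketch using the tiktoken library (assuming the cl100k_base encoding; other models use different vocabularies, but the effect is the same):

    import tiktoken

    # Show how a BPE tokenizer actually splits "strawberry".
    enc = tiktoken.get_encoding("cl100k_base")
    for tok in enc.encode("strawberry"):
        print(tok, enc.decode_single_token_bytes(tok))

    # The model sees a handful of multi-character subword tokens,
    # not ten letters, so "count the r's" asks about structure the
    # model never directly observes.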



Ask an LLM to spell "Strawberry" one character per line. Claude's output, for example:

> Here's "strawberry" spelled out one character per line:
>
> s
> t
> r
> a
> w
> b
> e
> r
> r
> y

Most LLMs can handle that perfectly, meaning they can abstract over tokens down to individual characters. Yet most lack the ability to perform the multi-level inference needed to then count the individual 'r's.

From this perspective, I think it's the opposite: something like the strawberry test is a good indicator of how well an LLM can connect steps that are individually easy but not readily interconnected.


The funny thing about those "tests" is that LLMs are judged by their ability to do the task themselves, as opposed to their ability to write code that does it. The best LLMs still fail at doing it directly, because they fundamentally are not designed to do anything except predict tokens. But they absolutely can write code that does it perfectly, and code for many tasks far harder than that.
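The code an LLM would have to produce here is a one-liner; a sketch in Python (the string and letter are just the example from upthread):

    text = "strawberry"
    print(text.lower().count("r"))  # prints 3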


I'm not going to argue these are good tests; if you asked a coworker these questions they'd look at you weird. But what surprised me is how well you can take a sentence never written down before, put it through base64 encoding, and then ask an LLM to decode it. The good models can do this surprisingly well.
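For anyone who wants to reproduce that experiment, a minimal sketch of the setup (the sentence here is a made-up placeholder; the point is to use one that can't appear verbatim in any training data):

    import base64

    # A novel sentence that presumably exists nowhere on the web.
    sentence = "the violet heron audited seventeen umbrellas at dawn"
    encoded = base64.b64encode(sentence.encode("utf-8")).decode("ascii")
    print(encoded)  # paste this into the LLM and ask it to decode it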



