
If you understand how LLMs work, you should disregard tests such as:

- How many 'r's are in Strawberry?

- Finding the fourth word of the response

These tests run counter to how tokenization and next-token prediction work, so they do not accurately represent an LLM's capabilities. It's akin to asking a blind person to identify colors.
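To make the tokenizer point concrete, here's a minimal sketch using the tiktoken library (assuming the cl100k_base encoding; other models use different vocabularies, but the effect is the same):

    import tiktoken

    # Show how a BPE tokenizer actually splits "strawberry".
    enc = tiktoken.get_encoding("cl100k_base")
    for tok in enc.encode("strawberry"):
        print(tok, enc.decode_single_token_bytes(tok))

    # The model sees a handful of multi-character subword tokens,
    # not ten letters, so "count the r's" asks about structure the
    # model never directly observes.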



Ask an LLM to spell "Strawberry" one character per line. Claude's output, for example:

> Here's "strawberry" spelled out one character per line:
>
> s
> t
> r
> a
> w
> b
> e
> r
> r
> y

Most LLMs can handle that perfectly, meaning they can abstract over tokens down to individual characters. Yet most lack the ability to perform the multi-level inference needed to then count the individual 'r's.

From this perspective, I think it's the opposite: something like the strawberry test is a good indicator of how well an LLM can connect steps that are individually easy but not readily interconnected.


The funny thing about those "tests" is that LLMs are judged by their ability to do the task themselves, as opposed to their ability to write code that does it. The best LLMs still fail at doing it directly, because they fundamentally are not designed to do anything except predict tokens. But they absolutely can write code that does it perfectly, and code for many tasks far harder than that.
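The code an LLM would have to produce here is a one-liner; a sketch in Python (the string and letter are just the example from upthread):

    text = "strawberry"
    print(text.lower().count("r"))  # prints 3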


I'm not going to argue these are good tests; if you asked a coworker these questions they'd look at you weird. But what surprised me is how well you can take a sentence never written down before, put it through base64 encoding, and then ask an LLM to decode it. The good models can do this surprisingly well.
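For anyone who wants to reproduce that experiment, a minimal sketch of the setup (the sentence here is a made-up placeholder; the point is to use one that can't appear verbatim in any training data):

    import base64

    # A novel sentence that presumably exists nowhere on the web.
    sentence = "the violet heron audited seventeen umbrellas at dawn"
    encoded = base64.b64encode(sentence.encode("utf-8")).decode("ascii")
    print(encoded)  # paste this into the LLM and ask it to decode it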



