
>If you insert characters to break the tokens down, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?

The issue is that humans don't talk like this. I don't ask someone how many r's there are in strawberry by spelling out strawberry; I just say the word.



Humans also constantly make mistakes that are due to proximity in their internal representation. "Could of"/"Should of" comes to mind: the letters "of" have a large edit distance from "'ve", but their pronunciation is very similar.

Native speakers are especially prone to this mistake, since they grew up learning English from sounds alone, as children who couldn't yet read, whereas most people who learn English as a second language pick it up together with its written form.

Psychologists use this kind of trick to probe internal representations as well; the Rorschach test, for example.

And probably, if you asked random people on the street how many p's there are in "Philippines", you'd also get lots of wrong answers. It's tricky because of the double p and because the initial p is part of an f sound. The demonym starts with an "F", and in many languages, Spanish for example, the country name does too.


Until I was ~12, I thought 'a lot' was a single word.



Oh I thought essay was some kind of abbreviation for S.A. - short article maybe…



Atleast you learnt.


Yeah, but for most people, it would be because they don't know how to spell "Philippines" at all. Confoundingly, LLMs know exactly how to spell "strawberry" and still get this wrong.


> I don't ask someone how many r's there are in strawberry by spelling out strawberry; I just say the word.

No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?

I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.

A useless benchmark is useless.

In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.


An AI failing at a task that any human can easily do means the AI isn't human-equivalent, in an easily demonstrable way.

The full artificial intelligence we're being promised shouldn't fall short in such a simple way.


Count the number of occurrences of the letter e in the word "enterprise".

Problems can exist as instances of a class of problems. If you can't solve a problem, it's useful to know whether it's a one-off or belongs to a larger class, and which class that is. In this case, the strawberry problem belongs to the much larger class of tokenization problems: if you think you've solved the tokenization problem class, you can test a model on the strawberry problem, along with a few other examples from the class at large, and be confident that you've solved the class generally.

It's not about embodied human constraints or how humans do things; it's about what AI can and can't do. Right now, because of tokenization, things like the number of r's in strawberry are outside the LLM's implicit model of the word, with downstream effects on tasks it can complete. This affects moderation, parsing, generating prose, and all sorts of unexpected tasks. Having a workaround, like forcing the model to insert spaces and operate on explicitly delimited text, is useful when affected tasks appear.
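As a concrete illustration, here's a minimal sketch of what the tokenizer actually hands the model, using OpenAI's tiktoken library (cl100k_base encoding assumed; exact splits vary by tokenizer):

    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for text in ["strawberry", "s t r a w b e r r y"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{text!r} -> {len(ids)} tokens: {pieces}")

    # The bare word collapses into a few opaque chunks, so individual
    # letters never reach the model; the delimited version forces
    # roughly one token per letter, which is why spelled-out prompts
    # sidestep the problem.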


Humans would also be very likely to guess 2 r's if they had never seen any written words or never had the word spelled out to them as individual letters, which is fairly close to how language models treat it, despite being a textual interface.


> Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

We are also not exactly looking letter by letter at everything we read.


Not exactly the same thing, but I actually didn't expect this to work.

https://chatgpt.com/share/4298efbf-1c29-474a-b333-c6cc1a3ce3...


On the other hand, explain to me how you are able to read the word “spotvoxilhapentosh”.


Just because we normally aren't reading letter by letter doesn't mean we can't. We can recognize common words on sight, ignoring minor variations, because we've seen them thousands or millions of times, but that doesn't somehow disable the much less frequently used ability to approach a brand-new word.


I think that humans indeed identify words as a whole and do not read letter by letter.

However, this implies you need to know the word to begin with.

I can write "asdf" and you might be oblivious to what I mean. I can mention "adsf" to a JavaScript developer and he will immediately think of the tool version manager. Because context and familiarity are important.


I believe it's a bit more nuanced than that. Short, ubiquitous words like "and" or "the" we instantly recognize at a glance, but long, unfamiliar, or rarer words we read from the beginning, one syllable or letter at a time, until pattern recognition from memory kicks in. All unconsciously, unless the word is so odd, out of place, misspelled, or unknown that it reaches conscious awareness and interrupts our reading.


"spot"

"vox"

"il"

"ha"

"pen"

"tosh"

is how I read it.

A lot of schools teach kids to read with a syllabic method... so... super close to the tokenization concept.
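You can watch a BPE tokenizer do something very similar with the made-up word; a small sketch using OpenAI's tiktoken (cl100k_base assumed, and the resulting split is whatever the merge table produces, not necessarily real syllables):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("spotvoxilhapentosh")
    # For an unknown word, BPE falls back to frequent subword chunks,
    # much like sounding out syllables.
    print([enc.decode([i]) for i in ids])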


It's not a human. I imagine if you have a use case where counting characters is critical, it would be trivial to programmatically transform prompts into lists of letters.
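Something like this trivial sketch (the delimiter is arbitrary):

    def spell_out(word: str, sep: str = "/") -> str:
        """Turn 'strawberry' into 'S/T/R/A/W/B/E/R/R/Y' before prompting."""
        return sep.join(word.upper())

    print(spell_out("strawberry"))  # S/T/R/A/W/B/E/R/R/Y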

A token is roughly four letters [1], so, among other probable regressions, this would significantly reduce the effective context window.

[1] https://help.openai.com/en/articles/4936856-what-are-tokens-...
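To put rough numbers on that, a quick sketch with OpenAI's tiktoken (cl100k_base assumed; ratios vary by text and tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "In real life, people overwhelmingly do not need LLMs to count letters."
    spelled = " ".join(text)  # a space between every character

    for label, t in [("normal", text), ("letter-split", spelled)]:
        n = len(enc.encode(t))
        print(f"{label}: {len(t)} chars -> {n} tokens (~{len(t) / n:.1f} chars/token)")

    # The letter-split version burns roughly one token per character,
    # so the same content costs about 4x the context window.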


This is the kind of task you'd just use a bash one-liner for, right? An LLM is just the wrong tool for the job.
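(For the record, the Python equivalent is a one-liner too:)

    print("strawberry".count("r"))  # 3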


Humans do chain-of-thought.

User: Write “strawberry” one letter at a time, with a space between each letter. Then count how many r’s are in strawberry.

gpt-3.5-turbo: s t r a w b e r r y

There are 2 r's in strawberry.

After some experimenting, it seems like the actual problem is that many LLMs can’t count.

User: How many r’s are in the following sequence of letters:

S/T/R/A/W/B/E/R/R/Y

gpt-4o-mini: In the sequence S/T/R/A/W/B/E/R/R/Y, there are 2 occurrences of the letter "R."

Oddly, if I change a bunch of the non-R letters, I seem to start getting the right answer.
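For anyone who wants to reproduce this, a minimal sketch against the OpenAI Python SDK (model and prompt copied from above; assumes OPENAI_API_KEY is set in the environment):

    # pip install openai
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "How many r's are in the following sequence of letters:\n"
                       "S/T/R/A/W/B/E/R/R/Y",
        }],
    )
    print(resp.choices[0].message.content)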


>I don't ask someone how many r's there are in strawberry by spelling out strawberry; I just say the word.

You don't ask a human being how many r's there are in strawberry at all. The only reason you or anyone else asks that question is because it's an interesting quirk of how LLMs work that they struggle to answer it in that format. It's like an alien repeatedly showing humans an optical illusion that relies on the existence of our (literal) blind spot and using it as evidence of our supposed lack of intelligence.


This is only an issue if you send commands to an LLM as if you were communicating with a human.


> This is only an issue if you send commands to an LLM as if you were communicating with a human.

Yes, it's an issue. We want the convenience of sending human-legible commands to LLMs and getting back human-readable responses. That's the entire value proposition lol.


Far from the entire value proposition. Chatbots are just one use of LLMs, and not the most useful one at that. But sure, the one "the public" is most aware of. As opposed to "the hackers" that are supposed to frequent this forum. LOL



