Illiterate humans can come up with new words like that too without being able to spell; LLMs are similarly modeling language without precisely modeling spelling.
The tokenizer supports virtually any input text you want, so it follows that it can also produce virtually any output text. It isn't limited to a dictionary of the 1,000 most common words or something.
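As a minimal sketch of that claim (using the tiktoken library purely as an example; any byte-level BPE tokenizer behaves the same way), arbitrary text, including invented words, round-trips losslessly through the token vocabulary:

```python
# Sketch: byte-level BPE handles arbitrary text, including made-up words.
# tiktoken and cl100k_base are just illustrative choices, not the only option.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "glorptastic frumblewock"  # invented words, in no dictionary
ids = enc.encode(text)

# Encoding then decoding is lossless, so the model's output tokens can spell any text.
assert enc.decode(ids) == text
print(ids)
```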
There are tokens for individual letters, but the model is not trained on text written with one token per letter; it is trained on text that has been converted into as few tokens as possible. Just as you would get very confused if someone spelled out entire sentences as they spoke to you, expecting you to reconstruct the words from the individual spoken letters, these LLMs would also perform terribly if you sent them one token per letter of input instead of the tokenizer scheme they were trained on.
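To make that concrete, here is a small sketch (again assuming tiktoken just for illustration; the exact splits depend on the tokenizer) comparing the tokens a model actually sees for a word versus that same word spelled out letter by letter:

```python
# Sketch: a common word becomes a handful of subword tokens, not one token per letter.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word_tokens = [enc.decode([t]) for t in enc.encode("strawberry")]
spelled_tokens = [enc.decode([t]) for t in enc.encode("s t r a w b e r r y")]

print(word_tokens)     # a few chunks like ['str', 'aw', 'berry'], not ten letters
print(spelled_tokens)  # spelling it out gives a completely different, longer sequence
```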
Even though you might write a message to an LLM, it is better to think of that as speaking to the LLM. The LLM is effectively hearing words, not reading letters.