
How is AI hallucinating words now? I thought that would have been the easiest thing to restrict with a sufficient dictionary.

Or maybe it's an ancient dictionary. I was kind of surprised at the sizes of dictionaries I could find while trying to test out a personal project.



IIUC the input to LLMs is tokenized not on word boundaries but on some kind of sub-word boundaries, so whatever the model has associated with "task" also applies to "tasking", "tasked", "taskmaster", etc. A model making up compounds that don't exist would therefore be fully possible and even desirable, especially since real humans do it in English all the time.
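A toy version of that kind of splitting can be sketched as greedy longest-match over a fixed vocabulary. (Real models learn their merges from data, e.g. with BPE; the vocabulary below is invented purely for illustration.)

```python
# Tiny illustrative vocabulary -- NOT a real model's token set.
VOCAB = {"task", "ing", "ed", "master", "un", "re", "s"}

def tokenize(word):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in VOCAB:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary piece matched: emit the single character as a token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("tasking"))     # ['task', 'ing']
print(tokenize("taskmaster"))  # ['task', 'master']
```

Since the model only ever sees pieces like "task" + "master", gluing pieces into a compound no human has written is the same operation as producing a real word.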


They’re called “lemmas”


The intent is the same, but as I understand it, LLMs don't tokenize based on lemmas, though some of the tokens probably line up with them.



