Now that context length seems abundant for most tasks, I'm wondering why sub-word tokens are still used. I'm really curious how character-based LLMs would compare. With 2M context, the compute bottleneck fades away. I'm not sure, though, what role the vocabulary size plays. Maybe a large vocabulary is critical, since the embeddings already contain a big chunk of the knowledge. On the other hand, a character-based vocabulary would, I think, solve multiple problems, like glitch tokens and possibly things like arithmetic and rhyming capabilities. Implementing sub-word tokenizers correctly and training them also seems quite complex. On a character level this should be trivial.
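To make the "trivial" part concrete, here is roughly what a character (or byte) level tokenizer could look like. The byte-level vocabulary and the lack of special tokens are simplifying assumptions for illustration, not how any particular model does it:

```python
# Minimal byte-level "tokenizer" sketch (illustrative assumptions only).
# The vocabulary is just every byte value 0-255, so there is nothing to
# train and no merge table to get wrong.

class ByteTokenizer:
    vocab_size = 256  # one id per byte; a real model would add special tokens

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8", errors="replace")

tok = ByteTokenizer()
ids = tok.encode("glitch tokens aren't a thing here")
assert tok.decode(ids) == "glitch tokens aren't a thing here"
```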
The attention mechanism is vastly more efficient to train when it can attend to larger, more meaningful tokens. For inference servers, a significant amount of memory goes into the KV cache, and as you note, building up a representation through attention would then require correlating far more tokens, each of which is "less meaningful".
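For a rough sense of scale, here is a back-of-the-envelope sketch of KV-cache memory versus sequence length. The layer/head/dim numbers and the ~4 characters-per-token figure are assumptions picked for illustration, not measurements from any specific model:

```python
# The KV cache stores one key and one value vector per layer per attended
# position, so its memory grows linearly with sequence length. Character-level
# input multiplies that length by roughly the average token length in
# characters (~4 for English subword vocabularies).

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes  # 2x for K and V

subword = kv_cache_bytes(128_000)       # ~128k subword tokens
chars   = kv_cache_bytes(128_000 * 4)   # the same text at ~4 characters per token
print(f"subword: {subword / 2**30:.1f} GiB, chars: {chars / 2**30:.1f} GiB")
```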
I think we may get to this point eventually, in the limit we will want multimodal LLMs that understand images and sounds down to the pixel and frequency, and it seems like for text, too, we will eventually want that as well.
Maybe you could just use a good old 1D CNN for the bottom 3-4 layers. By that point the model would have been able to combine characters into roughly token-length chunks anyway.
Just make sure to have some big MLPs at the start too, to enrich the "tokens" with the information currently stored in the embedding tables.
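A minimal sketch of that kind of front end, assuming PyTorch; the kernel sizes, the stride-4 "chunking", and the MLP width are all arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class CharFrontEnd(nn.Module):
    """Sketch: embed raw bytes, let a small stack of 1D convolutions merge
    neighbouring characters into roughly token-sized chunks, then widen with
    an MLP so the chunks can carry the kind of information that would
    otherwise live in a large embedding table."""

    def __init__(self, d_model=1024, char_dim=64, chunk=4):
        super().__init__()
        self.char_emb = nn.Embedding(256, char_dim)      # byte-level "vocabulary"
        self.convs = nn.Sequential(
            nn.Conv1d(char_dim, d_model, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.GELU(),
            # strided conv downsamples ~`chunk` characters into one position
            nn.Conv1d(d_model, d_model, kernel_size=chunk, stride=chunk),
        )
        self.mlp = nn.Sequential(                        # the "big MLP" at the start
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, byte_ids):                         # (batch, n_chars)
        x = self.char_emb(byte_ids).transpose(1, 2)      # (batch, char_dim, n_chars)
        x = self.convs(x).transpose(1, 2)                # (batch, n_chars // chunk, d_model)
        return self.mlp(x)                               # hand this to the transformer stack

frontend = CharFrontEnd()
out = frontend(torch.randint(0, 256, (2, 256)))          # out.shape == (2, 64, 1024)
```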
The PagedAttention paper is a good starting point, as it describes the first major open-source inference engine (vLLM) that had "pretty good" batch performance for transformers.
Characters are not the semantic components of words; syllables are. Generally speaking, anyway. I've got to imagine tokenizing on syllables would yield higher-quality results than working from the Roman alphabet. I'm curious whether this could be tested by just looking at how LLMs handle English vs Chinese.
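One crude version of that test is just comparing how coarsely an off-the-shelf subword tokenizer segments parallel English and Chinese text; a sketch using tiktoken, where the sentences and the choice of encoding are arbitrary, and which only shows segmentation granularity rather than output quality:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The cat sat on the mat.",
    "Chinese": "猫坐在垫子上。",  # rough translation of the same sentence
}
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {len(text)} chars -> {n_tokens} tokens "
          f"({len(text) / n_tokens:.1f} chars per token)")
```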
The minimal semantic parts of words are morphemes. Syllables are phonological units (roughly: the minimal unit for rhythmic purposes such as stress, etc.).
Only in languages that have morphemes! This is hardly a universal attribute of language so much as an attribute of those that use an alphabet to encode sounds. It makes more sense to just bypass the encoding and directly consider the speech.
Besides, treating morphemes as semantic units often yields a completely different meaning than the one we actually intend. We aren't trying to train a chatbot to speak in prefixes and suffixes; we're trying to train a chatbot to speak in natural language, even if it is encoded to Latin script before output.
That's technically wrong. Every language has morphemes for the simple reason that every word is at least one morpheme. `cat` is a morpheme. `cats` is two morphemes (cat-s).
(The point about semantics is also technically wrong. You would first need to specify your view of semantic compositionality before such a point can be evaluated, but the usual views of semantics don't have any such consequence.)
> Every language has morphemes for the simple reason that every word is at least one morpheme.
Sure, if you define "morpheme" as a collection of syllables that's meaningful to people using alphabetic script. I don't see any benefit to this compared to working with syllables directly, which is a meaningful concept regardless of the script used to encode them.
> Sure, if you define “morpheme” as a collection of syllables
Cats, as noted, has two morphemes despite having only one syllable. Syllables and morphemes are largely orthogonal: a morpheme can be less than, equal to, or more than a syllable (and even when more than, it may or may not start or end on a syllable boundary).
(Also, syllables aren’t the minimal units even of spoken speech; those are phonemes – a syllable consists of at least one phoneme, potentially more. But morphemes, even in an alphabetic script that isn’t perfectly phonetic, still don’t necessarily map to one or more phonemes, since a textual semantic unit may have no effect on pronunciation.)
1. Latency, which would get worse if you have to sequentially generate more output tokens.
2. These models very roughly turn each token into an "average meaning" at the embedding layer, followed by attention layers that combine those meanings, and feed-forward layers that match the current combination of meanings to some kind of learned archetype/prototype. When you move from word parts to characters, all of that becomes murkier (what's the average meaning of "a"?), so I don't think there are good enough techniques to learn character-based models yet. A rough sketch of that pipeline follows below.
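As a sketch of that pipeline (all the dimensions are arbitrary, and this is a single attention block rather than a full model):

```python
import torch
import torch.nn as nn

vocab_size, d_model, n_heads = 50_000, 512, 8

embed = nn.Embedding(vocab_size, d_model)          # one "average meaning" per token id
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ffn = nn.Sequential(                               # loosely, the prototype-matching step
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)

token_ids = torch.randint(0, vocab_size, (1, 16))  # a 16-token sequence
x = embed(token_ids)                               # look up per-token meanings
x, _ = attn(x, x, x)                               # combine the meanings in context
x = ffn(x)                                         # x.shape == (1, 16, 512)

# With characters, vocab_size drops to ~100-256, but each embedding row has to
# stand in for every context that character appears in (the "average meaning
# of 'a'" problem), and the sequences get several times longer.
```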
In AI music generation we get much better results with large vocabulary sizes, on the order of 10^6. My uneducated guess is that this is because transformers are not universal pattern recognizers; they can only catch patterns at a certain level of granularity.