Now that context length seems abundant for most tasks, I'm wondering why sub-word tokens are still used. I'm really curious how character-based LLMs would compare. With 2M context, the compute bottleneck fades away. I'm not sure, though, what role the vocabulary size plays. Maybe a large vocabulary is critical, since the embeddings already contain a big chunk of the knowledge. On the other hand, a character-based vocabulary would, I think, solve multiple problems, like glitch tokens and possibly things like arithmetic and rhyming capabilities. Implementing sub-word tokenizers correctly and training them also seems quite complex. On a character level this should be trivial.
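To make the "trivial" part concrete, here is roughly what a character (or byte) level tokenizer could look like. The byte-level vocabulary and the lack of special tokens are simplifying assumptions for illustration, not how any particular model does it:

```python
# Minimal byte-level "tokenizer" sketch (illustrative assumptions only).
# The vocabulary is just every byte value 0-255, so there is nothing to
# train and no merge table to get wrong.

class ByteTokenizer:
    vocab_size = 256  # one id per byte; a real model would add special tokens

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8", errors="replace")

tok = ByteTokenizer()
ids = tok.encode("glitch tokens aren't a thing here")
assert tok.decode(ids) == "glitch tokens aren't a thing here"
```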
The attention mechanism is vastly more efficient to train when it can attend to larger, more meaningful tokens. For inference servers, a significant amount of memory goes into the KV cache, and as you note, building up a representation through attention would then require correlating far more tokens, each of which is "less meaningful".
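For a rough sense of scale, here is a back-of-the-envelope sketch of KV-cache memory versus sequence length. The layer/head/dim numbers and the ~4 characters-per-token figure are assumptions picked for illustration, not measurements from any specific model:

```python
# The KV cache stores one key and one value vector per layer per attended
# position, so its memory grows linearly with sequence length. Character-level
# input multiplies that length by roughly the average token length in
# characters (~4 for English subword vocabularies).

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes  # 2x for K and V

subword = kv_cache_bytes(128_000)       # ~128k subword tokens
chars   = kv_cache_bytes(128_000 * 4)   # the same text at ~4 characters per token
print(f"subword: {subword / 2**30:.1f} GiB, chars: {chars / 2**30:.1f} GiB")
```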
I think we may get to this point eventually, in the limit we will want multimodal LLMs that understand images and sounds down to the pixel and frequency, and it seems like for text, too, we will eventually want that as well.
Maybe you could just use a good old 1D CNN for the bottom 3-4 layers. By that point the model would have been able to combine characters into roughly token-length chunks anyway.
Just make sure to have some big MLPs at the start too, to enrich the "tokens" with the information currently stored in the embedding tables.
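A minimal sketch of that kind of front end, assuming PyTorch; the kernel sizes, the stride-4 "chunking", and the MLP width are all arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class CharFrontEnd(nn.Module):
    """Sketch: embed raw bytes, let a small stack of 1D convolutions merge
    neighbouring characters into roughly token-sized chunks, then widen with
    an MLP so the chunks can carry the kind of information that would
    otherwise live in a large embedding table."""

    def __init__(self, d_model=1024, char_dim=64, chunk=4):
        super().__init__()
        self.char_emb = nn.Embedding(256, char_dim)      # byte-level "vocabulary"
        self.convs = nn.Sequential(
            nn.Conv1d(char_dim, d_model, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.GELU(),
            # strided conv downsamples ~`chunk` characters into one position
            nn.Conv1d(d_model, d_model, kernel_size=chunk, stride=chunk),
        )
        self.mlp = nn.Sequential(                        # the "big MLP" at the start
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, byte_ids):                         # (batch, n_chars)
        x = self.char_emb(byte_ids).transpose(1, 2)      # (batch, char_dim, n_chars)
        x = self.convs(x).transpose(1, 2)                # (batch, n_chars // chunk, d_model)
        return self.mlp(x)                               # hand this to the transformer stack

frontend = CharFrontEnd()
out = frontend(torch.randint(0, 256, (2, 256)))          # out.shape == (2, 64, 1024)
```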
The PagedAttention paper is a good starting point, as it describes the first major open-source inference engine (vLLM) that had "pretty good" batch performance for transformers.
Characters are not the semantic components of words; syllables are. Generally speaking, anyway. I've got to imagine tokenizing on syllables would yield higher-quality results than working from the Roman alphabet. I'm curious whether this could be tested by just looking at how LLMs handle English vs Chinese.
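One crude version of that test is just comparing how coarsely an off-the-shelf subword tokenizer segments parallel English and Chinese text; a sketch using tiktoken, where the sentences and the choice of encoding are arbitrary, and which only shows segmentation granularity rather than output quality:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The cat sat on the mat.",
    "Chinese": "猫坐在垫子上。",  # rough translation of the same sentence
}
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {len(text)} chars -> {n_tokens} tokens "
          f"({len(text) / n_tokens:.1f} chars per token)")
```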
The minimal semantic parts of words are morphemes. Syllables are phonological units (roughly: the minimal unit for rhythmic purposes such as stress, etc.).
Only in languages that have morphemes! This is hardly a universal attribute of language so much as an attribute of those that use an alphabet to encode sounds. It makes more sense to just bypass the encoding and directly consider the speech.
Besides, treating morphemes as semantic units often yields a completely different meaning than the one we actually intend. We aren't trying to train a chatbot to speak in prefixes and suffixes; we're trying to train a chatbot to speak in natural language, even if it is encoded to Latin script before output.
That's technically wrong. Every language has morphemes for the simple reason that every word is at least one morpheme. `cat` is a morpheme. `cats` is two morphemes (cat-s).
(The point about semantics is also technically wrong. You would first need to specify your view of semantic compositionality before such a point can be evaluated, but the usual views of semantics don't have any such consequence.)
> Every language has morphemes for the simple reason that every word is at least one morpheme.
Sure, if you define "morpheme" as a collection of syllables that's meaningful to people using alphabetic script. I don't see any benefit to this compared to working with syllables directly, which is a meaningful concept regardless of the script used to encode them.
> Sure, if you define “morpheme” as a collection of syllables
Cats, as noted, has two morphemes despite having only one syllable. Syllables and morphemes are largely orthogonal: a morpheme can be less than, equal to, or more than a syllable (and even when more than, it may or may not start or end on a syllable boundary).
(Also, syllables aren’t the minimal units even of spoken speech; those are phonemes – a syllable consists of at least one phoneme, potentially more. But morphemes, even in an alphabetic script that isn’t perfectly phonetic, still don’t necessarily map to one or more phonemes, since a textual semantic unit may have no effect on pronunciation.)
1. Latency, which would get worse if you have to sequentially generate more output tokens.
2. These models very roughly turn each token into an "average meaning" at the embedding layer, followed by attention layers that combine those meanings, and feed-forward layers that match the current combination of meanings to some kind of learned archetype/prototype. When you move from word parts to characters, all of that becomes murkier (what's the average meaning of "a"?), so I don't think there are good enough techniques to learn character-based models yet. A rough sketch of that pipeline follows below.
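As a sketch of that pipeline (all the dimensions are arbitrary, and this is a single attention block rather than a full model):

```python
import torch
import torch.nn as nn

vocab_size, d_model, n_heads = 50_000, 512, 8

embed = nn.Embedding(vocab_size, d_model)          # one "average meaning" per token id
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ffn = nn.Sequential(                               # loosely, the prototype-matching step
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)

token_ids = torch.randint(0, vocab_size, (1, 16))  # a 16-token sequence
x = embed(token_ids)                               # look up per-token meanings
x, _ = attn(x, x, x)                               # combine the meanings in context
x = ffn(x)                                         # x.shape == (1, 16, 512)

# With characters, vocab_size drops to ~100-256, but each embedding row has to
# stand in for every context that character appears in (the "average meaning
# of 'a'" problem), and the sequences get several times longer.
```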
In AI music generation we get much better results with large vocabulary sizes, on the order of 10^6. My uneducated guess is that this is because transformers are not universal pattern recognizers; they can only catch patterns at a certain level of granularity.