Huh, interesting, I had just ... assumed CLIP's tokenizer was character based, like GPT's was. At least, I think GPT's is character based?
Is there any reason it couldn't be character based, besides the (presumably very large) increase in resources needed to train and run inference? This is all way out of my league, but it seems like you could get interesting results from this, since (by my caveman understanding) this hypothetical transformer could make some sense of words it had never seen before, like spelling variants or neologisms and such.
I started a proper reply but had to board a plane.
It's actually a byte-pair encoded (BPE) vocabulary: a list of subword strings that includes whole words. BPE isn't character-level, but it can still do the things you mentioned. You can also find common English suffixes listed as separate entries in it.
Thanks for the responses, I really appreciate the help. My only background with ML is playing with LSTMs and simple sequence-to-sequence models back before transformers, and the last few days I've been trying to deep dive as much as I can into the "state-of-the-art". I dislike treating the technology as a magical black box...
GPT (and many other modern NLP models) uses byte-pair encoding. Your summary of the benefits is correct: it can deal with novel words much better.
Byte-pair encoding (BPE) is also more practical than character encoding for arbitrary Unicode text (including emoji): a character vocabulary covering all of Unicode would be enormous, whereas byte-level BPE only needs 256 byte tokens to fall back on.
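To make that concrete, here's a minimal sketch (assuming the Hugging Face transformers library and the public "gpt2" checkpoint; the made-up word is just illustrative) showing byte-level BPE mapping a novel word and an emoji to known pieces rather than an unknown token:

```python
# Minimal sketch, assuming the Hugging Face "transformers" library and the
# public "gpt2" checkpoint are installed; "transformified" is a made-up word.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE tokenizer

for text in ["transformer", "transformified", "🙂"]:
    pieces = tok.tokenize(text)              # subword pieces, never <unk>
    ids = tok.convert_tokens_to_ids(pieces)
    print(text, "->", pieces, ids)

# The novel word comes back as several known fragments (roughly a
# "transform"-like piece plus suffix pieces), and the emoji is covered by
# byte-level fallback tokens instead of an unknown-token placeholder.
```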
CLIP uses a BPE vocabulary. From the paper: "The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size."
So strictly this vocabulary is NOT (just) words; it is made of common character sequences built up by merging byte pairs. You can see this if you examine the vocabulary: you'll find entries like "tive", which isn't a word but is a very common English suffix.
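You can poke at this yourself. Here's a small sketch assuming the Hugging Face transformers CLIP tokenizer; the checkpoint name, the example words, and the exact suffix token strings checked are just illustrative assumptions:

```python
# Small sketch, assuming the Hugging Face "transformers" library and the
# "openai/clip-vit-base-patch32" checkpoint; example words are illustrative.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# A familiar word vs. a made-up one: each maps to one or more BPE pieces,
# so even the made-up word resolves to known sub-units, not <unk>.
print(tok.tokenize("informative"))
print(tok.tokenize("informatizing"))   # hypothetical neologism

# Check whether a common suffix shows up as its own vocabulary entry
# ("</w>" marks CLIP's end-of-word variant of a token).
vocab = tok.get_vocab()                # maps token string -> id
print("tive</w>" in vocab, "tive" in vocab)
```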