Huh, interesting, I had just ... assumed CLIP's tokenizer was character based, like GPT's was. At least, I think GPT's is character based?
Is there any reason it couldn't be character based, besides the (presumably very large) increase in resources needed to train and run inference? This is all way out of my league, but it seems like you could get interesting results from this, since (by my caveman understanding) this hypothetical transformer could make some sense of words it had never seen before, like spelling variants or neologisms and such.
I started a proper reply but had to board a plane.
It's actually a byte-pair encoded (BPE) vocabulary: a list of subword strings that includes whole words. BPE isn't character-level, but it can still do the things you mentioned. You can also find common English suffixes listed as separate entries in it.
Thanks for the responses, I really appreciate the help. My only background with ML is playing with LSTMs and simple sequence-to-sequence models back before transformers, and the last few days I've been trying to deep dive as much as I can into the "state-of-the-art". I dislike treating the technology as a magical black box...
GPT (and many other modern NLP models) uses byte-pair encoding. Your summary of the benefits is correct: it can deal with novel words much better.
Byte-pair encoding (BPE) is also more practical than character encoding for arbitrary Unicode text (including emoji): a character vocabulary covering all of Unicode would be enormous, whereas byte-level BPE only needs 256 byte tokens to fall back on.
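To make that concrete, here's a minimal sketch (assuming the Hugging Face transformers library and the public "gpt2" checkpoint; the made-up word is just illustrative) showing byte-level BPE mapping a novel word and an emoji to known pieces rather than an unknown token:

```python
# Minimal sketch, assuming the Hugging Face "transformers" library and the
# public "gpt2" checkpoint are installed; "transformified" is a made-up word.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE tokenizer

for text in ["transformer", "transformified", "🙂"]:
    pieces = tok.tokenize(text)              # subword pieces, never <unk>
    ids = tok.convert_tokens_to_ids(pieces)
    print(text, "->", pieces, ids)

# The novel word comes back as several known fragments (roughly a
# "transform"-like piece plus suffix pieces), and the emoji is covered by
# byte-level fallback tokens instead of an unknown-token placeholder.
```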
CLIP uses a BPE vocabulary. From the paper: "The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size."
So strictly this vocabulary is NOT (just) words; it is made of common character sequences built up by merging byte pairs. You can see this if you examine the vocabulary: you'll find entries like "tive", which isn't a word but is a very common English suffix.
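You can poke at this yourself. Here's a small sketch assuming the Hugging Face transformers CLIP tokenizer; the checkpoint name, the example words, and the exact suffix token strings checked are just illustrative assumptions:

```python
# Small sketch, assuming the Hugging Face "transformers" library and the
# "openai/clip-vit-base-patch32" checkpoint; example words are illustrative.
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# A familiar word vs. a made-up one: each maps to one or more BPE pieces,
# so even the made-up word resolves to known sub-units, not <unk>.
print(tok.tokenize("informative"))
print(tok.tokenize("informatizing"))   # hypothetical neologism

# Check whether a common suffix shows up as its own vocabulary entry
# ("</w>" marks CLIP's end-of-word variant of a token).
vocab = tok.get_vocab()                # maps token string -> id
print("tive</w>" in vocab, "tive" in vocab)
```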