
I don't think the underlying model is word based; I think it's character based. You could download the caption data for LAION and grep that, but it's not strictly 1:1 with what SD was trained against.
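(If you want to try the caption-grepping route, here's a rough Python sketch, assuming you've already downloaded one of the LAION caption-metadata parquet shards locally; the filename and the "TEXT" column name are assumptions and may differ between releases.)

    # Rough sketch: search a locally downloaded LAION caption shard for a phrase.
    # "laion_captions_shard_0000.parquet" is a placeholder path, and the caption
    # column name ("TEXT") may differ depending on which metadata release you grab.
    import pandas as pd

    df = pd.read_parquet("laion_captions_shard_0000.parquet")
    hits = df[df["TEXT"].str.contains("astronaut", case=False, na=False)]
    print(len(hits), "captions contain the term")
    print(hits["TEXT"].head())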



No, it's word based.

The vocabulary is here: https://huggingface.co/openai/clip-vit-base-patch32/resolve/...

It is contextual though, so words in different orders mean different things.
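(If you'd rather not read the raw JSON, here's a quick sketch using the Hugging Face transformers CLIPTokenizer to poke at the same vocabulary; nothing below is specific to Stable Diffusion, it just loads the published CLIP tokenizer.)

    # Sketch: load the CLIP tokenizer and inspect its vocabulary.
    # Requires `pip install transformers`.
    from transformers import CLIPTokenizer

    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    vocab = tok.get_vocab()            # dict of token string -> integer id
    print(len(vocab), "entries")       # roughly 49k entries

    # Common whole words show up as single entries ("</w>" marks end-of-word).
    print("horse</w>" in vocab)

    # Tokenize a prompt; the contextual model sees the tokens in order.
    print(tok.tokenize("a photo of an astronaut riding a horse"))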


Huh, interesting, I had just ... assumed CLIP's tokenizer was character based, like GPT's was. At least, I think GPT's is character based?

Is there any reason it couldn't be character based, besides the (presumably very large) increase in resources needed to train and run inference? This is all way out of my league, but it seems like you could get interesting results from this, since (by my caveman understanding) this hypothetical transformer could make some sense of words it had never seen before, such as spelling variants or neologisms.


I started a proper reply but had to board a plane.

It's actually a byte-pair encoded (BPE; it's more efficient than character encoding but can still do the things you mentioned) vocabulary that includes whole words. You can find common English suffixes listed separately in it too.


Thanks for the responses, I really appreciate the help. My only background with ML is playing with LSTMs and simple sequence-to-sequence models back before transformers, and the last few days I've been trying to deep dive as much as I can into the "state-of-the-art". I dislike treating the technology as a magical black box...


Here's the response I half wrote before:

GPT (and many other modern NLP models) use byte-pair encoding. Your summary of the benefits of this is correct - it can deal with novel words much better.

Byte-pair encoding (BPE) is better than character encoding because it can deal with unicode (and emojis).

CLIP uses a BPE encoding of the vocabulary. From the paper: "The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size."

So strictly speaking this vocabulary is NOT (just) words; it's common sequences of byte pairs. You can see this if you examine the vocabulary - you'll find things like "tive", which isn't a word but is a very common English suffix.
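(To make that concrete, here's a small sketch with the Hugging Face CLIPTokenizer: a made-up word should still tokenize, just into several smaller pieces, and you can grep the vocabulary for suffix-like fragments.)

    from transformers import CLIPTokenizer

    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    # A word the model has never seen still tokenizes; BPE falls back to
    # smaller known pieces instead of producing an unknown token.
    print(tok.tokenize("flibbertigibbetish"))

    # Grep the vocabulary for entries containing "tive" - expect a mix of
    # whole words and bare sub-word fragments.
    vocab = tok.get_vocab()
    print([t for t in vocab if "tive" in t][:15])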


Thank you. This is really helpful. Yes, you don't know exactly how SD will respond, but you can, for example, grep for celebrity names and find out whether SD has any chance of drawing a picture with them in it, rather than just guessing blindly.


It's a word list, so as I'm sure you've already figured out, you have to grep first and last names separately. For example, "jennifer" as a first name is token 19786, while "garner</w>" is token 20340. If you want "james garner" instead, looks like that's tokens 6963 and 20340. Except, since it's a word list, there's still no guarantee that either celebrity is necessarily represented until you try.
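(Here's a sketch of that lookup with the Hugging Face CLIPTokenizer, checking each part of a name separately; entries can appear with or without the "</w>" end-of-word marker, so it checks both. As noted, a hit in the vocab still doesn't guarantee the model can actually draw the person.)

    from transformers import CLIPTokenizer

    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    vocab = tok.get_vocab()  # token string -> id

    def check_name(name):
        # Look up each word of the name on its own, with and without the
        # "</w>" end-of-word marker, since entries can appear in either form.
        for word in name.lower().split():
            print(word, vocab.get(word), vocab.get(word + "</w>"))

    check_name("jennifer garner")
    check_name("james garner")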



