My recollection of CLIP is that it's more of an image-text co-embedding: two encoders, one that maps images to vectors and one that maps captions to vectors. Through a contrastive loss (positive pairs are matching image-caption pairs, negative pairs are random image-caption combinations), embeddings of positive pairs are pulled together (i.e., made similar) while embeddings of negative pairs are pushed apart (made more dissimilar). A rough sketch of that loss is below.
https://openai.com/index/clip/
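
As a minimal sketch (not OpenAI's actual code), the contrastive objective can be written as a symmetric cross-entropy over a batch's image-caption similarity matrix. The function name, embedding shapes, and temperature value here are illustrative assumptions; the encoders themselves are left abstract.

```python
# Sketch of a CLIP-style contrastive loss, assuming two arbitrary encoders
# that each produce (N, D) embeddings for a batch of N matching pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: the i-th image and i-th caption are the
    positive pair; every other combination in the batch is a negative."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) similarity matrix: logits[i, j] = sim(image_i, caption_j).
    logits = image_emb @ text_emb.t() / temperature

    # Diagonal entries correspond to the matching (positive) pairs.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

The softmax over each row/column is what does the work: maximizing the diagonal probability simultaneously raises similarity for the true pair and suppresses it for the random pairings in the same batch.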