
Yes, I can confirm that's how I read the "just curve fitting" bit.

Regarding the gibberish-word-to-image issue: CLIP uses a text transformer trained by contrastive matching against images. That makes it different from GPT, which is trained to predict the probability of the next word. GPT would easily tell gibberish words from real words, or spot incorrect syntax, because they would be low-probability sequences. CLIP's text transformer doesn't do that because of the task formulation, not because of an intrinsic limitation. It's not so mysterious once you realise they could have used a different approach to get both the text embedding and a gibberish filter if they wanted.
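Rough sketch of the difference, just to make the point concrete: an autoregressive LM gives you a per-token likelihood you could threshold to flag gibberish, while CLIP's contrastive objective only scores image-text agreement. This uses GPT-2 via Hugging Face transformers purely as an illustration; the example strings and the "low vs high" reading are mine, not from DALL-E's actual pipeline.

  import torch
  from transformers import GPT2LMHeadModel, GPT2TokenizerFast

  tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
  model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

  def avg_neg_log_likelihood(text: str) -> float:
      """Mean per-token negative log-likelihood under the LM (lower = more word-like)."""
      ids = tokenizer(text, return_tensors="pt").input_ids
      with torch.no_grad():
          out = model(ids, labels=ids)  # loss = mean NLL over predicted tokens
      return out.loss.item()

  print(avg_neg_log_likelihood("a photo of an apple"))    # comparatively low NLL
  print(avg_neg_log_likelihood("a photo of an apoploe"))  # noticeably higher NLL
  # CLIP's text encoder never models p(next token), so there is no comparable
  # signal it could use to say "this isn't a real word".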

A good analogy is a Rorschach test: show a human an out-of-distribution image and ask them to caption it. They will still say something about the image, just as DALL-E will draw something for a fake word. The human is expected to produce a phrase whether or not the image makes sense, and DALL-E faces a similar demand. The task formulation explains the result.

The mapping from a nonsense word to an image is explained by the continuous embedding space of the prompt and the diffusion model's ability to generate images from noise. Any point in the embedding space, even a random one, falls closer to some concepts and further from others. The "lucky" concepts most similar to that embedding end up steering the image generation.
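To illustrate the "nearest concept wins" point: any vector you hand the generator still has cosine similarities to concept embeddings, and there is no reject option. The vectors below are random stand-ins, not real CLIP embeddings, and the concept names are made up for the example.

  import numpy as np

  rng = np.random.default_rng(0)
  dim = 512
  concepts = {name: rng.normal(size=dim) for name in ["bird", "insect", "fruit", "vehicle"]}

  def nearest_concept(embedding: np.ndarray) -> str:
      def cos(a, b):
          return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
      return max(concepts, key=lambda name: cos(embedding, concepts[name]))

  # Stands in for embedding a gibberish prompt: it is still just a point in the space.
  gibberish_embedding = rng.normal(size=dim)
  print(nearest_concept(gibberish_embedding))
  # The diffusion model simply denoises toward whichever concept region this
  # embedding happens to fall near; nothing in the pipeline can say "no image".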


