Convolution filters attend to a region around each pixel; not to every other pix...

andrewfong · on Feb 23, 2024

My understanding of this tech is pretty minimal, so please bear with me, but is the basic idea is something like this?

Before: Evaluate the image in a little region around each pixel against the prompt as a whole -- e.g. how well does a little 10x10 chunk of pixels map to a prompt about a "red sphere and blue cube". This is problematic because maybe all the pixels are red but you can't "see" whether it's the sphere or the cube.

After: Evaluate the image as a whole against chunks of the prompt. So now we're looking at a room, and then we patch in (layer?) a "red sphere" and then do it again with a "blue cube".

Is that roughly the idea?