Vision models need the pixel ordering to match the one they were trained on in order to work.
After training, they won't generalize to transformations of the data that they haven't been trained on, even simple ones such as rotations, whereas humans will.
So I would argue that vision models do use the "underlying structure", and even that one of their problems is that they make use of "underlying structures" that are not actually important, such as image luminosity or rotation. I think people usually augment the data with these transformations during preprocessing to enforce invariance, along the lines of the sketch below.
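For concreteness, here is a minimal sketch of that kind of augmentation, assuming a torchvision-style pipeline; the specific transforms and parameter values are illustrative, not something from the original discussion:

```python
import torchvision.transforms as T

# Randomly perturb each training image so the model sees many variants
# of the same content and learns to treat them as equivalent.
augment = T.Compose([
    T.RandomRotation(degrees=15),      # small rotations
    T.ColorJitter(brightness=0.3),     # vary luminosity
    T.RandomHorizontalFlip(p=0.5),     # mirror images
    T.ToTensor(),
])

# Hypothetical usage: pass the transform to a dataset so augmentation is
# applied every time an image is loaded during training.
# dataset = torchvision.datasets.ImageFolder("path/to/images", transform=augment)
```

The point is that the invariance isn't something the architecture gives you for free; it has to be baked in by showing the model the perturbed copies.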