A bit confusing for sure, but I think (not sure) I get what they're saying. Training a neural net (for visual tasks at least) means fitting a model with many more dimensions (parameters) than the input space (e.g. controller inputs + Atari pixels). This contrasts with a lot of what humans do, which is to take higher-dimensional information (tons of data per second combining visual, audio, touch/vibration, etc.) and synthesize much lower-dimensional models / heuristics / rules of thumb, like the example they give of the 5-seconds-per-mile rule for thunder.
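To make that dimensionality gap concrete, here's a rough sketch (my own illustration, not from the article) assuming a classic DQN-style Atari network as the example: even on preprocessed 84x84 frames, the parameter count dwarfs the input dimensionality.

```python
# Sketch only: assumes a DQN-style Atari network (Mnih et al. style layout)
# and compares its parameter count to the input dimensionality.
import torch
import torch.nn as nn

n_actions = 18  # assumed full Atari action set

model = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
    nn.Linear(512, n_actions),
)

input_dims = 4 * 84 * 84                                  # ~28k input dimensions (4 stacked frames)
param_dims = sum(p.numel() for p in model.parameters())  # ~1.7M parameters

print(f"input dimensions: {input_dims:,}")
print(f"model parameters: {param_dims:,}")
print(f"params / input:   {param_dims / input_dims:.0f}x")
```

Even against the raw 210x160x3 Atari frame (~100k pixels), the model's roughly 1.7M parameters sit well above the input dimensionality, which is the contrast being drawn with the low-dimensional heuristics humans form.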
Nonetheless, yes, we do know about certain brain structures, like in your ImageNet analogy, but the way you describe it sounds a little bit off.
Our visual cortex is not "just a layer"; it's a component, I would say, and it's optimized for detecting things.
Other components work differently and have different structures.