I would like to hear more about this. I was quite surprised by how difficult things were when I started dabbling in OpenGL, and I thought there had to be a better way. I know there are libraries that build on top of OpenGL and the like, but then it's always a sacrifice of the power you could have. It seems weird to me that it is so difficult, because conceptually the model could be closer to the CPU/memory model that everyone is already familiar with: you just have some RAM and some processor(s) that are going to do some computations, right? Although I guess what really makes it a mess is that there needs to be a connection between what the GPU and the CPU are doing. I don't know, I was a bit surprised by how difficult it was. Perhaps I just don't understand it well enough.
To attempt to explain (desktop) GPU architecture: you don't just have memory and a bunch of individual cores on a GPU like you would on a CPU. You've got memory, texture sampling units, various other fetch units, fixed-function blending/output units, raster units, a dispatcher and then a ton of processing elements. These are all things the programmer needs to set up (through the graphics API). Each of those processing elements runs several warps (wavefronts in AMD terminology), each of which contains 32 or 64 threads (vendor-dependent) that all have their own set of registers. The warp holds the actual instruction stream and can issue operations that occur on all or some of those threads. Branching is possible, but expensive unless every thread in the warp takes the same path: a divergent branch means running both sides over the whole warp under an execution mask. So the programming styles/models are incompatible from the start.
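To make the branching point concrete, here's a rough sketch in plain C (my own toy model, not real GPU code) of how a warp handles a divergent branch: it evaluates the condition for every lane, then runs both sides of the branch with a mask deciding which lanes actually keep each result. The warp size and the toy computation are invented for illustration.

    #include <stdio.h>
    #include <stdint.h>

    #define WARP_SIZE 32

    /* Toy model of one warp executing:
     *     if (x[lane] > 0) y = x * 2; else y = -x;
     * The hardware runs BOTH paths over all 32 lanes and uses an
     * execution mask so each lane only commits the result of the
     * path it actually took. Only if all lanes agree can one path
     * be skipped entirely. */
    void warp_execute(const int *x, int *y)
    {
        uint32_t mask = 0;

        /* Evaluate the condition for every lane to build the mask. */
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if (x[lane] > 0)
                mask |= 1u << lane;

        /* "Then" side: skipped only if no lane took it. */
        if (mask != 0)
            for (int lane = 0; lane < WARP_SIZE; lane++)
                if (mask & (1u << lane))
                    y[lane] = x[lane] * 2;

        /* "Else" side: skipped only if every lane took the other path.
         * With divergence, the warp pays for both passes. */
        if (mask != 0xFFFFFFFFu)
            for (int lane = 0; lane < WARP_SIZE; lane++)
                if (!(mask & (1u << lane)))
                    y[lane] = -x[lane];
    }

    int main(void)
    {
        int x[WARP_SIZE], y[WARP_SIZE];
        for (int i = 0; i < WARP_SIZE; i++)
            x[i] = (i % 2) ? i : -i;   /* deliberately divergent input */
        warp_execute(x, y);
        for (int i = 0; i < WARP_SIZE; i++)
            printf("%d ", y[i]);
        printf("\n");
        return 0;
    }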
Then the real problem is, since all shader invocations share those fixed-function units, if you need to reconfigure them to use a different set of textures, buffers, shaders, etc., you have to bring the whole operation to a complete halt, reconfigure it and restart it. And, contrary to popular belief, GPUs are the exact opposite of fast - each individual shader invocation takes an enormous amount of time to run; latency is traded for throughput. Stopping that thing means having to wait for the last pieces of work to trickle through (and then, when starting back up, waiting for enough work to be pushed through that all the hardware can be used efficiently), which means a lot of time doing little work.
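This is why renderers sort their draws by state. A hedged sketch in C with OpenGL of what that looks like (the Draw struct, the comparator and submit() are my own invention; it assumes a GL 3.3+ context and a function loader like glad already set up): rebind the program/texture only when it actually changes instead of per object, so the expensive reconfigurations happen as rarely as possible.

    /* Assumes a GL 3.3+ context and a loader such as glad. */
    #include <glad/glad.h>
    #include <stdlib.h>

    /* Hypothetical per-draw record for the example. */
    typedef struct {
        GLuint  program;  /* shader program to use      */
        GLuint  texture;  /* texture to sample          */
        GLuint  vao;      /* vertex array object        */
        GLsizei count;    /* number of vertices to draw */
    } Draw;

    static int by_state(const void *a, const void *b)
    {
        const Draw *da = a, *db = b;
        if (da->program != db->program) return da->program < db->program ? -1 : 1;
        if (da->texture != db->texture) return da->texture < db->texture ? -1 : 1;
        return 0;
    }

    /* Sort by (program, texture), then only touch pipeline state when
     * it actually changes, instead of reconfiguring for every object. */
    void submit(Draw *draws, size_t n)
    {
        qsort(draws, n, sizeof *draws, by_state);

        GLuint cur_prog = 0, cur_tex = 0;
        for (size_t i = 0; i < n; i++) {
            if (draws[i].program != cur_prog) {
                cur_prog = draws[i].program;
                glUseProgram(cur_prog);            /* the expensive switch */
            }
            if (draws[i].texture != cur_tex) {
                cur_tex = draws[i].texture;
                glBindTexture(GL_TEXTURE_2D, cur_tex);
            }
            glBindVertexArray(draws[i].vao);
            glDrawArrays(GL_TRIANGLES, 0, draws[i].count);
        }
    }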
So if you're trying to deal with the above, any notion of keeping things separate and clean (in terms of what the hardware sees, anyway) immediately goes out the window. That's why things like virtual texturing exist - to let you more or less pack every single texture you need into one gargantuan texture and draw as much as possible with a single God-shader (and also because heavy reliance on textures tends to work well on consoles). Then you also have to manage to make good use of those fixed-function units (which is where tiled rasterizers on mobile GPUs can become a problem), but that's a relatively separate thing.
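For flavour, here's the "one gargantuan texture" idea in its simplest form: a texture array, not proper virtual texturing with page tables and streaming, which is much more involved. Everything except the GL calls themselves (glTexStorage3D, glTexSubImage3D) is made up for the example; the placeholder pixel data stands in for whatever your asset loader produces.

    /* Assumes a GL 4.2+ context and a loader such as glad. */
    #include <glad/glad.h>
    #include <stdlib.h>
    #include <string.h>

    #define TEX_SIZE   1024
    #define NUM_LAYERS 256   /* every texture in the scene, same size */

    GLuint build_texture_array(void)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D_ARRAY, tex);

        /* One immutable allocation holding the whole scene's textures. */
        glTexStorage3D(GL_TEXTURE_2D_ARRAY, 1, GL_RGBA8,
                       TEX_SIZE, TEX_SIZE, NUM_LAYERS);

        /* Placeholder pixel data; in reality this comes from your assets. */
        unsigned char *pixels = malloc((size_t)TEX_SIZE * TEX_SIZE * 4);
        for (int layer = 0; layer < NUM_LAYERS; layer++) {
            memset(pixels, layer, (size_t)TEX_SIZE * TEX_SIZE * 4);
            glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0,
                            0, 0, layer,            /* x, y, layer offset  */
                            TEX_SIZE, TEX_SIZE, 1,  /* one layer at a time */
                            GL_RGBA, GL_UNSIGNED_BYTE, pixels);
        }
        free(pixels);

        /* A single shader then picks the layer per draw, e.g. in GLSL:
         *   uniform sampler2DArray atlas;
         *   ... texture(atlas, vec3(uv, float(material_id)));
         * so the texture binding never has to change between draws. */
        return tex;
    }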
Also: transferring data back and forth in itself isn't necessarily that bad in my experience (just finicky); it's usually the delays and synchronization that get you.
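The usual OpenGL-side mitigation is to make readbacks asynchronous: queue the copy into a pixel buffer object, drop a fence, and only map the buffer once the fence has signalled, so the CPU never sits blocked on the GPU. A rough sketch under those assumptions (the Readback struct and function names are mine; the GL calls are standard GL 3.2-era sync/PBO usage):

    /* Assumes a GL 3.2+ context and a loader such as glad. */
    #include <glad/glad.h>
    #include <string.h>

    typedef struct {
        GLuint  pbo;     /* pixel pack buffer the GPU copies into */
        GLsync  fence;   /* signalled when the copy has finished  */
        GLsizei size;
    } Readback;

    /* Start an asynchronous readback of the framebuffer; returns immediately. */
    void start_readback(Readback *rb, int width, int height)
    {
        rb->size = width * height * 4;
        glGenBuffers(1, &rb->pbo);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, rb->pbo);
        glBufferData(GL_PIXEL_PACK_BUFFER, rb->size, NULL, GL_STREAM_READ);

        /* With a pack PBO bound, glReadPixels just queues a GPU-side copy. */
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

        rb->fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glFlush();   /* make sure the commands (and the fence) reach the GPU */
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    }

    /* Poll this a frame or two later; copies the pixels out and returns 1
     * only if the GPU is done, so the CPU never blocks waiting for it. */
    int finish_readback(Readback *rb, void *dst)
    {
        GLenum status = glClientWaitSync(rb->fence, 0, 0);  /* timeout 0 = poll */
        if (status != GL_ALREADY_SIGNALED && status != GL_CONDITION_SATISFIED)
            return 0;   /* not ready yet, try again next frame */

        glBindBuffer(GL_PIXEL_PACK_BUFFER, rb->pbo);
        void *src = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, rb->size, GL_MAP_READ_BIT);
        if (src)
            memcpy(dst, src, rb->size);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

        glDeleteSync(rb->fence);
        glDeleteBuffers(1, &rb->pbo);
        return 1;
    }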