That was my first thought too. I really like the idea of an array of interconnected nodes. There's something biological about thinking in terms of topology and diffusion between neighbours that I find appealing.
Data transfer is slow and power hungry - it's obvious that putting a little bit of compute next to every bit of memory is the way to minimize data transfer distance.
The laws of physics can't be broken, yet people demand more and more performance, so eventually this problem will become worth the difficulty of solving it.
That minimizes the data transfer distance from that bit of memory to that bit of compute. But it increases the distance between that bit of (memory and compute) and all the other bits of (memory and compute). If your problem is bigger than one bit of memory, such a configuration is probably a net loss, because of the increased data transfer distance between all the bits.
Your last paragraph... you're right that, sooner or later, something will have to give. There will be some scale such that, if you create clumps either larger or smaller than that scale, things will only get worse. (But that scale may be problem-dependent...) I agree that sooner or later we will have to do something about it.
Cache hierarchies operate on the principle that the probability of a bit being operated on is inversely proportional to the time since it was last operated on.
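A software caricature of that recency heuristic, i.e. least-recently-used replacement (the sizes and the linear scan are made up for illustration; real hardware tracks recency very differently):

```c
#include <stdio.h>

#define LINES 4

struct line { long tag; long last_used; int valid; };
static struct line cache[LINES];
static long now = 0;

static int access_line(long tag) {
    int victim = 0;
    for (int i = 0; i < LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            cache[i].last_used = ++now;              /* hit: refresh recency */
            return 1;
        }
        if (!cache[i].valid || cache[i].last_used < cache[victim].last_used)
            victim = i;                              /* remember the least recently used slot */
    }
    cache[victim] = (struct line){ tag, ++now, 1 };  /* miss: evict the LRU line */
    return 0;
}

int main(void) {
    long refs[] = { 1, 2, 3, 1, 4, 5, 1, 2 };
    for (int i = 0; i < 8; i++)
        printf("ref %ld -> %s\n", refs[i], access_line(refs[i]) ? "hit" : "miss");
    return 0;
}
```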
Registers can be thought of in this context as just another cache, the memory closest to the compute units for the most frequent operations.
It's possible to have register-less machines (everything expressed as memory-to-memory operations), but it blows up the instruction word length; better to let the compiler do some of the thinking.
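Back-of-envelope illustration of that encoding blow-up (the 5-bit register field and 64-bit address are assumed figures, e.g. a 32-register machine using full virtual addresses as operands):

```c
#include <stdio.h>

int main(void) {
    /* Assumed figures: 32 architectural registers -> 5-bit register number,
     * and a full 64-bit virtual address per memory operand. */
    const int reg_field_bits  = 5;
    const int addr_field_bits = 64;
    const int operands        = 3;   /* dst, src1, src2 for a three-operand add */

    printf("register-register add, operand fields: %3d bits\n",
           operands * reg_field_bits);
    printf("memory-to-memory add,  operand fields: %3d bits\n",
           operands * addr_field_bits);
    return 0;
}
```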
Indeed you can take this further and think of three address spaces:
- Visible register file. 4- to 6-bit address space, up to about 2 KB in size. Virtualized as hidden (hardware) registers. Single-cycle access. Usually little or no access control or fault handling: if it exists, you can read/write it.
- Main memory, 32- to 64-bit address space. Virtualized as caches, main RAM and swap. Access may be as low as 5 cycles for L1d, hundreds for main RAM, up into the millions if you hit the swap file. Straightforward layer of access controls: memory protection, segfault exceptions and so on.
- Far storage, URIs and so on. Variable-length address space, effectively infinite. Arbitrarily long access times, arbitrarily complex access controls and fallbacks.
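As a loose software analogy of those three tiers (all the names, sizes and the file-standing-in-for-far-storage are invented for illustration, not how any real system lays this out):

```c
#include <stdio.h>

/* Tier 1: a tiny fixed array stands in for the register file.
 * Tier 2: a larger array stands in for main memory.
 * Tier 3: a file on disk stands in for far storage. */
#define REG_SLOTS 8
#define MEM_SLOTS 4096

static int regs[REG_SLOTS];
static int mem[MEM_SLOTS];

static int load(long addr) {
    if (addr < REG_SLOTS)
        return regs[addr];                 /* tier 1: tiny address space, cheapest access */
    if (addr < MEM_SLOTS)
        return mem[addr];                  /* tier 2: bigger, slower, protected in real hardware */

    /* tier 3: "far storage" - here just a file lookup keyed by the address */
    int value = 0;
    FILE *f = fopen("far_storage.bin", "rb");
    if (f) {
        fseek(f, (addr - MEM_SLOTS) * (long)sizeof(int), SEEK_SET);
        if (fread(&value, sizeof value, 1, f) != 1)
            value = 0;                     /* miss or short read: arbitrary fallback */
        fclose(f);
    }
    return value;
}

int main(void) {
    regs[3] = 42;
    mem[100] = 7;
    printf("%d %d %d\n", load(3), load(100), load(100000));
    return 0;
}
```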
Long ago I thought that, at least for very generic / task-agnostic operations such as wiping, moving, or duplicating chunks of memory, a chip-on-DIMM could be of use (but maybe this is already the case and I just don't know about it).
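Something like the interface below is roughly what I have in mind; it's a hypothetical sketch (the names are invented, no such standard API that I know of), with a plain CPU fallback standing in for the on-DIMM engine:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical handle for pushing bulk memory operations down to logic on the
 * DIMM itself. A real driver would queue the request to the module; this
 * fallback just runs on the CPU, so the data still crosses the bus. */
struct dimm_offload {
    int available;   /* would be probed at boot on real hardware */
};

static void dimm_wipe(struct dimm_offload *d, void *dst, size_t len) {
    (void)d;
    memset(dst, 0, len);             /* CPU fallback for the sketch */
}

static void dimm_copy(struct dimm_offload *d, void *dst, const void *src, size_t len) {
    (void)d;
    memcpy(dst, src, len);           /* an on-DIMM engine could do this without the CPU touching the data */
}

int main(void) {
    struct dimm_offload dev = { 0 };
    uint8_t a[64], b[64];

    dimm_wipe(&dev, a, sizeof a);     /* wiping */
    a[0] = 1;
    dimm_copy(&dev, b, a, sizeof b);  /* duplicating */
    printf("b[0] = %d\n", b[0]);
    return 0;
}
```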