The NUMA nature of recent* chips has made me wonder if there’s ever going to be a movement to start using message passing libraries (like MPI) on shared memory machines.
* actually, not even that recent, Zen planted this hope in my brain.
There is also another thread-per-core runtime for Rust, Monoio, from ByteDance (TikTok), with benchmarks[0] comparing it to Tokio and Glommio.
The thread-per-core manifesto has a goal of not sharing data between cores, so communication inside the process becomes message passing: handing off ownership of a chunk of data to the recipient core. This lack of sharing is what enables the performance (no locks or other synchronization needed, outside of the message passing itself).
In HPC it's common to do a mix of MPI (message-passing / distributed memory) and OpenMP (shared memory) parallelism when running on big multicore (and obviously multi-node) machines. It helps with locality, among other things.
This is what I do actually, and it works fairly well. Currently I run one MPI process per socket, but mostly just because the OpenMP code I’m calling is a library, and it doesn’t seem to scale well past one modern Xeon’s worth of cores.
I don’t know what I’d do if I had an old Zen machine; maybe map an MPI process to each chiplet.
My impression is that on first-generation Zen machines, the cost of communicating from one chiplet to another was really quite significant, but they’ve made good enough progress there that it is now only something the really hardcore folks care about.
Processors can extract parallelism dynamically at runtime. They can also manage your memory automatically at runtime. Better yet, they can utilize hardware resources instead of software resources. It is such an obvious win.
Are you saying that scratchpads were like VLIW in the sense that they seemed like a really cool idea, but failed because people didn’t want to manually code things well enough to take advantage of them?
I don't know where you got that idea from. There is a movement in the complete opposite direction with CXL. Don't waste your time with silly libraries, serialisation or networking. Have a rack filled with nothing but pooled RAM, then connect your servers (which still retain local RAM as an L4 cache). You now have a huge shared memory machine with distributed CPUs, using CXL for cache coherence across the entire system. There have been benchmarks that kept 75% of the memory outside the server and the performance degradation was only 10% compared to keeping the entire data set on a single server.
> There have been benchmarks that kept 75% of the memory outside the server and the performance degradation was only 10% compared to keeping the entire data set on a single server.
Performance degradation would greatly depend on how much data was actually touched by the workload outside the server and not solely by the fact that 75% of the memory was attached through CXL, no?
The NUMA latency I measured last time on a dual-socket Xeon (Haswell) system was around 130ns for non-local memory access and 90ns for local memory access. OTOH, some numbers I found seem to imply that CXL latency is ~200ns.
This means CXL latency is roughly twice the local access latency and ~50% higher than even remote NUMA access, so I think it is not realistic to see only 10% performance degradation unless most of your workload fits into the L1/L2/L3 caches plus that 25% of local memory, or your workload is CPU-bound rather than memory-bound.
I keep thinking that Rust’s borrow semantics would be pretty good for hinting whether code should run on the same core or could be offloaded to another. Two modules that only communicate via small, read only messages could easily be on separate cores.
And on architectures where some cores share faster paths than others, those gradations could inform scheduling too.
The map-reduce programming model is overly simplified. It cannot express useful primitives such as prefix scan, which is used all the time in parallel algorithms.
It'll be interesting to see how CXL shakes out. It might end up being not much more than cross-socket access! 150ns to go between sockets is about what we see here & is in the realm of what CXL has been promising.
Having a super short lightweight protocol like CXL.mem to talk over such fast fabric has so much killer potential.
These graphs are always such a delight to see. It's a network map of how well connected the cores are, and they reveal so many particular advantages and disadvantages of the greater systems architecture.
Back in the days before Oracle, Sun would sell you a dual-socket Opteron desktop, and you could add your own FPGA right on the HyperTransport bus in the second socket.
Exciting to see that capability becoming more standardized with CXL.
I, too, am excited for CXL. Not enough people got to _feel_ the awesome of pmem. I think if more people had, pmem would be in all our laptops, desktops, servers.
I was misreading these charts for too long. Maybe I still am.
Am I seeing that none of these processors implement a toroidal communication path? I thought that was considered basic cluster topology these days so I’m surprised that multi core chips don’t implement it.
If your chip is fabricated on a flat rectangular piece of silicon, that would involve links running from each edge, across the chip, to the other edge, in both orientations. I can imagine that would be very demanding of chip resources, slow, etc.
If your chip is fabricated on the surface of a torus, or on a rectangle in highly curved space, then it would be a very natural architecture. But I am not aware of any chips that are.
I would presume the first couple of layers of silicon would be wires instead of gates, at least at the edges: top left to top right, bottom left to bottom right, top left to bottom left, top right to bottom right.