Core to core latency data on large systems (chipsandcheese.com)
94 points by nuriaion on Nov 7, 2023 | 30 comments



The NUMA nature of recent* chips has made me wonder if there’s ever going to be a movement to start using message passing libraries (like MPI) on shared memory machines.

* actually, not even that recent, Zen planted this hope in my brain.


Thread-per-core software architectures are doing this: https://penberg.org/papers/tpc-ancs19.pdf

Real-world examples are ScyllaDB and Redpanda, both built on the Seastar framework (C++: https://seastar.io/message-passing/).

And for Rust there is Glommio: https://www.datadoghq.com/blog/engineering/introducing-glomm...


There is also a thread-per-core implementation for Rust by ByteDance (TikTok), called Monoio, with benchmarks[0] comparing it to Tokio and Glommio.

[0] https://github.com/bytedance/monoio/blob/master/docs/en/benc...


Does thread per core necessarily imply message passing? I don't see why the two need to be related.


The thread-per-core manifesto has a goal of not sharing data between cores, so communication inside the process becomes message passing: handing off ownership of a chunk of data to the recipient core. This lack of sharing is what enables the performance (no locks etc. needed, outside of the message passing itself).
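
A minimal C sketch of the idea (not how Seastar or Glommio actually implement it): two threads pinned to different cores hand a heap-allocated message through a one-slot atomic mailbox, so the only synchronization is the ownership transfer and the data is never accessed by both cores at once.

    #define _GNU_SOURCE             /* for pthread_setaffinity_np */
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    static _Atomic(int *) mailbox;  /* one-slot channel, NULL = empty */

    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void *producer(void *arg) {
        pin_to_core(0);
        int *msg = malloc(sizeof *msg);
        *msg = 42;                  /* fill the buffer while we still own it */
        atomic_store_explicit(&mailbox, msg, memory_order_release);
        return NULL;                /* ownership handed off; never touch msg again */
    }

    static void *consumer(void *arg) {
        pin_to_core(1);
        int *msg;
        while (!(msg = atomic_load_explicit(&mailbox, memory_order_acquire)))
            ;                       /* spin until a message arrives */
        printf("got %d\n", *msg);   /* this core owns the data now */
        free(msg);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
    }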

This is a good watch (first half is pure background, second half talks about the motivation): https://www.youtube.com/watch?v=PbgTyCSDPrs


In HPC it's common to do a mix of MPI (message-passing / distributed memory) and OpenMP (shared memory) parallelism when running on big multicore (and obviously multi-node) machines. It helps with locality, among other things.
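
The standard hybrid pattern, as a toy sketch (compiled with something like mpicc -fopenmp): one MPI rank per node or socket, OpenMP threads sharing memory inside the rank, MPI messages between ranks.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        /* FUNNELED: only the main thread of each rank makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        /* shared-memory parallelism within the rank */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0;

        /* message passing across ranks (and nodes) */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %.0f\n", global);
        MPI_Finalize();
    }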


This is what I do actually, and it works fairly well. Currently I do one MPI process per socket, but mostly just because the OpenMP code I'm calling is a library, and it doesn't seem to scale well past one modern Xeon's worth of cores.

I don't know what I'd do if I had an old Zen machine, maybe map an MPI process to each chiplet.

My impression is that in the first-generation Zen machines, the cost of communicating from one chiplet to another was really quite significant, but they've made good enough progress there that it is only something the really hardcore folks care about.
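
For what it's worth, Open MPI can express a per-chiplet mapping directly: on Zen each CCX/chiplet shares an L3, so mapping ranks by L3 cache approximates one rank per chiplet. A sketch, assuming Open MPI (object names vary by version):

    # one rank per shared L3 domain (~ per CCX/chiplet on Zen)
    mpirun --map-by ppr:1:l3cache --bind-to l3cache ./my_hybrid_app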



Nope, Parallella was the wrong thing at the time and it's still wrong. Cache is good.


> Nope, Parallella was the wrong thing at the time and it's still wrong.

Can you elaborate?


It didn't have DRAM or caches. Programming with scratchpads is so difficult that people just give up.


Scratchpads are the memory equivalent of VLIW.

Processors can extract parallelism dynamically at runtime. They can also manage your memory automatically at runtime. Better yet, they can utilize hardware resources instead of software resources. It is such an obvious win.


Are you saying that scratchpads were like VLIW in the sense that they seemed like a really cool idea, but failed because people didn’t want to manually code things well enough to take advantage of them?


Absolute statements are bad.


A good recent paper on implementing message passing over shared memory:

"Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores"

https://cs.brown.edu/~irina/papers/2013-opodis.pdf


I don't know where you got that idea from. There is a movement in the complete opposite direction with CXL. Don't waste your time with silly libraries, serialisation or networking. Have a rack filled with nothing but pooled RAM and then connect your servers (which still retain local RAM as an L4 cache). You now have a huge shared-memory machine with distributed CPUs using CXL for cache coherence across the entire system. There have been benchmarks that kept 75% of the memory outside the server, and the performance degradation was only 10% compared to keeping the entire data set on a single server.


> There have been benchmarks that kept 75% of the memory outside the server and the performance degradation was only 10% compared to keeping the entire data set on a single server.

Performance degradation would greatly depend on how much data was actually touched by the workload outside the server and not solely by the fact that 75% of the memory was attached through CXL, no?

NUMA latency I measured last time on a dual-socket Xeon (Haswell) system was around 130ns for non-local memory access and 90ns for local memory access. OTOH some numbers I found seem to imply that the CXL latency is ~200ns.

That would put CXL latency roughly 50% above remote NUMA access and more than double local access, so I don't think a mere 10% performance degradation is realistic unless most of the workload fits into L1/L2/L3 cache plus that 25% of local memory, or the workload is CPU-bound rather than memory-bound.
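
For reference, numbers like these usually come from a dependent-load (pointer-chasing) loop. A minimal sketch, assuming Linux with numactl available, so local vs. remote placement can be compared by binding memory to a different node:

    /* Pointer-chase latency sketch: each load depends on the previous one,
     * so time per hop approximates memory latency. Compare placements with:
     *   numactl --cpunodebind=0 --membind=0 ./chase   (local)
     *   numactl --cpunodebind=0 --membind=1 ./chase   (remote)
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (64u * 1024 * 1024 / sizeof(size_t))  /* 64 MiB, well past L3 */
    #define HOPS (1L << 26)

    int main(void) {
        size_t *buf = malloc(N * sizeof *buf);
        for (size_t i = 0; i < N; i++) buf[i] = i;
        /* Sattolo's algorithm: one big random cycle, defeats prefetchers */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = rand() % i, t = buf[i];
            buf[i] = buf[j];
            buf[j] = t;
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (long h = 0; h < HOPS; h++)
            p = buf[p];                                /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns/hop (%zu)\n", ns / HOPS, p);   /* print p so the loop isn't elided */
        return 0;
    }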


I keep thinking that Rust’s borrow semantics would be pretty good for hinting whether code should run on the same core or could be offloaded to another. Two modules that only communicate via small, read only messages could easily be on separate cores.

And on architectures where some cores share faster paths than others, placement could be graded and scheduled accordingly.


IMO, MPI is the wrong level to do this on. Most apps should either be using some form of map-reduce or not using parallelism beyond the NUMA node.


The map-reduce programming model is overly simplified. It cannot express useful primitives such as prefix scan, which is used all the time in parallel algorithms.
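
For instance, MPI exposes it directly as MPI_Scan, an inclusive prefix reduction across ranks; a toy sketch:

    /* Inclusive prefix sum across ranks: rank r receives the sum of the
     * values contributed by ranks 0..r. Map-reduce has no direct analogue. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, val, prefix;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        val = rank + 1;                    /* each rank contributes one value */
        MPI_Scan(&val, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: prefix sum = %d\n", rank, prefix);
        MPI_Finalize();
    }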


It'll be interesting to see how CXL shakes out. It might end up being not much slower than cross-socket access! 150ns to go between sockets is about what we see here, and that's in the realm of what CXL has been promising.

Having a super short lightweight protocol like CXL.mem to talk over such fast fabric has so much killer potential.

These graphs are always such a delight to see. It's a network map of how well connected the cores are, and it reveals so many particular advantages and disadvantages of the greater system architecture.


Back in the days before Oracle, Sun would sell you a dual-socket Opteron desktop, and you could add your own FPGA right on the HyperTransport link in the second socket.

Exciting to see that capability becoming more standardized with CXL.

Edit: phrasing.


I, too, am excited for CXL. Not enough people got to _feel_ the awesome of pmem. I think if more people had, pmem would be in all our laptops, desktops, servers.


I was misreading these charts for too long. Maybe I still am.

Am I seeing that none of these processors implement a toroidal communication path? I thought that was considered basic cluster topology these days, so I'm surprised that multi-core chips don't implement it.


If your chip is fabricated on a flat rectangular piece of silicon, that would involve links running from each edge, across the chip, to the other edge, in both orientations. I can imagine that would be very demanding of chip resources, slow, etc.

If your chip is fabricated on the surface of a torus, or on a rectangle in highly curved space, then it would be a very natural architecture. But I am not aware of any chips that are.


I would presume the first couple of layers of silicon would be wires instead of gates. At least at the edges. Top left to top right, Bottom left to bottom right, top left to bottom left, top right to bottom right.

The middle of the chip could contain logic.


It's almost poetic to have those mid-1990s Pentiums there, with about 2-3x the inter-socket latency of the current state-of-the-art, 30 years later.


I like the end of the article.

>If Pentium could run at 3 GHz and the FSB got a proportional clock speed increase, core to core latency would be just over 20 ns.

Ran the test against my closest equivalent.

    CPU: Intel(R) Celeron(R) G5905T CPU @ 3.30GHz
    Num cores: 2
    Num iterations per samples: 5000
    Num samples: 300

1) CAS latency on a single shared cache line

           0       1   
      0
      1   25±0 

    Min  latency: 25.3ns ±0.2 cores: (1,0)
    Max  latency: 25.3ns ±0.2 cores: (1,0)
    Mean latency: 25.3ns
Just wish I had a dual-socket Pentium for the last 40 years.


If I'm reading this right, socket-to-socket latency hasn't really improved much in a long time. Why?


Very interesting. Now do bandwidth next!



