Those are extremely uniform latencies. Seems like on these CPUs most of the benefit of NUMA-aware thread pools will come from reduced contention - synchronizing small subsets of cores rather than exploiting actual memory affinity.
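To make that concrete, here is a minimal sketch of the "small subsets of cores" idea: shard work across several small pools instead of one big one, so fewer workers contend on each internal queue. The node count and sharding scheme here are made up for illustration; a real NUMA-aware pool would also pin each pool's threads to a node (e.g. via `os.sched_setaffinity` on Linux), which this sketch omits.

```python
# Hypothetical sketch: shard tasks across per-"node" pools so each
# internal work queue is shared by only WORKERS_PER_NODE threads,
# instead of one global queue contended by all of them.
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 2           # assumed node count, purely illustrative
WORKERS_PER_NODE = 4

# One small pool per node instead of a single 8-worker pool.
pools = [ThreadPoolExecutor(max_workers=WORKERS_PER_NODE)
         for _ in range(NUM_NODES)]

def submit(task_id, fn, *args):
    # Static sharding: each task hashes to one node-local pool.
    return pools[task_id % NUM_NODES].submit(fn, *args)

futures = [submit(i, lambda x: x * x, i) for i in range(8)]
results = sorted(f.result() for f in futures)
print(results)

for p in pools:
    p.shutdown()
```

The point of the uniform-latency measurements above is that the memory-affinity half of this design may buy little on a single socket; the contention reduction is what remains.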
Well, all of the memory hangs off the IO die. I remember AMD's docs outright recommending that you let the processor hide the NUMA nodes from the workload, since trying to optimize for them might not do anything for a lot of workloads.
That AMD slide (in the conclusion) claims their switching fabric has some kind of bypass mode to improve latency when utilisation is low.
So they have been really optimising that IO die for latency.
NUMA is already workload-sensitive: you need to benchmark your exact workload to know whether it's worth enabling, and this change probably makes it even less worthwhile. Sounds like you'd need a workload that really pushes total memory bandwidth for NUMA to pay off.
NUMA is only useful if you have multiple sockets, because then you have several I/O dies and you want your workload 1) to stay close to its memory and I/O devices and 2) to avoid crossing the socket interconnect. Within a single socket, all cores share the same I/O die, hence the uniform latency.