
But the classic SIMD problem is matrix multiplication, which doesn't need full memory bandwidth (because most of the calculations happen inside the cache).

The question is: what kinds of problems do people have that need 400GB/s of bandwidth on a CPU? Well, probably none, frankly. The bandwidth is really for the iGPU.

The CPU just "might as well" have it, since it's a system-on-a-chip. CPUs usually don't care too much about main-memory bandwidth, because main memory is 50ns+ away in latency (or ~200 clock ticks). So to get a CPU going in any typical capacity, you'll basically want to operate out of L1 / L2 cache.

> Oh, wait, bloody Java GC might be a use for that. (LOL, FML or both).

For example, I know you meant the GC as a joke. But if you think about it, a GC is mostly following pointer->next kinds of operations, which means it's mostly latency bound, not bandwidth bound. It doesn't matter that you can read 400GB/s: your CPU is going to read an 8-byte pointer, wait 50 nanoseconds for the RAM to respond, get the new value, and then read a new 8-byte pointer.

Unless you can fix memory latency (and hint: no one seems to be able to), you'll only be able to hit 160MB/s or so (8 bytes every 50ns). No matter how high your theoretical bandwidth is, you're latency-locked at a much lower value.
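To make that arithmetic concrete, here's a minimal C sketch (purely illustrative) of why pointer chasing is latency-bound: each load depends on the result of the previous one, so the memory system can't overlap them.

    #include <stddef.h>

    struct node {
        struct node *next;
        /* ... payload ... */
    };

    /* Each iteration loads n->next, and the following load can't even be
     * issued until this one returns from RAM (~50ns). The walk moves
     * 8 bytes per 50ns, i.e. ~160MB/s, no matter how much raw bandwidth
     * the memory system offers. */
    size_t walk(const struct node *n) {
        size_t count = 0;
        while (n != NULL) {
            count++;
            n = n->next;
        }
        return count;
    }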




Yeah, the marking phase cannot be efficiently vectorized. But I wonder if it could help with the compacting/copying phase.

Also, to me the process sounds oddly similar to virtual-memory page-table walking. There is currently a RISC-V J extension drafting group; I wonder what they can come up with.


> The question is: what kinds of problems do people have that need 400GB/s of bandwidth on a CPU? Well, probably none, frankly.

It is needed for analytic databases, e.g. ClickHouse: https://presentations.clickhouse.com/meetup53/optimizations/


This is some seriously interesting stuff.

But they are demonstrating with 16 cores + 30 GB/s and 128 cores + 190 GB/s. And to my understanding, they did not really mention what type of computational load they performed. So this does not sound too ridiculous. The M1 Max pairs 8 cores with 400GB/s.


Doesn't prefetching data into the cache help with execution speed here?


How do you prefetch "node->next" where "node" is in a linked list?

Answer: you literally can't. And that's why this kind of coding style will forever be latency bound.

EDIT: Prefetching works when the address can be predicted ahead of time. For example, when your CPU-core is reading "array", then "array+8", then "array+16", you can be pretty damn sure the next thing it wants to read is "array+24", so you prefetch that. There's no need to wait for the CPU to actually issue the command for "array+24", you fetch it even before the code executes.

Now if you have "0x8009230", which points to "0x81105534", which points to "0x92FB220", good luck prefetching that sequence.
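A rough sketch of the contrast in C (using GCC/Clang's __builtin_prefetch; the prefetch distance of 16 is an arbitrary illustrative choice):

    #include <stddef.h>

    /* Streaming access: the next address is predictable, so the hardware
     * prefetcher (or an explicit hint) can fetch ahead of the loop. */
    double sum_array(const double *array, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&array[i + 16]); /* fetch before we need it */
            sum += array[i];
        }
        return sum;
    }

    /* Pointer chasing: node->next is only known once the load of node
     * completes, so there is nothing to prefetch ahead of time. */
    struct node { struct node *next; double value; };

    double sum_list(const struct node *node) {
        double sum = 0.0;
        for (; node != NULL; node = node->next)
            sum += node->value; /* one full memory round-trip per iteration */
        return sum;
    }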

--------

Which is why servers use SMT / hyperthreading, so that the core can "switch" to another thread while waiting out those 50 nanoseconds / 200 cycles or so.


I don't really know how a tracing GC is implemented, but I was thinking it could do some smart memory ordering to land in the same cache line as often as possible.

Thanks for the clarifications :)


Interestingly, earlyish Smalltalk VMs used to keep the object headers in a separate contiguous table.

Part of the problem, though, is that the object-graph walk pretty quickly becomes non-contiguous, regardless of how it's laid out in memory.


But that’s just the marking phase, isn’t it? And most of it can be done fully in parallel. While not all CPU cores can be maxed out with that, more often than not the original problem itself is hard to parallelize to that level, so “wasting” a single core on the GC may very well be worth it.


You prefetch by having node->next as close as possible to node. You do that by using an allocator that tries very hard to ensure this.

That doesn't work so well for a GC, but for specific workloads it can work very nicely.
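A minimal sketch of that idea, assuming a toy bump/arena allocator (names are made up for illustration): nodes are handed out back-to-back from one block, so node->next usually lands in the same or an adjacent cache line, and a sequential walk looks almost like an array walk to the prefetcher.

    #include <stdlib.h>

    struct node { struct node *next; int value; };

    struct arena {
        struct node *slots;     /* one contiguous block of nodes */
        size_t used, capacity;
    };

    /* Hand out nodes in allocation order; consecutive allocations are
     * adjacent in memory, keeping node->next close to node. */
    static struct node *arena_alloc(struct arena *a) {
        if (a->used == a->capacity)
            return NULL;        /* a real allocator would grow or chain blocks */
        return &a->slots[a->used++];
    }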


Yeah, that's a fine point.

I always like pointing out Knuth's dancing links algorithm for exact-cover problems. All "links" in that algorithm are of the form "1 -> 2 -> 3 -> 4 -> 5" at algorithm start.

Then, as the algorithm "guesses" particular coverings, it turns into "1->3->4->5", or "1->4", that is, always monotonically increasing.

As such, no dynamic memory is ever needed. The linked list is "statically" allocated at the start of the program, and always traversed in memory order.

Indeed, Knuth designed the scheme as "imagine doing malloc/free" to remove each link, but then later, "free/malloc" to undo the previous steps (because in exact-cover backtracking, you'll try something, realize it's a dead end, and need to backtrack). Instead of a malloc followed up by a later free, you "just" drop the node out of the linked list, and later reinsert it. So the malloc/free is completely redundant.
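The "drop and reinsert" trick itself is tiny; here's a rough C sketch of the horizontal links (field and function names are illustrative):

    struct dlx_node {
        struct dlx_node *left, *right;
    };

    /* "Free" without freeing: splice x out of its list. x keeps its own
     * left/right pointers, which is exactly what makes the undo possible. */
    void dlx_remove(struct dlx_node *x) {
        x->left->right = x->right;
        x->right->left = x->left;
    }

    /* Backtrack: put x back exactly where it was; no malloc/free needed. */
    void dlx_reinsert(struct dlx_node *x) {
        x->left->right = x;
        x->right->left = x;
    }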

In particular: a given "guess" into an exact-cover problem can only "undo" its backtracking up to the full problem scope. From there, each "guess" only removes possibilities. So you use the "maximum" amount of memory at program start, you "free" (but not really) nodes each time you try a guess, and then you "reinsert" those nodes to backtrack to the original scope of the problem.

Finally, once you realize that, you might as well lay them all out in order, not only for simplicity but also for speed on modern computers (prefetching and all that jazz).

It's a very specific situation, but... it does happen sometimes.



