
But the classic SIMD problem is matrix multiplication, which doesn't need full memory bandwidth (because most of the calculations happen inside the cache).

The question is: what kinds of problems do people have that need 400GB/s of bandwidth on a CPU? Well, probably none, frankly. The bandwidth is really for the iGPU.

The CPU just "might as well" have it, since it's a system-on-a-chip. CPUs usually don't care too much about main-memory bandwidth, because main memory is 50ns+ away in latency (or ~200 clock ticks). So to get a CPU going in any typical capacity, you'll basically want to operate out of L1 / L2 cache.

> Oh, wait, bloody Java GC might be a use for that. (LOL, FML or both).

For example, I know you meant the GC as a joke. But if you think about it, a GC is mostly following pointer->next kinds of operations, which means it's mostly latency bound, not bandwidth bound. It doesn't matter that you can read 400GB/s: your CPU is going to read an 8-byte pointer, wait 50 nanoseconds for the RAM to respond, get the new value, and then read a new 8-byte pointer.

Unless you can fix memory latency (and hint: no one seems to be able to), you'll only be able to hit 160MB/s or so (8 bytes every 50ns). No matter how high your theoretical bandwidth is, you're latency-locked at a much lower value.
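To make that arithmetic concrete, here's a minimal C sketch (purely illustrative) of why pointer chasing is latency-bound: each load depends on the result of the previous one, so the memory system can't overlap them.

    #include <stddef.h>

    struct node {
        struct node *next;
        /* ... payload ... */
    };

    /* Each iteration loads n->next, and the following load can't even be
     * issued until this one returns from RAM (~50ns). The walk moves
     * 8 bytes per 50ns, i.e. ~160MB/s, no matter how much raw bandwidth
     * the memory system offers. */
    size_t walk(const struct node *n) {
        size_t count = 0;
        while (n != NULL) {
            count++;
            n = n->next;
        }
        return count;
    }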




Yeah, the marking phase cannot be efficiently vectorized. But I wonder if it could help with the compacting/copying phase.

Also, to me the process sounds oddly similar to virtual-memory page-table walking. There is currently a RISC-V J extension drafting group; I wonder what they can come up with.


> The question is: what kinds of problems do people have that need 400GB/s of bandwidth on a CPU? Well, probably none, frankly.

It is needed for analytic databases, e.g. ClickHouse: https://presentations.clickhouse.com/meetup53/optimizations/


This is some seriously interesting stuff.

But they are demonstrating with 16 cores + 30 GB/s and 128 cores + 190 GB/s. And to my understanding, they did not really mention what type of computational load they performed. So this does not sound too ridiculous. The M1 Max pairs 8 cores with 400GB/s.


Doesn't prefetching data into the cache help with execution speed here?


How do you prefetch "node->next" where "node" is in a linked list?

Answer: you literally can't. And that's why this kind of coding style will forever be latency bound.

EDIT: Prefetching works when the address can be predicted ahead of time. For example, when your CPU-core is reading "array", then "array+8", then "array+16", you can be pretty damn sure the next thing it wants to read is "array+24", so you prefetch that. There's no need to wait for the CPU to actually issue the command for "array+24", you fetch it even before the code executes.

Now if you have "0x8009230", which points to "0x81105534", which points to "0x92FB220", good luck prefetching that sequence.
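A rough sketch of the contrast in C (using GCC/Clang's __builtin_prefetch; the prefetch distance of 16 is an arbitrary illustrative choice):

    #include <stddef.h>

    /* Streaming access: the next address is predictable, so the hardware
     * prefetcher (or an explicit hint) can fetch ahead of the loop. */
    double sum_array(const double *array, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&array[i + 16]); /* fetch before we need it */
            sum += array[i];
        }
        return sum;
    }

    /* Pointer chasing: node->next is only known once the load of node
     * completes, so there is nothing to prefetch ahead of time. */
    struct node { struct node *next; double value; };

    double sum_list(const struct node *node) {
        double sum = 0.0;
        for (; node != NULL; node = node->next)
            sum += node->value; /* one full memory round-trip per iteration */
        return sum;
    }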

--------

Which is why servers use SMT / hyperthreading, so that the core can "switch" to another thread while waiting out those 50 nanoseconds / 200 cycles or so.


I don't really know how a tracing GC is implemented, but I was thinking it could do some smart memory ordering to land in the same cache line as often as possible.

Thanks for the clarifications :)


Interestingly, earlyish Smalltalk VMs used to keep the object headers in a separate contiguous table.

Part of the problem, though, is that the object-graph walk pretty quickly becomes non-contiguous, regardless of how it's laid out in memory.


But that’s just the marking phase, isn’t it? And most of it can be done fully in parallel. While not all CPU cores can be maxed out with that, more often than not the original problem itself is hard to parallelize to that level, so “wasting” a single core on the GC may very well be worth it.


You prefetch by having node->next as close as possible to node. You do that by using an allocator that tries very hard to ensure this.

That doesn't work so well for a GC, but for specific workloads it can work very nicely.
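A minimal sketch of that idea, assuming a toy bump/arena allocator (names are made up for illustration): nodes are handed out back-to-back from one block, so node->next usually lands in the same or an adjacent cache line, and a sequential walk looks almost like an array walk to the prefetcher.

    #include <stdlib.h>

    struct node { struct node *next; int value; };

    struct arena {
        struct node *slots;     /* one contiguous block of nodes */
        size_t used, capacity;
    };

    /* Hand out nodes in allocation order; consecutive allocations are
     * adjacent in memory, keeping node->next close to node. */
    static struct node *arena_alloc(struct arena *a) {
        if (a->used == a->capacity)
            return NULL;        /* a real allocator would grow or chain blocks */
        return &a->slots[a->used++];
    }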


Yeah, that's a fine point.

I always like pointing out Knuth's dancing links algorithm for exact-cover problems. All "links" in that algorithm are of the form "1 -> 2 -> 3 -> 4 -> 5" at algorithm start.

Then, as the algorithm "guesses" particular coverings, it turns into "1->3->4->5", or "1->4", that is, always monotonically increasing.

As such, no dynamic memory is ever needed. The linked list is "statically" allocated at the start of the program, and always traversed in memory order.

Indeed, Knuth designed the scheme as "imagine doing malloc/free" to remove each link, but then later, "free/malloc" to undo the previous steps (because in exact-cover backtracking, you'll try something, realize it's a dead end, and need to backtrack). Instead of a malloc followed up by a later free, you "just" drop the node out of the linked list, and later reinsert it. So the malloc/free is completely redundant.
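The "drop and reinsert" trick itself is tiny; here's a rough C sketch of the horizontal links (field and function names are illustrative):

    struct dlx_node {
        struct dlx_node *left, *right;
    };

    /* "Free" without freeing: splice x out of its list. x keeps its own
     * left/right pointers, which is exactly what makes the undo possible. */
    void dlx_remove(struct dlx_node *x) {
        x->left->right = x->right;
        x->right->left = x->left;
    }

    /* Backtrack: put x back exactly where it was; no malloc/free needed. */
    void dlx_reinsert(struct dlx_node *x) {
        x->left->right = x;
        x->right->left = x;
    }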

In particular: a given "guess" into an exact-cover problem can only "undo" its backtracking up to the full problem scope. From there, each "guess" only removes possibilities. So you use the "maximum" amount of memory at program start, you "free" (but not really) nodes each time you try a guess, and then you "reinsert" those nodes to backtrack to the original scope of the problem.

Finally, once you realize that, you might as well lay them all out in order, not only for simplicity but also for speed on modern computers (prefetching and all that jazz).

It's a very specific situation, but... it does happen sometimes.



