1. Superscalar (multiple instructions in one clock tick)
2. SIMD (not as good as GPU, but still substantial)
Ryzen 9 7950X (16 Zen 4 cores) has AVX-512. It's only 256 bits wide, but there are 4 pipelines (!!). (Actually 8 pipelines, but only 4 of them are SIMD. It's complicated; let's just assume 4.)
So each pipeline can handle 8 × 32-bit integers per clock, and up to 4 pipelines can be active at once. I do believe Zen 4 allows all 4 pipelines to be performing the classic FMA / multiply-accumulate instruction (very important for padding those matrix-multiplication TFLOPS numbers), though how much you want to count this is... rather difficult, because the 4 pipelines don't all have the same capabilities.
-----------
I can't say I've tested this out myself. But the Ryzen 9 7950X has a theoretical throughput of 16 cores x 8 ops per pipeline x 4 pipelines == 512 x 32-bit ops per clock tick, and is therefore 4x faster than your "128-core" estimate.
Running this code in practice is difficult... but similar to GPU programming. Intel ispc is a good tool for generating optimized AVX-512 in the "GPU programming style". I know that OpenMP has a `simd` directive, and the autovectorizers in GCC/Clang/LLVM are getting better.
A lot of options, all difficult but "valid"... at least valid for a simple, brute-force, embarrassingly parallel problem like the one discussed here. Things get complicated quickly in this space once we talk about practical applications...
------------
Note: achieving this in practice requires some very careful coding and magic. You'll need to fit inside the micro-op cache, keep the pipelines fully fed, etc. etc.