Well, Haswell does have 2x the integer and 4x the FPU performance. A lot of the potential in AVX/AVX2 is still untapped for now. It sure doesn't help that so few bother upgrading...

Then again, I'm often memory bandwidth bound even when using just SSE. If the memory can't deliver enough data to work with, there's bound to be a performance ceiling.



Be aware, though, that the 4x FPU number is very misleading, because it only applies if every instruction you execute is a vectorized FMA (fused multiply-add). AVX is massively untapped, but Intel's peak-FLOPS numbers based on SIMD FMA are only true in the most technical sense.
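For reference, here's roughly where the headline figure comes from (a sketch of mine; it assumes the two 256-bit FMA ports a Haswell core has, and "fma8" is just a name I made up):

    #include <immintrin.h>

    /* One AVX2 FMA: returns a * b + c on 8 floats at once, i.e. 16
       single-precision flops in a single instruction. With two FMA
       ports, a Haswell core can retire two of these per cycle:
       32 flops/cycle, which is the number behind the peak-FLOPS
       marketing. Any non-FMA instruction in the mix drags you
       below that peak. */
    __m256 fma8(__m256 a, __m256 b, __m256 c) {
        return _mm256_fmadd_ps(a, b, c);
    }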

Also, it is unlikely that you are actually memory bandwidth bound unless you are looping through linear memory on every core. It is fairly tricky to make useful software that is not memory latency bound.
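A toy illustration of the difference (my own sketch, not anyone's real code): the first loop streams linearly, so the prefetcher can run ahead and you hit the bandwidth wall; the second chases pointers, so every cache miss is paid in full and you're latency bound.

    #include <stddef.h>

    typedef struct node { struct node *next; long pad[7]; } node;

    /* Bandwidth bound: independent loads, prefetcher streams ahead. */
    long sum_linear(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    /* Latency bound: each load depends on the previous one. */
    size_t chase(const node *p) {
        size_t hops = 0;
        while (p) { p = p->next; hops++; }
        return hops;
    }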


I know that. But I also know that FMA is what you're almost always doing with an FPU, in terms of consumed clock cycles.

So I don't really think it's that misleading. It's the go-to instruction.

> Also, it is unlikely that you are actually memory bandwidth bound unless you are looping through linear memory on every core.

Well, that's what Intel VTune says. If I'm processing stuff at a total bandwidth of 30 GB/s, I tend to believe it. Each core is dual-issue, so with SSE you can process up to 32 bytes per cycle per core. With 4 cores at 3.5 GHz, that adds up. The CPU beast can consume an order of magnitude more data than DRAM can supply, even with just SSE. So you can do a lot of operations on the data and still be memory bound.
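Spelled out (my arithmetic, assuming dual-issue 128-bit SSE loads):

    2 issues/cycle x 16 B x 3.5 GHz x 4 cores ~= 448 GB/s potential consumption
    vs. ~30 GB/s delivered by DRAM -> roughly a 15x gap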


> I know that. But I also know FMA is what you're almost always doing with a FPU, in terms of consumed clock cycles.

That is not a true generalization at all. Maybe it holds if you are doing image compositing, but FMA instructions are not so heavily used that they can be thought of as the single workhorse of a processor.


> but FMA instructions are not so heavily used that they can be thought of as the single workhorse of a processor.

Well, indeed, a lot of cycles are lost to moving data around, branching, and waiting for memory. But when it comes to floating-point computation, FMA really is common. In floating-point inner loops you constantly need to do "x := x + a * b".

I didn't say it's the single workhorse of the whole processor, but for DSP-type stuff FMA rocks. Dot product, matmul, even FFT -- a lot of FPU-heavy computation directly benefits from FMA.
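For instance, a minimal AVX2 dot product (my sketch; the function name and the horizontal-sum tail are made up, and n is assumed to be a multiple of 8) where the hot loop is nothing but loads and FMAs:

    #include <immintrin.h>
    #include <stddef.h>

    /* Dot product: the inner loop is exactly "acc += a * b". */
    float dot(const float *a, const float *b, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);  /* acc += va * vb */
        }
        /* Horizontal sum of the 8 lanes. */
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        lo = _mm_add_ps(lo, hi);
        lo = _mm_hadd_ps(lo, lo);
        lo = _mm_hadd_ps(lo, lo);
        return _mm_cvtss_f32(lo);
    }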

Anyway, even without FMA, you can do a lot of ops per value and still be memory bound.



