--- I accidentally deleted this comment, so I've re-written it. ---
Disclaimer: I'm an HPC system administrator at a relatively big academic supercomputer center. I also develop scientific applications to run on these clusters.
> Linus is mostly wrong except for HPC. Very few dev pipelines for folks result in native executables. The vast majority of code is delivered as either source (Python, Ruby, etc.) or bytecode (JVM, Scala, etc.).
Scientific applications targeted at HPC environments contain the most hardcore CPU optimizations. They are compiled for the target CPU architecture, and in some cases the code inside is duplicated and optimized per processor family. Python is run under PyPy with optimized C bindings; the JVM is generally used for UIs or some very old applications, and Scala is generally used in industrial applications.
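As a minimal sketch of that "duplicated per processor family" idea (the function and loop are purely illustrative, not from any real code base), GCC's function multi-versioning on Linux compiles one clone per listed ISA and the dynamic loader dispatches to the best match for the host CPU:

    #include <cstddef>
    #include <cstdio>

    // Hypothetical hot loop: GCC emits one clone per listed target
    // and picks the best one for the host CPU at load time (IFUNC).
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    void scale(float *x, std::size_t n, float a) {
        for (std::size_t i = 0; i < n; ++i)
            x[i] *= a;  // auto-vectorized differently in each clone
    }

    int main() {
        float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        scale(buf, 8, 2.0f);
        std::printf("%f\n", buf[0]);  // 2.0
    }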
> And the Xeon class machines folks deploy to in data center envs is a world apart from their MacBooks.
No, they aren't. Xeon servers generally have more memory bandwidth and more resiliency checks (ECC, platform checks, etc.). Assuming your MacBook Pro has a same-generation CPU at a relatively close frequency to your Xeon server, per-core performance will be very similar. There won't be special instructions, frequency-enhancing gimmicks, or different instruction latencies. If you optimize well, you can get the same performance from your laptop as from the server. The server will scale better and will be much more resilient in the end, but the differences end there.
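If you want to check that per-core claim yourself, a crude single-core micro-benchmark (entirely illustrative; the workload is a stand-in) run on both machines should land in the same ballpark when the CPUs are from the same generation and clocked similarly:

    #include <chrono>
    #include <cstdio>

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        double acc = 0;
        for (long i = 0; i < 200000000L; ++i)
            acc += 1.0 / (i + 1);  // serial dependency chain, one core
        auto t1 = std::chrono::steady_clock::now();
        std::printf("acc=%f  %.3f s\n", acc,
                    std::chrono::duration<double>(t1 - t0).count());
    }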
> Even those creating native binaries, this is done through ci/cd pipelines.
Cross compilation is a black box that can introduce behavioral differences into your code that you cannot test in-house, especially if you're doing cutting-edge optimizations at the source code level.
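For example, if a binary picks its code path from the host's CPU features at run time, that decision is only ever exercised on hardware you may never have touched in-house. A minimal sketch using GCC's built-in feature checks (the paths printed are placeholders):

    #include <cstdio>

    int main() {
        // Required before the feature checks when called very early;
        // harmless here.
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx512f"))
            std::puts("taking the AVX-512 path");
        else if (__builtin_cpu_supports("avx2"))
            std::puts("taking the AVX2 path");
        else
            std::puts("taking the scalar fallback");
    }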
Isn't turbo boost an issue when comparing/profiling? My experience with a video generation/encoding run of about 30 seconds was that my MacBook outperformed the server Xeons... if left to cool down for a few minutes between test runs. Otherwise a test run of 30 seconds would suddenly jump to over a minute.
The Xeons, though, always took about 40 seconds, but were consistent in that runtime (and were able to do more of the same runs in parallel without losing performance).
> Isn't turbo boost an issue when comparing/profiling?
No. In the HPC world, profiling is not always done via timing. Instead, tools like perf are used to inspect CPU saturation and instruction hit/retire/miss ratios, and the same goes for cache hits and misses. For more detailed analysis, tools like Intel Parallel Studio or its open-source equivalents are used. Timings are used too, but for scaling and feasibility tests, i.e. to check whether the runtime is acceptable for that kind of job.
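On the command line this is the `perf stat ./app` workflow; programmatically, the same counters come from the Linux `perf_event_open` syscall. A minimal sketch counting retired instructions around a stand-in workload (error handling omitted; may need a permissive perf_event_paranoid setting):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdio>

    // Thin wrapper: glibc provides no stub for perf_event_open.
    static int perf_open(unsigned type, unsigned long long config) {
        perf_event_attr attr{};
        attr.type = type;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main() {
        int fd = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile double acc = 0;  // workload stand-in
        for (int i = 0; i < 1000000; ++i) acc += i * 0.5;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long instructions = 0;
        read(fd, &instructions, sizeof(instructions));
        std::printf("retired instructions: %lld\n", instructions);
        close(fd);
    }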
OTOH, in a healthy system room environment, the server's cooling system and the room temperature should keep the server's temperature stable, which means your timings shouldn't deviate too much. If lots of cores are idle, you can expect a lot of turbo boost; at higher core utilization, you should expect no turbo boost, but no throttling either. If timings start to deviate too much, Intel's powertop can help.
> My experience with a video generation/encoding run of about 30 seconds was that my MacBook outperformed the server Xeons...
If the CPUs are from the same family and their speeds are comparable, your servers may have turbo boost disabled.
> Otherwise a test run of 30 seconds would suddenly jump to over a minute.
This seems like thermal throttling due to overheating.
> The Xeons, though, always took about 40 seconds, but were consistent in that runtime (and were able to do more of the same runs in parallel without losing performance).
Servers have many options for fine-tuning CPU frequency response and limits. Your servers may have turbo boost disabled outright; alternatively, if you saturate all the cores, turbo boost is also disabled due to the in-package thermal budget.
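On Linux servers using the intel_pstate driver, the turbo state is visible (and tunable, as root) in sysfs; a quick check, assuming that driver is in use on the host:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // intel_pstate exposes turbo as an inverted flag:
        // 1 means Turbo Boost is disabled.
        std::ifstream f("/sys/devices/system/cpu/intel_pstate/no_turbo");
        std::string v;
        if (f >> v)
            std::cout << "no_turbo = " << v
                      << (v == "1" ? " (turbo disabled)\n"
                                   : " (turbo enabled)\n");
        else
            std::cout << "intel_pstate not in use on this host\n";
    }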
If you have any more questions, I'll do my best to answer.
I am not sure we are disagreeing on much, but the 4-core i7 in my dev MacBook is a whole lot different from the dual-socket, 56-core machines we run on.
Optimizations that need to happen don't happen locally; they get tuned on a node in the cluster. Look at all the work Goto has done on GotoBLAS.
We agree on HPC; however, I also agree with Linus about non-HPC loads. Software and developers are always more expensive than hardware, but scaling beyond a certain point in hardware (number of servers, or the GPUs you need) drives hardware and maintenance costs up, so the difference becomes negligible or the maintenance becomes unsustainable. This is why everyone is trying to run everything faster within the same power budget. In the end, past a certain point, everyone wants to run native code on the backend to reap the full power of the hardware they have.

This is why I think Linus is right about ARM. It's not that I don't support them, but they need to be able to ship desktops or "daily driver" computers that support development. Java's motto was "write once, run everywhere", and it was not enough to stop the migration to x86. Behavioral uniformity is peace of mind, and a very big piece of it TBH.
What I wanted to say is: unless the code you are writing consists of interdependent threads and the minimum thread count is higher than what your laptop provides, you can do 99% of the optimization on your laptop. If the job is single-threaded or the threads are independent, the per-core performance you obtain on your laptop is very similar to the performance you get on the server.
For BLAS stuff I use Eigen, hence I don't have experience with xBLAS and libFLAME, sorry.
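For what it's worth, the appeal is that a dense GEMM in Eigen is a few lines and needs no external BLAS; a minimal sketch (the matrix sizes are arbitrary):

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        // Eigen generates its kernels at compile time, so building
        // with -march=native on the target machine matters for peak
        // throughput (which ties back to the laptop-vs-server point).
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(256, 256);
        Eigen::MatrixXd B = Eigen::MatrixXd::Random(256, 256);
        Eigen::MatrixXd C = A * B;  // dense GEMM, no external BLAS
        std::cout << "C(0,0) = " << C(0, 0) << "\n";
    }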
From a hardware perspective, a laptop and a server are not that different: just some different controllers and resiliency features.