--- I accidentally deleted this comment, so I've re-written it. ---
Disclaimer: I'm an HPC system administrator at a relatively big academic supercomputer center. I also develop scientific applications to run on these clusters.
> Linus is mostly wrong except for HPC. Very few dev pipelines for folks result in native executables. The vast majority of code is delivered as either source (Python, Ruby, etc.) or bytecode (JVM, Scala, etc.).
Scientific applications targeted at HPC environments contain the most hardcore CPU optimizations. They are compiled for the target CPU architecture, and in some cases the code inside is duplicated and optimized per processor family. Python is run under PyPy with optimized C bindings; the JVM is generally used for UIs or some very old applications, and Scala is generally used in industrial applications.
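As a minimal sketch of that "duplicated per processor family" idea (the function and loop are purely illustrative, not from any real code base), GCC's function multi-versioning on Linux compiles one clone per listed ISA and the dynamic loader dispatches to the best match for the host CPU:

    #include <cstddef>
    #include <cstdio>

    // Hypothetical hot loop: GCC emits one clone per listed target
    // and picks the best one for the host CPU at load time (IFUNC).
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    void scale(float *x, std::size_t n, float a) {
        for (std::size_t i = 0; i < n; ++i)
            x[i] *= a;  // auto-vectorized differently in each clone
    }

    int main() {
        float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        scale(buf, 8, 2.0f);
        std::printf("%f\n", buf[0]);  // 2.0
    }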
> And the Xeon class machines folks deploy to in data center envs is a world apart from their MacBooks.
No, they aren't. Xeon servers generally have more memory bandwidth and more resiliency checks (ECC, platform checks, etc.). Assuming your MacBook Pro has a same-generation CPU at a relatively close frequency to your Xeon server, per-core performance will be very similar. There won't be special instructions, frequency-enhancing gimmicks, or different instruction latencies. If you optimize well, you can get the same performance from your laptop as from the server. The server will scale better and will be much more resilient in the end, but the differences end there.
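If you want to check that per-core claim yourself, a crude single-core micro-benchmark (entirely illustrative; the workload is a stand-in) run on both machines should land in the same ballpark when the CPUs are from the same generation and clocked similarly:

    #include <chrono>
    #include <cstdio>

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        double acc = 0;
        for (long i = 0; i < 200000000L; ++i)
            acc += 1.0 / (i + 1);  // serial dependency chain, one core
        auto t1 = std::chrono::steady_clock::now();
        std::printf("acc=%f  %.3f s\n", acc,
                    std::chrono::duration<double>(t1 - t0).count());
    }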
> Even those creating native binaries, this is done through ci/cd pipelines.
Cross compilation is a black box that can introduce behavioral differences into your code that you cannot test in-house, especially if you're doing cutting-edge optimizations at the source code level.
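For example, if a binary picks its code path from the host's CPU features at run time, that decision is only ever exercised on hardware you may never have touched in-house. A minimal sketch using GCC's built-in feature checks (the paths printed are placeholders):

    #include <cstdio>

    int main() {
        // Required before the feature checks when called very early;
        // harmless here.
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx512f"))
            std::puts("taking the AVX-512 path");
        else if (__builtin_cpu_supports("avx2"))
            std::puts("taking the AVX2 path");
        else
            std::puts("taking the scalar fallback");
    }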
Isn't turbo boost an issue when comparing/profiling? My experience with a video generation/encoding run of about 30 seconds was that my MacBook outperformed the server Xeons... if left to cool down for a few minutes between test runs. Otherwise a test run of 30 seconds would suddenly jump to over a minute.
The Xeons, though, always took about 40 seconds, but were consistent in that runtime (and were able to do more of the same runs in parallel without losing performance).
> Isn't turbo boost an issue when comparing/profiling?
No. In the HPC world, profiling is not always done via timing. Instead, tools like perf are used to inspect CPU saturation and instruction hit/retire/miss ratios, and the same goes for cache hits and misses. For more detailed analysis, tools like Intel Parallel Studio or its open-source equivalents are used. Timings are used too, but for scaling and feasibility tests, i.e. to check whether the runtime is acceptable for that kind of job.
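On the command line this is the `perf stat ./app` workflow; programmatically, the same counters come from the Linux `perf_event_open` syscall. A minimal sketch counting retired instructions around a stand-in workload (error handling omitted; may need a permissive perf_event_paranoid setting):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdio>

    // Thin wrapper: glibc provides no stub for perf_event_open.
    static int perf_open(unsigned type, unsigned long long config) {
        perf_event_attr attr{};
        attr.type = type;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main() {
        int fd = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile double acc = 0;  // workload stand-in
        for (int i = 0; i < 1000000; ++i) acc += i * 0.5;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long instructions = 0;
        read(fd, &instructions, sizeof(instructions));
        std::printf("retired instructions: %lld\n", instructions);
        close(fd);
    }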
OTOH, in a healthy system room environment, the server's cooling system and the room temperature should keep the server's temperature stable, which means your timings shouldn't deviate too much. If lots of cores are idle, you can expect a lot of turbo boost; at higher core utilization, you should expect no turbo boost, but no throttling either. If timings start to deviate too much, Intel's powertop can help.
> My experience with a video generation/encoding run of about 30 seconds was that my MacBook outperformed the server Xeons...
If the CPUs are from the same family and their speeds are comparable, your servers may have turbo boost disabled.
> Otherwise a test run of 30 seconds would suddenly jump to over a minute.
This seems like thermal throttling due to overheating.
> The Xeons, though, always took about 40 seconds, but were consistent in that runtime (and were able to do more of the same runs in parallel without losing performance).
Servers have many options for fine-tuning CPU frequency response and limits. Your servers may have turbo boost disabled outright; alternatively, if you saturate all the cores, turbo boost is also disabled due to the in-package thermal budget.
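On Linux servers using the intel_pstate driver, the turbo state is visible (and tunable, as root) in sysfs; a quick check, assuming that driver is in use on the host:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // intel_pstate exposes turbo as an inverted flag:
        // 1 means Turbo Boost is disabled.
        std::ifstream f("/sys/devices/system/cpu/intel_pstate/no_turbo");
        std::string v;
        if (f >> v)
            std::cout << "no_turbo = " << v
                      << (v == "1" ? " (turbo disabled)\n"
                                   : " (turbo enabled)\n");
        else
            std::cout << "intel_pstate not in use on this host\n";
    }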
If you have any more questions, I'll do my best to answer.
I am not sure we are disagreeing on much, but the 4-core i7 in my dev MacBook is a whole lot different from the dual-socket, 56-core machines we run on.
Optimizations that need to happen don't happen locally; they get tuned on a node in the cluster. Look at all the work Goto has done on GotoBLAS.
We agree on HPC; however, I also agree with Linus about non-HPC loads. Software and developers are always more expensive than hardware, but scaling beyond a certain point in hardware (number of servers, or the GPUs you need) drives hardware and maintenance costs up, so the difference becomes negligible or the maintenance becomes unsustainable. This is why everyone is trying to run everything faster within the same power budget. In the end, past a certain point, everyone wants to run native code on the backend to reap the full power of the hardware they have.

This is why I think Linus is right about ARM. It's not that I don't support them, but they need to be able to ship desktops or "daily driver" computers that support development. Java's motto was "write once, run everywhere", and it was not enough to stop the migration to x86. Behavioral uniformity is peace of mind, and a very big piece of it TBH.
What I wanted to say is: unless the code you are writing consists of interdependent threads and the minimum thread count is higher than what your laptop provides, you can do 99% of the optimization on your laptop. If the job is single-threaded or the threads are independent, the per-core performance you obtain on your laptop is very similar to the performance you get on the server.
For BLAS stuff I use Eigen, hence I don't have experience with xBLAS and libFLAME, sorry.
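For what it's worth, the appeal is that a dense GEMM in Eigen is a few lines and needs no external BLAS; a minimal sketch (the matrix sizes are arbitrary):

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        // Eigen generates its kernels at compile time, so building
        // with -march=native on the target machine matters for peak
        // throughput (which ties back to the laptop-vs-server point).
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(256, 256);
        Eigen::MatrixXd B = Eigen::MatrixXd::Random(256, 256);
        Eigen::MatrixXd C = A * B;  // dense GEMM, no external BLAS
        std::cout << "C(0,0) = " << C(0, 0) << "\n";
    }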
From a hardware perspective, a laptop and a server are not that different: just some different controllers and resiliency features.