
There seems to be little attempt to ensure the ambient environment of the processors is isothermal. That is, there's likely significant chassis to chassis thermal variation that is as big or larger than any difference in thermal budget between processors.

Even things like the characteristics of the application of the thermal paste and the roughness of the individual fan ducts can matter, beyond the obvious bottom of rack vs. top of rack, position of processor in individual chassis, etc effects.

tl;dr-- probably most of what is measured is not silicon to silicon variation.



I have a pair of "identical" GPUs which have dramatically different performance. Swapping the order didn't help. It turns out it was because one was blowing hot air into the other. Sigh.


I have seen this happen before. In the 1990's I was an audio engineer for a radio station that had just spent millions on digital suites with hard drives that would crash regularly, destroying in-progress work.

The station actually got techs to fly to Australia to try to diagnose the problem -- A bunch of SCSI disks stacked vertically was causing the topmost drive to overheat.


Had a similar experience with a JBOD chassis. The vibrations of the spinning drives caused resonance on some positions. The cure was to attach a patch of duct tape with a washer at a certain point on the chassis.




Heh, I remember in the late 90s I put some rubber between the HDDs and the case to greatly reduce the noise of my desktop PC.


My previous PC had the option of inserting harddisks in rubber bands instead of screwing them to the frame. It was a case specifically aimed at noise reduction (the Antec Solo).


I have seen rubber 'sleeves' for HDDs.


I don't understand why this isn't a standard practice.


That sounds like a really easy way to make a hard drive overheat.

Way back, I had a 4GB SCSI hard drive that was extra tall, extra fast, extra noisy, and extra hot. I constructed a thick rubber box around it to try to noise-proof it a little, but I also had to strap a CPU cooler to it and have the airflow enter the box and exit at a hole in the box after wrapping all the way round the hard drive. It worked. But just wrapping it in rubber would have been a very quick way to cook the drive. This is a drive that, if just left running bare on a table, would get too hot to touch.

Hard drives produce a heck of a lot less heat these days.


An elastic mount does not have to be a suffocating box.

Plus, and I could be wrong here - the mounts are not really designed to be heat conductors (the rails are plastic on my high end Dell workstation); I suppose the heat is extracted via air flow.


I laughed so hard at this and then realized this was probably the source of my SLI woes way back in the day.

But for real, I recently replaced the thermal paste on four 10-year-old servers and performance improved by 20%!! A full core of performance, just by ensuring proper heat transfer.


Does thermal paste age well? I wonder how much of that was bad initial application vs. the ravages of time.


AMD gave us a dual socket test system (back in the Phenom days) which started failing after a couple of years of very heavy use for builds, CI, etc. After much head scratching I eventually found that the thermal paste had dried out, and simply reattaching the CPU fans with new paste fixed the system and it lived for quite a few more years after that.


That is excellent advice. I've had machines that started out pretty quiet but got noisier as they got older. I figured it was gunk accumulating on the fans or other parts, but I guess checking the paste is a good idea too.


It probably depends on the type of paste, but I've found completely dried out paste in old systems that would just fall into pieces when you touch it.


Depends on the paste, but it's safe to say 10 years is probably beyond its service life. It's $10 for a tube of Arctic Silver (I went with Arctic's organic, non-conductive one, I forget the name) that you can use for 10 or more applications. It takes 5 minutes to replace the thermal paste on a CPU.

Remember, Intel CPUs will simply throttle themselves at high temps, so replacing the paste will allow heat dissipation across all the cores (as I witnessed first hand).
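As an aside, if you want to check whether a Linux box with an Intel CPU has actually been throttling, the kernel exposes per-core throttle counters in sysfs. A quick sketch (these files may be absent on non-Intel hardware or older kernels):

    import glob

    # Intel cores expose a count of thermal-throttle events since boot in sysfs;
    # the files may be missing on non-Intel CPUs or older kernels.
    for path in sorted(glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/thermal_throttle/core_throttle_count")):
        cpu = path.split("/")[5]  # e.g. "cpu0"
        with open(path) as f:
            count = int(f.read().strip())
        if count:
            print(f"{cpu}: throttled {count} times since boot")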


That was problematic in the heyday of all the various Unix vendors. It was pretty common for some machines to vent front to back, others back to front, or to have large differences in depth, etc. So if you weren't paying attention, you could easily make a multi-stage heater stack in a single rack.

Probably still an issue today, though there are a lot fewer vendors and less variation.


Old supercomputers - those built before the days of rows of racks - had such character, though! I worked at PNNL in the late 90s, and while they did have newer supercomputers that had a more traditional rack design, they also had a number of older ones that were obviously designed to be seen as well as used.


Damn. When I trip up on that kind of thing, I always just wonder at how absurd it is.

Two modern miracles. Insanely complex and capable tech.

Simple. Hot. Air. Basics.

Oh well, maybe laugh. I would.


Seymour Cray referred to himself as an "overpaid plumber" for good reason. Heat management has been a critical element of high performance computing for a long time.


If I recall, one iteration of their products sprayed a shower of cooled liquid Fluorinert[1] directly into the processor cards, with no heatsinks at all. [2]

[1] https://en.m.wikipedia.org/wiki/Fluorinert [2] https://en.m.wikipedia.org/wiki/Cray-2


It was fully immersed! The Fluorinert was circulated through the stacked cards, then re-chilled with a cold water heat exchanger.


There are a few SaaS/off-the-shelf software products out there that capitalize on the per-chassis/compute-unit performance difference amongst identical SKUs.

My employer runs a few HPC grid workloads on AWS and Azure, and this software would benchmark the instances on boot and drop the ones that were below a certain threshold; we'd keep spawning instances until the whole fleet was past our threshold. (Ultimately, we didn't buy the software.)
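Roughly the shape of the loop, as a minimal sketch -- launch_instance, run_benchmark and terminate here are hypothetical stand-ins for provider SDK calls, not the actual product's API:

    import statistics

    def build_fleet(target_size, min_score, launch_instance, run_benchmark, terminate):
        """Launch instances, cull any that benchmark below min_score, and
        repeat until target_size fast instances are in hand."""
        keepers = []
        while len(keepers) < target_size:
            inst = launch_instance()        # provider-specific call
            score = run_benchmark(inst)     # e.g. a short GFLOPS kernel run at boot
            if score >= min_score:
                keepers.append((inst, score))
            else:
                terminate(inst)             # slow host: give it back and try again
        print("median score of kept fleet:",
              statistics.median(s for _, s in keepers))
        return [inst for inst, _ in keepers]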

I wondered (not too hard) why there could be such a difference in performance, but your comment has given some good insight into that.


This particular solution doesn't sound feasible in the long term. If a significant portion of customers start using this, the benefit is nullified. And providers will probably react.

It's very similar to the "download boosters" of old which worked until they got popular which is when download sites started throttling.


don't think this would work on GCP since they regularly perform 'live migrations' https://cloud.google.com/compute/docs/instances/live-migrati...


I think they don't live-migrate pre-emptible instances: https://cloud.google.com/compute/docs/instances/live-migrati...


They just preempt them when they need to migrate.


Yes. So at least you explicitly know when you are getting new physical computers, so you can run your selection process again.


Well-thought-out response; that is my gut feeling as well. I think packaging could have a larger than expected effect too, especially for CPUs with a thermal cap.

This kind of in situ measurement might be useful to feed into a job scheduler's weights or something, but at the end of the day "the environment is different," and maybe you have to derate your cluster's expected performance to account for it?
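For instance, normalizing each node's measured throughput against the cluster median would give a per-node weight a scheduler could consume -- just a sketch of that idea, not any particular scheduler's actual interface:

    import statistics

    def node_weights(measured_gflops):
        """Turn per-node benchmark results into weights relative to the
        cluster median, so slower nodes get proportionally less work."""
        median = statistics.median(measured_gflops.values())
        return {node: gflops / median for node, gflops in measured_gflops.items()}

    # Example: "n03" ends up weighted about 5% below a median node.
    print(node_weights({"n01": 1.02, "n02": 1.00, "n03": 0.95}))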


Not just that, but also variations in the motherboard (chipset and ancillary chips), memory, timing crystals - many factors that are equally susceptible to micro-variations, which cumulatively stack up.

Then there is that lovely gotcha that unless all your components are from the same batch, you may well see small changes in things like ancillary chips upon motherboards due to chip supplies being from multiple vendors or even vendor variation.

Many variables widen the margin of error, and it would have been nice if they at least had one system in which they tested a few CPUs, to eliminate those extra variables. Ideally, look at the results they got, take the 5 slowest and 5 fastest CPUs based on those results, retest them on a single system and compare. That would have been great and made the difference.


Why is it that people on this site seem to always believe that they know better than the person who took the time to do the work that is cited? The amount of "armchair expertise" here is sickening.

Do we really think that someone who is looking at this type of thing in enough detail to produce the graphs in the article wouldn't think of ambient temperature differential? REALLY?!

Something not being documented well enough for us does not mean that it wasn't considered... Something that we thought of going unmentioned does not indicate that it was not considered or tried.

This article doesn't mention power supply equivalency or anything about how many people were present in the datacenter during a given test. Here comes the patented HN 'hot take': "uh well actually they don't mention power supplies - are we sure these devices were even powered on?"


Just further lols for you to consider -- the data shows a large difference, with the processors in slot 0 being significantly slower than the processors in slot 1. Clearly silicon variation /s.

> Third, the median processor performance (shown by the green, blue and red lines) between processor 0 and processor 1 on Sandy Bridge show up to 1% difference. However, that difference in median processor performance increases to up to 5% on Broadwell

This socket-to-socket variability alone explains a third of the magnitude of the difference measured in the study -- which makes it quite clear we're not just measuring silicon variability.


LOL. Maybe have a coffee or find another way to chill out.

The paper itself is pretty clear with the methodology: the throughput of each processor was measured in an installed supercomputing cluster. There are obviously sources of variation in processor performance in a supercomputing cluster beyond the actual silicon plugged into the socket. The experiment has no ability to control for this variation, and no attempt was made.

The reason why we have research and research papers is so we can read about the methodology and think about the limitations of what is measured. It's an interesting measurement that the researchers made in situ; but it also has obvious limitations.


The cited paper doesn't attribute performance variation to silicon to silicon variation. Only the blog post does.


From the abstract:

> We show that this variation is further magnified under a hardware-enforced power constraint, potentially due to the increase in number of cores, inconsistencies in the chip manufacturing process and their combined impact on processor’s energy management functionality

and recommendations:

> • Characterizing node performance based on averaged performance distorts the true impact of manufacturing variation on processors on the node and therefore should be avoided


Well, yes, they do speculate in the abstract -- note the "potentially" there. It looks like the authors do believe the variation is caused by manufacturing variation, but nothing in the paper actually shows it. There's no attempt to determine the cause of the observed performance variation in the paper; it's an empirical survey.

Interestingly, that recommendation could be interpreted as saying that "empirical studies based on averaged node performance" give a very distorted view of "the true impact of manufacturing variation on processors".


The entire study seems to presuppose that most of what is measured is processor manufacturing variation. There are further recommendations about removing the variation with processor binning, etc.

It's an interesting set of measurements, but the assumed source of the variation is dubious, and it's not clear what actions, if any, it really supports.


Yeah. Maybe the authors know something we don't. Or maybe they were simply trying to get the paper accepted into a silicon-related conference.


I wonder how much of this is due to measurement error of temperature. Core frequency and voltage control are governed by some suspiciously round numbers like Tj(max) == 90C. But when the controller thinks Tj == 90C, what's the measurement error?


A moderate amount. Some of the frequency curve involves the thermal diode, but much less so when TDP capping is used as in the paper.

But the temperature of the silicon itself has a lot to do with the performance you get at a given power level.


Power limits have the same problem, don't they? Except it's maybe worse because you get the product of the error terms for Icc and Vcc?
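For what it's worth, to first order the relative errors add rather than multiply. For a reading P = IV with relative sensor errors on current and voltage:

    P_{\mathrm{meas}} = IV(1+\varepsilon_I)(1+\varepsilon_V)
                      = P(1 + \varepsilon_I + \varepsilon_V + \varepsilon_I\varepsilon_V)
                      \approx P(1 + \varepsilon_I + \varepsilon_V)

So two 1% sensors give roughly a 2% power error; the cross term is only about 0.01%.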


Thermal diodes on dies have a reasonable error.

Voltages and currents are easy to measure relatively precisely.





