In the '90s I was an architect on Intel's Willamette (Pentium 4) thermal throttle (TT1). TT1 "knocked the teeth" out of clock cycles if the checker-retirement unit (CRU, the hottest part of the die) got too hot. This evolved into TT2/Geyserville (where you move up/down the V/F curve to actively stay under the throttle limit). We were browbeaten by upper management to prove this would not visibly impact performance, and worked on one of the MANY MANY software simulators written throughout the company to prove this. (It was actually my favourite job there.) This is when the term "Thermal Design Power" arrived: coined by top marketing brass to avoid using "Max Power", which was far higher. It is possible to have almost a 2x difference between max power (running a "power virus", which Intel was terrified of, from chipsets to graphics to CPUs) and what typical apps use (thermal design power). Performance was a bit dodgy on a few apps, but not significantly so compared to run-to-run variation. (Remember this is 1995-1997, after the half-arsed Pentium fiasco in 1993 when Motorola openly mocked Intel for having a 16W CPU... FDIV wasn't a thermal fiasco, but it was a proper cock-up.)
Die are sorted based on something called a bin split: die are binned immediately after wafer sort based on their leakage. There are special transistors implanted near the scribe lines that indicate tons of characteristics, as well as DFX units throughout the die (rings of 20 inverters that oscillate) that also yield tons of data on how the die behave. However, testing those buggered DFX circuits takes an enormous amount of time, and you can't slow down wafer sort, so there are proxies.
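As a back-of-the-envelope illustration (my numbers, not from the comment above): a ring of N inverters oscillates at roughly f = 1/(2*N*t_pd), so its measured frequency is a cheap proxy for per-stage delay, and hence for how fast or leaky that region of the die came out.

```python
# Rough sketch, assuming a 15 ps per-inverter delay (made-up but plausible):
# a ring oscillator's frequency tracks the per-stage propagation delay, so
# faster (or leakier) silicon shows up directly as a higher frequency.
N_STAGES = 20          # rings of 20 inverters, as described above
T_PD = 15e-12          # assumed propagation delay per inverter, 15 ps

freq_hz = 1.0 / (2 * N_STAGES * T_PD)
print(f"ring oscillator frequency ~ {freq_hz / 1e6:.0f} MHz")   # ~1667 MHz
```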
The bins are designed in such a way to maximize profit and performance based on the die characteristics. Thermal throttle plays a role in this and each bin (among various vectors) is allowed some tolerance, which is exactly what OP has discovered. However, this has been going on for coming up on 30 years! So nothing really new here, I just thought I'd let you know that of course Intel is aware of this, and they never claim performance numbers outside of the tolerance allowed for thermal throttle.
There seems to be little attempt to ensure the ambient environment of the processors is isothermal. That is, there's likely significant chassis to chassis thermal variation that is as big or larger than any difference in thermal budget between processors.
Even things like how the thermal paste was applied and the roughness of the individual fan ducts can matter, beyond the obvious effects of bottom of rack vs. top of rack, position of the processor within an individual chassis, and so on.
tl;dr-- probably most of what is measured is not silicon to silicon variation.
I have a pair of "identical" GPUs which have dramatically different performance. Swapping the order didn't help. It turns out it was because one was blowing hot air into the other. Sigh.
I have seen this happen before. In the 1990's I was an audio engineer for a radio station that had just spent millions on digital suites with hard drives that would crash regularly, destroying in-progress work.
The station actually got techs to fly to Australia to try to diagnose the problem -- A bunch of SCSI disks stacked vertically was causing the topmost drive to overheat.
Had a similar experience with a JBOD chassis. The vibrations of the spinning drives caused resonance on some positions. The cure was to attach a patch of duct tape with a washer at a certain point on the chassis.
My previous PC had the option of inserting harddisks in rubber bands instead of screwing them to the frame. It was a case specifically aimed at noise reduction (the Antec Solo).
That sounds like a really easy way to make a hard drive overheat.
Way back, I had a 4GB SCSI hard drive that was extra tall, extra fast, extra noisy, and extra hot. I constructed a thick rubber box around it to try to noise-proof it a little, but I also had to strap a CPU cooler to it and have the airflow enter the box, wrap all the way around the hard drive, and exit through a hole in the box. It worked. But just wrapping it in rubber would have been a very quick way to cook the drive. This is a drive that, if just left running bare on a table, would get too hot to touch.
Hard drives produce a heck of a lot less heat these days.
An elastic mount does not have to be a suffocating box.
Plus, and I could be wrong here - the mounts are not really designed to be heat conductors (the rails are plastic on my high end Dell workstation); I suppose the heat is extracted via air flow.
I laughed so hard at this and then realized this was probably the source of my SLI woes way back in the day.
But for real, I recently replaced the thermal paste on four 10-year-old servers and performance improved by 20%!! A full core of performance, just by ensuring proper heat transfer.
AMD gave us a dual socket test system (back in the Phenom days) which started failing after a couple of years of very heavy use for builds, CI, etc. After much head scratching I eventually found that the thermal paste had dried out, and simply reattaching the CPU fans with new paste fixed the system and it lived for quite a few more years after that.
That is excellent advice. I've had machines that started out pretty quiet but got noisier as they got older. I figured it was gunk accumulating on the fans or other parts, but I guess checking the paste is a good idea too.
Depends on the paste, but it's safe to say 10 years is probably beyond its service life. It's $10 for a tube of Arctic Silver (I went with Arctic's organic, non-conductive one, I forget the name) that you can use for 10 or more applications, and it takes 5 minutes to replace the thermal paste on a CPU.
Remember, Intel CPUs will simply throttle themselves at high temps, so replacing the paste restores heat dissipation across all the cores (as I witnessed first-hand).
That was problematic in the heyday of all the various Unix vendors. It was pretty common for some machines to vent front to back and others back to front, with large differences in depth, etc. So if you weren't paying attention, you could easily build a multi-stage heater stack in a single rack.
Probably still an issue today, though there's a lot fewer vendors and variation.
Old supercomputers - those built before the days of rows of racks - had such character, though! I worked at PNNL in the late 90s, and while they did have newer supercomputers that had a more traditional rack design, they also had a number of older ones that were obviously designed to be seen as well as used.
Seymour Cray referred to himself as an "overpaid plumber" for good reason. Heat management has been a critical element of high-performance computing for a long time.
If I recall, one iteration of their products sprayed a shower of cooled liquid Fluorinert[1] directly into the processor cards, with no heatsinks at all. [2]
There are a few SaaS/conventional software products out there that capitalize on the per-chassis/per-compute-unit performance differences amongst identical SKUs.
My employer runs a few HPC grid workloads on AWS and Azure, and this software would benchmark the instances on boot and drop the ones that were below a certain threshold; we'd keep spawning instances until the fleet was up past our threshold. (Ultimately, we didn't buy the software.)
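Not that vendor's product, just a minimal sketch of the loop described above, with hypothetical launch_instance/run_benchmark/terminate callables standing in for your cloud SDK calls:

```python
# Hedged sketch: keep launching instances, benchmark each on boot, release
# the slow ones, and stop once enough fast instances survive.
def build_fleet(target_size, min_score, launch_instance, run_benchmark, terminate):
    fleet = []
    while len(fleet) < target_size:
        inst = launch_instance()
        score = run_benchmark(inst)   # e.g. a short STREAM/LINPACK run
        if score >= min_score:
            fleet.append(inst)        # keeper
        else:
            terminate(inst)           # below threshold: give it back, roll again
    return fleet
```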
I wondered (not too hard) why there could be such a difference in performance, but your comment has given some good insight into that.
This particular solution doesn't sound feasible in the long term. If a significant portion of customers start using this, the benefit is nullified. And providers will probably react.
It's very similar to the "download boosters" of old which worked until they got popular which is when download sites started throttling.
Well thought out response, that is my gut feeling as well. I think packaging could have a larger than expected effect too, especially CPUs with a thermal cap.
This kind of in situ measurement might be useful to feed into a job scheduler's weights or something, but at the end of the day "the environment is different", and maybe you have to derate your cluster's expected performance to account for it?
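Something like the following (hypothetical node names and numbers, a sketch of the idea rather than any particular scheduler's API): turn measured per-node throughput into relative weights and a derated aggregate, instead of assuming nominal peak everywhere.

```python
# Hypothetical boot-time benchmark results, in GFLOP/s per node.
measured_gflops = {
    "node001": 612.0,
    "node002": 598.5,
    "node003": 571.2,
}

best = max(measured_gflops.values())
# Relative weight per node: 1.0 for the fastest, less for the slower ones.
weights = {node: round(g / best, 3) for node, g in measured_gflops.items()}
# Derated cluster capacity based on what was actually measured.
derated_total = sum(measured_gflops.values())

print(weights)
print(f"derated aggregate: {derated_total:.0f} GFLOP/s")
```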
Not just that, but also variations in the motherboard (chipset and ancillary chips), memory, timing crystals: many factors that are equally susceptible to micro-variations, which cumulatively stack up.
Then there is that lovely gotcha that unless all your components are from the same batch, you may well see small changes in things like ancillary chips upon motherboards due to chip supplies being from multiple vendors or even vendor variation.
Many variables widen the margin of error, and it would have been nice if they had at least one system in which they tested a few CPUs, to eliminate those extra variables. Ideally, look at the results they got, take the 5 slowest and 5 fastest CPUs based on those results, retest them on a single system and compare. That would have been great and would have made all the difference.
Why is it that people on this site seem to always believe that they know better than the person who took the time to do the work that is cited? The amount of "armchair expertise" here is sickening.
Do we really think that someone who is looking at this type of thing in enough detail to produce the graphs in the article wouldn't think of ambient temperature differential? REALLY?!
Something not being documented well enough for us does not mean that it wasn't considered... a thing we thought of going unmentioned does not indicate that it was never considered or tried.
This article doesn't mention power supply equivalency or anything about how many people were present in the datacenter during a given test. Here comes the patented HN 'hot take': "uh well actually they don't mention power supplies - are we sure these devices were even powered on?"
Just further lols for you to consider-- the data shows a large difference with the processors in slot 0 being significantly slower than the processors in slot 1. Clearly silicon variation /s.
> Third, the median processor performance (shown by the green, blue and red lines) between processor 0 and processor 1 on Sandy Bridge show up to 1% difference. However, that difference in median processor performance increases to up to 5% on Broadwell
This socket-to-socket variability alone explains a third of the magnitude of the difference measured in the study, which makes it quite clear we're not just measuring silicon variability.
LOL. Maybe have a coffee or find another way to chill out.
The paper itself is pretty clear with the methodology: the throughput of each processor was measured in an installed supercomputing cluster. There are obviously sources of variation in processor performance in a supercomputing cluster beyond the actual silicon plugged into the socket. The experiment has no ability to control for this variation, and no attempt was made.
The reason why we have research and research papers is so we can read about the methodology and think about the limitations of what is measured. It's an interesting measurement that the researchers made in situ; but it also has obvious limitations.
> We show that this variation is further magnified under a hardware-enforced power constraint, potentially due to the increase in number of cores, inconsistencies in the chip manufacturing process and their combined impact on processor’s energy management functionality
and recommendations:
> • Characterizing node performance based on averaged performance distorts the true impact of manufacturing variation on processors on the node and therefore should be avoided
Well, yes, they do speculate in the abstract; note the "potentially" there. It looks like the authors do believe the variation is caused by manufacturing variation, but nothing in the paper actually shows it. There's no attempt to determine the cause of the observed performance variation; it's an empirical survey.
Interestingly, that recommendation could be interpreted as: "empirical studies based on averaged node performance" give a very distorted view of "the true impact of manufacturing variation on processors".
The entire study seems to presuppose that most of what is measured is processor manufacturing variation. There's further recommendations about removing the variation with processor binning, etc.
It's an interesting set of measurements, but the assumed source of the variation is dubious, and it's not clear what, if any, actions it really supports.
I wonder how much of this is due to measurement error of temperature. Core frequency and voltage control are governed by some suspiciously round numbers like Tj(max) == 90C. But when the controller thinks Tj == 90C, what's the measurement error?
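For what it's worth, the on-die temperature that software can read on Intel parts is reported as an integer offset below TjMax, with a separate resolution field; that says nothing about sensor accuracy, but it does bound how finely the value is even reported. A minimal Linux sketch (assumes root and the msr kernel module loaded):

```python
# Read TjMax and the current digital temperature readout from core 0's MSRs.
import struct

MSR_TEMPERATURE_TARGET = 0x1A2   # bits 23:16 hold TjMax
IA32_THERM_STATUS = 0x19C        # bits 22:16 = degrees below TjMax, 30:27 = resolution

def rdmsr(cpu, reg):
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(reg)
        return struct.unpack("<Q", f.read(8))[0]

tjmax = (rdmsr(0, MSR_TEMPERATURE_TARGET) >> 16) & 0xFF
status = rdmsr(0, IA32_THERM_STATUS)
readout = (status >> 16) & 0x7F        # whole degrees below TjMax
resolution = (status >> 27) & 0xF      # reported resolution in degrees C

print(f"TjMax = {tjmax} C, current Tj ~ {tjmax - readout} C, resolution = {resolution} C")
```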
Interesting, but the discussion of "an atom here and there" affecting the performance doesn't make sense. The manufacturing variations are much larger than an atom or two. This variation is part of the motivation for "binning" of processors, testing them and then selling them at different performance levels based on how they turn out.
This isn't new. Processors are often the same template for multiple models and manufacturers "bin" based on quality and just turn off the bad parts. This is why it's more expensive to produce a nicer processor: the yield rates are much lower. There is still significant variation in models, though, known as silicon lottery. This is why some chips overclock or undervolt much better than others. There's even one site that sells chips that basically go through extra binning to ensure a better product: https://siliconlottery.com/
I'm surprised that, even at the same frequency, there is still some pretty large variation. I wonder if that's due to other sources of noise that are often ignored by a lot of people running benchmarks (e.g. background processes, SMM, ME, etc.)
e.g., memory, which presumably has its own temperature characteristics
To my knowledge (and this is based on DDR3 and older), memory frequencies are essentially fixed because the transceivers on both ends need to sample in the middle of a bit cell, and to do that they need to know the clock period, which must not change once it's known. There's a delay-locked loop (DLL) in the RAM which generates a phase-shifted local reference clock.
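Rough arithmetic (my own example numbers, not from the comment above): with double data rate, a bit cell is half the I/O clock period, so its centre sits a quarter period (90 degrees) after the clock edge; that fixed phase shift is what the DLL provides, and it only works if the period doesn't wander.

```python
# Example for DDR3-1600 (800 MHz I/O clock, data transferred on both edges).
clock_hz = 800e6                   # I/O clock
period_s = 1.0 / clock_hz          # 1.25 ns
bit_cell_s = period_s / 2          # 625 ps per bit (double data rate)
sample_offset_s = period_s / 4     # 90-degree shift lands mid bit cell

print(f"bit cell = {bit_cell_s * 1e12:.0f} ps, "
      f"sample offset = {sample_offset_s * 1e12:.0f} ps")
```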
If the processors could all be locked to one constant frequency (i.e., all the power/performance "dynamic tuning" stuff disabled), that would help show whether there are other sources of noise. This of course also assumes the clock generators are identical.
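On Linux this can be approximated through the cpufreq sysfs interface; a hedged sketch (exact behaviour depends on the driver, e.g. intel_pstate vs. acpi-cpufreq, turbo/boost has to be disabled separately, run as root, and pick a frequency your part actually supports):

```python
# Clamp min and max frequency to the same value on every cpufreq policy so
# the governor has no room to move the clock around.
import glob

TARGET_KHZ = 2_000_000   # example target: 2.0 GHz, expressed in kHz

for policy in sorted(glob.glob("/sys/devices/system/cpu/cpufreq/policy*")):
    # Write max first (valid as long as the target is above the current min),
    # then raise min up to the same value.
    with open(f"{policy}/scaling_max_freq", "w") as f:
        f.write(str(TARGET_KHZ))
    with open(f"{policy}/scaling_min_freq", "w") as f:
        f.write(str(TARGET_KHZ))
```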
Ugh... Had two PCs with the exact same spec (CPU, motherboard, RAM with the same product number, etc.), all the same versions of Linux and firmware. There was a 15% performance difference between the two machines. Turned out to be some BIOS tuning, not really related to performance. 15%...
I would say that if it made a 15% difference it _was_ related to performance... It might not have been an option on a screen specifically labelled as performance-related, but that could just be bad menu layout.
I'd be interested to know what the option was, if you have any specific memory of that?
The variances are likely to be even more noticeable nowadays as modern (x86 at least) processors do a lot more automatic overclocking based on temperatures.