> And IIRC, each Intel SMT (hyper-thread) unit has its own instruction pointer and (non-SIMD?) register set.

I believe the SMT threads share the register rename storage (the physical register file), which I'd say is the real register set, more so than the architectural registers.

But I'd say the reason they're not real cores is that they don't increase the maximum instructions per clock. For many workloads they do increase the average instructions per clock, but they don't let you do any more work if you've got a fully tuned workload that already uses all the execution resources.



In reality you never have a finely tuned load that uses all the resources all of the time. Processors spend a lot of time waiting for memory, so hyperthreading allows for better utilization of compute. It’s rare to have a workload that benefits from turning it off, and in those cases it’s usually because the hyperthread is hurting the cache hit rate enough to offset the gains.
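
To make the "waiting for memory" point concrete, here's a minimal sketch (Linux + glibc, gcc -O2 -pthread) of a latency-bound pointer chase run on one hardware thread and then on two hyper-thread siblings. The sibling CPU IDs 0 and 1 are an assumption; check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on your machine. While one sibling's load is in flight the other can issue its own, so the two-thread run should finish in roughly the same wall time as the one-thread run, i.e. nearly double the throughput out of the "same" core:

  #define _GNU_SOURCE               /* for pthread_setaffinity_np (glibc) */
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (1u << 24)              /* 16M-entry chain, far bigger than any cache */
  #define STEPS (1L << 25)

  static size_t *chain;

  /* Dependent loads: each iteration waits on the previous one, so the
     hardware thread is stalled on memory almost the entire time. */
  static void *chase(void *arg) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(*(int *)arg, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    size_t i = 0;
    for (long s = 0; s < STEPS; s++)
      i = chain[i];
    return (void *)i;               /* keep the loads from being optimized away */
  }

  static double run(int nthreads, int *cpus) {
    pthread_t th[2];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int k = 0; k < nthreads; k++)
      pthread_create(&th[k], NULL, chase, &cpus[k]);
    for (int k = 0; k < nthreads; k++)
      pthread_join(th[k], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  }

  int main(void) {
    chain = malloc(N * sizeof *chain);
    if (!chain) return 1;
    for (size_t i = 0; i < N; i++) chain[i] = i;
    for (size_t i = N - 1; i > 0; i--) {      /* Sattolo shuffle: one big cycle */
      size_t j = rand() % i;
      size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }
    int cpus[2] = {0, 1};           /* ASSUMPTION: cpu0 and cpu1 are SMT siblings */
    printf("1 thread:   %.2fs\n", run(1, cpus));
    printf("2 siblings: %.2fs\n", run(2, cpus));
    return 0;
  }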


I recently checked whether an ugly hack I implemented years ago for 30% performance gains ("halve the default OpenMP thread count at start of computation and restore it afterwards" -> effectively, disable hyperthreading) was still necessary. It's now apparently a 3x performance gain... All because I'm saturating memory bandwidth. I don't know whether to be happy or sad...
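
For reference, the hack is roughly this (a minimal sketch, gcc -O2 -fopenmp; bandwidth_bound_kernel is a made-up stand-in, and it assumes the default OpenMP thread count is 2x the physical core count, i.e. SMT enabled):

  #include <omp.h>

  void bandwidth_bound_kernel(double *dst, const double *src, long n) {
    int full = omp_get_max_threads();
    omp_set_num_threads(full > 1 ? full / 2 : 1);  /* ~one thread per physical core */

    #pragma omp parallel for
    for (long i = 0; i < n; i++)
      dst[i] = 2.0 * src[i];        /* streaming access: memory-bandwidth bound */

    omp_set_num_threads(full);      /* restore the default afterwards */
  }

Once the physical cores are already saturating the memory controllers, the second hyper-thread per core adds contention and nothing else, which is presumably where the 30% (now 3x) gap comes from.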


Ouch, looks like they’re clobbering each other’s caches. Also, well done. It looks like you optimized the hell out of your code.


Both threads are waiting on the same memory pipeline. The actual performance has always lagged behind the theory.


If one thread has to hit main memory but the other can get what it needs from L1 cache, they aren't competing for the same resources.


But they're not waiting on the same memory operations. Unless your code fully saturates the memory bandwidth, which is rare, you get some gains here.


It is almost never about saturating the memory bandwidth; it's about waiting to load instructions and data from memory. That wait time, counted in CPU cycles, is huge: a main-memory access is roughly 100 ns, which at 3 GHz is on the order of 300 cycles.


Yes, that’s precisely why hyperthreading is such a good deal.


“Such” a good deal is 2-30% in benchmarks (and they don’t say whether that’s with the side-channel mitigations turned on or off). In previous generations it was more like −20% to +20%. If one thread is already having trouble fitting in L1 and L2 cache, splitting those caches evenly with a completely different workload isn’t going to help.

If cache contention weren’t a problem, and it were just a matter of jumping to previously unseen instructions and data (cold cache), you’d expect to see 50-300% numbers from hyperthreading, precisely because of how long the stalls are.


Many years ago, a customer on a single-core computer had problems with heavy video stutter in my product. It was multithreaded and ran, as I recall, 5 threads (GUI, video decoding, 3D pipeline, device control, and computation). Others did not have this problem. After investigation, it turned out that said customer had hyperthreading disabled in the BIOS. Enabling it fixed the problem instantly.

So "real" or not but from my experience HT does work to the benefit.


I don’t know how many hardware revisions Intel went through where every benchmark said hyperthreading was slower than turning it off, but it was a lot. It became Lucy’s football at some point.

And in these days of post-Dennard scaling, you have thermal throttling, so idle cycles aren’t actually idle: they’re letting the heat sink catch up with heat production.


I agree with this stance. It's the same as branch prediction, out-of-order execution, or any of the other tricks for computing efficiently.



