
> End result: go look for AVX-512 benchmarks. You'll find them. And then ask yourself: how many of these are relevant to what I bought my PC/mac/phone for?

... do you guys not do JSON? cause we're doing json.

https://www.phoronix.com/review/simdjson-avx-512
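
(For anyone who hasn't used it, here's a minimal sketch of what that looks like in practice - it's essentially simdjson's documented quickstart, assuming its bundled twitter.json sample file. The point is that the same source code gets dispatched to the best kernel, AVX-512 included, at runtime:)

    #include <iostream>
    #include "simdjson.h"

    int main() {
      simdjson::ondemand::parser parser;
      // "twitter.json" is simdjson's sample document; any JSON file works.
      simdjson::padded_string json = simdjson::padded_string::load("twitter.json");
      simdjson::ondemand::document tweets = parser.iterate(json);
      // The library picks the widest SIMD kernel (AVX-512, AVX2, NEON, ...) at runtime.
      std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results." << std::endl;
    }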

or how about ARM emulation? anybody doing ARM development on x86? do you guys have phones, maybe?

https://www.reddit.com/r/emulation/comments/lzfpz5/what_are_...

seriously, linus is wrong on this issue, and he just keeps digging. lose the ego and admit you were spewing FUD.

avx-512 had legit problems in Skylake-SP server environments where it was running alongside non-AVX code (although you could segment servers by AVX and non-AVX if you wanted).

None of that has been true for any subsequent generations where the downclocking is essentially nil, and the offset has always been configurable for enthusiasts who get more buttons to push.

It also was never a "one single instruction triggers latency/downclocking" situation like some people think; it always took a critical mass of AVX-512 instructions (pulling down the voltage rail) before the core paused and triggered downclocking, and "lighter" operations that did less work triggered this less often.

https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...

AMD putting AVX-512 in Zen4 sealed the deal. AMD isn't making a billion-dollar bet on AVX-512 because of "benchmarksmanship", they're doing it because it's a massive usability win and a huge win for performance. Over time you will see more AVX-512 adoption than AVX2 - because it adds a bunch of usability stuff that makes it even more broadly applicable, like op-masking and scatter operations, and AVX is already very broadly applicable. Some games won't even launch without it anymore because they use it and they didn't bother to write a non-AVX fallback.
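
(Not from any of the linked articles, just a toy sketch of what op-masking and scatter look like at the intrinsics level; the function and variable names are made up, and it assumes an AVX-512F target, e.g. built with -mavx512f:)

    #include <immintrin.h>
    #include <stdint.h>

    // For 16 ints: add 1 only to the negative lanes (op-masking), then scatter
    // the results to arbitrary positions in out[] (scatter store).
    void masked_scatter_demo(const int32_t* in, const int32_t* idx, int32_t* out) {
      __m512i v      = _mm512_loadu_si512(in);
      __mmask16 neg  = _mm512_cmplt_epi32_mask(v, _mm512_setzero_si512()); // per-lane predicate
      __m512i bumped = _mm512_mask_add_epi32(v, neg, v, _mm512_set1_epi32(1));
      __m512i vindex = _mm512_loadu_si512(idx);
      _mm512_i32scatter_epi32(out, vindex, bumped, 4 /* element scale in bytes */);
    }

No branches, no gather/blend workarounds - the predication and the scattered store are single instructions, which is a big part of the usability argument.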

It's also a huge efficiency win when used in a constrained-power environment - for your 200W-per-1RU power budget, you get more performance using AVX-512 ops than scalar ones. Or you can run at a lower power budget for a fixed performance target. And you're seeing that on Zen4, now that it's not tied to Intel's 14nm bullshit (and it's probably also apparent on Ice Lake-SP, if anybody had bothered to benchmark that). That's a huge win in laptops too, where watts are battery life. Would anybody like more efficiency when parsing JSON web service responses on battery? I would.

Sorry Linus it's over, just admit you're wrong and move on.



JSON acceleration and a handful of other peripheral spot improvements aren't going to yield big whole-application speedups in most apps.

Using hand-coded accelerated SIMD kernels in specific places like a JSON codec hits Amdahl's law[1] (rough arithmetic sketch below). The achievable whole-app speedup is going to be low in most cases unless you get pervasive, performant, compiler-generated SIMD code throughout your code, e.g. produced by the JITs of managed languages.

[1] https://en.wikipedia.org/wiki/Amdahl%27s_law
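
(To make that concrete, a tiny sketch with hypothetical numbers: if a fraction p of runtime gets sped up by a factor s, the whole-app speedup is 1 / ((1 - p) + p / s). A 3x faster kernel only moves the whole app as far as that kernel's share of runtime allows:)

    #include <cstdio>

    // Amdahl's law: whole-app speedup given fraction p sped up by factor s.
    double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }

    int main() {
      // Hypothetical numbers, purely to show the shape of the curve:
      std::printf("p=0.05, s=3 -> %.2fx overall\n", amdahl(0.05, 3.0)); // ~1.03x
      std::printf("p=0.20, s=3 -> %.2fx overall\n", amdahl(0.20, 3.0)); // ~1.15x
      std::printf("p=0.80, s=3 -> %.2fx overall\n", amdahl(0.80, 3.0)); // ~2.14x
    }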


if there were big general-case speedups still possible from auto-vectorization, that juice would have been squeezed already. Even if it were only possible at runtime, the core would watch for those patterns and pull them onto the vector units.

it's like the old joke about economists: an economist is walking down the street with his friend, and the friend says "look, a 20 dollar bill lying on the ground!" and bends to pick it up. But the economist keeps walking and says "it couldn't be, or someone would already have picked it up!" That joke, but with computer architecture.

we are inherently talking about a world where that's not possible anymore; that juice has been squeezed. Throwing 10% more silicon at some specific problems and getting a 2.5-3x speedup (real-world numbers from simdjson) is better than throwing 10% more silicon at the general case and getting 1% more performance. If 10-20% of real-world code gets that 3x speedup (read: probably 2x perf/W), that's great - much better than the general-case speedups!


Depends on what abstraction level we are discussing.

For existing applications without code changes, written in mainstream languages with parallelism-restrictive semantics, I agree we're probably close to the limits.

Beyond that we know that most apps are amenable to human perf work and reformulation to get very big speedups.

Besides human reformulation expressed with only low-level parallelism primitives like SIMD intrinsics, the field has been using parallelism-geared languages like CUDA, Futhark, ISPC, etc. And there's a lot of untapped potential in data-representation flexibility that even those languages aren't exploiting, but which, for example, Halide can.

Human perf work also involves a lot of trial and error; it's a search-type process whose automation hasn't been explored that much. Some work toward automating it exists in approaches like ATLAS (Automatically Tuned Linear Algebra Software).
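
(A toy sketch of that search idea - not ATLAS's actual code, and the kernel, block sizes and names here are all made up for illustration: benchmark a few candidate parameter values on the target machine and keep the winner. Real autotuners search much larger spaces of tilings, unroll factors and kernel variants, but the measure-don't-guess loop is the same.)

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Time one pass of a blocked reduction with a given block size.
    static double run_once(const std::vector<float>& data, size_t block) {
      volatile float sink = 0.0f;  // keep the work from being optimized away
      auto t0 = std::chrono::steady_clock::now();
      for (size_t base = 0; base < data.size(); base += block) {
        float acc = 0.0f;
        const size_t end = std::min(base + block, data.size());
        for (size_t i = base; i < end; ++i) acc += data[i];
        sink = sink + acc;
      }
      auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
      std::vector<float> data(1 << 22, 1.0f);
      size_t best_block = 0;
      double best_time = 1e30;
      for (size_t block : {256, 1024, 4096, 16384}) {  // candidate parameter values
        const double t = run_once(data, block);
        std::printf("block=%zu: %.4fs\n", block, t);
        if (t < best_time) { best_time = t; best_block = block; }
      }
      std::printf("picked block=%zu\n", best_block);
    }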


> ... do you guys not do JSON? cause we're doing json.

Go ahead, benchmark how much of the runtime of the entire pipeline between browser and server consists of JSON parsing. I bet it is in the low single digits unless you're literally doing "deserialize, change a variable, serialize".

Yes, it is a use case that works, but unless you're keeping gigabytes of JSON around and analyzing it, it isn't making most users' jobs any faster; low-single-digit increases at best.

> It's also a huge efficiency win when used in a constrained-power environment - for your 200W-per-1RU power budget,

Which is not the use case Linus was talking about? He didn't argue it doesn't make sense in those rare cases where you can optimize for it and GPUs aren't an option.


We'll see.

There's now another game in town, exemplified by Apple's switch to ARM chips for its Macs. I'll keep an eye on the AVX2 vs. AVX-512 performance gap on Zen4+, but my working hypothesis is that my SIMD-handcoding time will be better spent on improving ARM support than upgrading AVX2 code to AVX-512 for the foreseeable future.


How about porting to Highway? That gets you AVX-512, NEON and SVE(2) from a single rewrite :)


I'd probably use Highway in a new project; thanks for your work on it! In my main existing project, though, Highway-like code already exists as a side effect of supporting 16-byte vectors and AVX2 simultaneously, and I'd also have to give up the buildable-as-C99 property which has occasionally simplified e.g. FFI development.


:) C99 for FFI makes sense. It's pretty common to have a C-like function as the entry point for a SIMD kernel. That means it's feasible to build only the implementation as C++, right?
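
(Roughly what that split can look like - a minimal sketch in Highway's static-dispatch style, with hypothetical file and function names, and omitting the foreach_target boilerplate you'd add for runtime dispatch. The header stays plain C; only the implementation file needs a C++ compiler:)

    // kernel.h -- plain C declaration, usable from C99 or over an FFI.
    #include <stddef.h>
    #ifdef __cplusplus
    extern "C" {
    #endif
    void vec_add(const float* a, const float* b, float* out, size_t n);
    #ifdef __cplusplus
    }
    #endif

    // kernel.cc -- only this translation unit needs C++ (and Highway).
    #include <cstddef>
    #include "hwy/highway.h"
    namespace hn = hwy::HWY_NAMESPACE;

    extern "C" void vec_add(const float* a, const float* b, float* out, size_t n) {
      const hn::ScalableTag<float> d;   // widest vector type for the compiled target
      const size_t lanes = hn::Lanes(d);
      size_t i = 0;
      for (; i + lanes <= n; i += lanes) {
        hn::StoreU(hn::Add(hn::LoadU(d, a + i), hn::LoadU(d, b + i)), d, out + i);
      }
      for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
    }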



I'm a huge simp for M1 too (and there's SVE there too). Yeah, for client stuff, if you can get people to just buy a macbook and it does their daily tasks, that's the best answer right now. Places need to start thinking about building ARM images anyway, for Ampere and Graviton and other cost-effective server environments if nothing else. If you are that glued at the hip to x86, it's time to look at solving this problem.

Apple's p-cores get the limelight but the e-cores are simply ridiculous for their size... they are 0.69mm^2 vs 1.7mm^2 for Gracemont, excluding cache. Gracemont is Intel 7, so it's a node behind, but real-world scaling is about 1.5-1.6x between 5nm and 6nm, so that works out to about 1.1mm^2 for Blizzard if it were on a 7nm-class node, for equal/better performance to Gracemont at much lower power.

https://www.reddit.com/r/hardware/comments/qlcptr/m1_pro_10c...

Sierra Forest (a bunch of nextmont cores on a server die, like Denverton) looks super interesting and I'd absolutely love to see an Apple equivalent: give me 256 blizzard cores on a chiplet and 512 or 1024 on a package. Or even just an M1 Ultra X-Serve would be fantastic (although the large GPU does go unutilized). But from what I've seen, I don't think Apple wants to get into that market.

(tangent but everyone says "Gracemont is optimized for size not efficiency!" and I don't know what that means in a practical sense. High-density cell libraries are both smaller and more efficient. So if people meant that they were using high-performance libraries that would be both bigger and less efficient (but clock higher). If it's high density it'd be smaller and more efficient but clock lower. Those two things go together. And yes everyone uses a mix of different types of cells, with high-performance cells on the timing hot-path... but "gracemont is optimized for size not efficiency" has become this meme that everyone chants and I don't know what that actually is supposed to mean. If anyone knows what that's supposed to be, please do tell.)

(also, as you can see from the size comparison... despite the "it's optimized for size" meme, Gracemont still isn't really small, not like Blizzard is small. It's using ~50% more transistors to get to the same place, and it's almost half the size of a full Zen3 core with SMT and all the bells and whistles... I really think e-cores are where the music stops for the x86 party. I think i-cache and decoders are fine on the big cores, but as you scale downwards they take up a larger and larger portion of the core area that remains... it's Amdahl's law in action with area: if i-cache and decoding don't scale, then shrinking the core increases the fraction devoted to i-cache/decoding. And if you do shrink them, you pay more penalty for x86-ness in other places, like having to run the decoder. And you have to keep the i-cache running at all times, even when the chip is idling, otherwise you are decoding a lot more. It's just a lot of power overhead for the things you use an e-core for.)



