
> End result: go look for AVX-512 benchmarks. You'll find them. And then ask yourself: how many of these are relevant to what I bought my PC/mac/phone for?

... do you guys not do JSON? cause we're doing json.

https://www.phoronix.com/review/simdjson-avx-512
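
(For anyone who hasn't used it, here's a minimal sketch of what that looks like in practice - it's essentially simdjson's documented quickstart, assuming its bundled twitter.json sample file. The point is that the same source code gets dispatched to the best kernel, AVX-512 included, at runtime:)

    #include <iostream>
    #include "simdjson.h"

    int main() {
      simdjson::ondemand::parser parser;
      // "twitter.json" is simdjson's sample document; any JSON file works.
      simdjson::padded_string json = simdjson::padded_string::load("twitter.json");
      simdjson::ondemand::document tweets = parser.iterate(json);
      // The library picks the widest SIMD kernel (AVX-512, AVX2, NEON, ...) at runtime.
      std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results." << std::endl;
    }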

or how about ARM emulation? anybody doing ARM development on x86? do you guys have phones, maybe?

https://www.reddit.com/r/emulation/comments/lzfpz5/what_are_...

seriously, linus is wrong on this issue, and he just keeps digging. lose the ego and admit you were spewing FUD.

avx-512 had legit problems in Skylake-SP server environments where it was running alongside non-AVX code (although you could segment servers by AVX and non-AVX if you wanted).

None of that has been true for any subsequent generations where the downclocking is essentially nil, and the offset has always been configurable for enthusiasts who get more buttons to push.

It also was never a "one single instruction triggers latency/downclocking" situation like some people think; it always took a critical mass of AVX-512 instructions (pulling down the voltage rail) before the core paused and triggered downclocking, and "lighter" operations that did less work triggered this less often.

https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...

AMD putting AVX-512 in Zen4 sealed the deal. AMD isn't making a billion-dollar bet on AVX-512 because of "benchmarksmanship", they're doing it because it's a massive usability win and a huge win for performance. Over time you will see more AVX-512 adoption than AVX2 - because it adds a bunch of usability stuff that makes it even more broadly applicable, like op-masking and scatter operations, and AVX is already very broadly applicable. Some games won't even launch without it anymore because they use it and they didn't bother to write a non-AVX fallback.
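
(Not from any of the linked articles, just a toy sketch of what op-masking and scatter look like at the intrinsics level; the function and variable names are made up, and it assumes an AVX-512F target, e.g. built with -mavx512f:)

    #include <immintrin.h>
    #include <stdint.h>

    // For 16 ints: add 1 only to the negative lanes (op-masking), then scatter
    // the results to arbitrary positions in out[] (scatter store).
    void masked_scatter_demo(const int32_t* in, const int32_t* idx, int32_t* out) {
      __m512i v      = _mm512_loadu_si512(in);
      __mmask16 neg  = _mm512_cmplt_epi32_mask(v, _mm512_setzero_si512()); // per-lane predicate
      __m512i bumped = _mm512_mask_add_epi32(v, neg, v, _mm512_set1_epi32(1));
      __m512i vindex = _mm512_loadu_si512(idx);
      _mm512_i32scatter_epi32(out, vindex, bumped, 4 /* element scale in bytes */);
    }

No branches, no gather/blend workarounds - the predication and the scattered store are single instructions, which is a big part of the usability argument.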

It's also a huge efficiency win when used in a constrained-power environment - for your 200W-per-1RU power budget, you get more performance using AVX-512 ops than scalar ones. Or you can run at a lower power budget for a fixed performance target. And you're seeing that on Zen4, now that it's not tied to Intel's 14nm bullshit (and it's probably also apparent on Ice Lake-SP, if anybody had bothered to benchmark that). That's a huge win in laptops too, where watts are battery life. Would anybody like more efficiency when parsing JSON web service responses on battery? I would.

Sorry Linus it's over, just admit you're wrong and move on.



JSON acceleration and a handful of other peripheral spot improvements aren't going to yield big whole-application speedups in most apps.

Using hand-coded accelerated SIMD kernels in specific places like a JSON codec hits Amdahl's law[1] (rough arithmetic sketch below). The achievable whole-app speedup is going to be low in most cases unless you get pervasive, performant, compiler-generated SIMD code throughout your code, e.g. produced by the JITs of managed languages.

[1] https://en.wikipedia.org/wiki/Amdahl%27s_law
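
(To make that concrete, a tiny sketch with hypothetical numbers: if a fraction p of runtime gets sped up by a factor s, the whole-app speedup is 1 / ((1 - p) + p / s). A 3x faster kernel only moves the whole app as far as that kernel's share of runtime allows:)

    #include <cstdio>

    // Amdahl's law: whole-app speedup given fraction p sped up by factor s.
    double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }

    int main() {
      // Hypothetical numbers, purely to show the shape of the curve:
      std::printf("p=0.05, s=3 -> %.2fx overall\n", amdahl(0.05, 3.0)); // ~1.03x
      std::printf("p=0.20, s=3 -> %.2fx overall\n", amdahl(0.20, 3.0)); // ~1.15x
      std::printf("p=0.80, s=3 -> %.2fx overall\n", amdahl(0.80, 3.0)); // ~2.14x
    }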


if there were big general-case speedups still possible from auto-vectorization, that juice would have been squeezed already. Even if it were only possible at runtime, the core would watch for those patterns and pull them onto the vector units.

it's like the old joke about economists: an economist is walking down the street with his friend, and the friend says "look, a 20 dollar bill lying on the ground!" and bends to pick it up. But the economist keeps walking and says "it couldn't be, or someone would already have picked it up!" That joke, but with computer architecture.

we are inherently talking about a world where that's not possible anymore; that juice has been squeezed. Throwing 10% more silicon at some specific problems and getting a 2.5-3x speedup (real-world numbers from simdjson) is better than throwing 10% more silicon at the general case and getting 1% more performance. If 10-20% of real-world code gets that 3x speedup (read: probably 2x perf/W), that's great - much better than the general-case speedups!


Depends on what abstraction level we are discussing.

For existing applications without code changes, written in mainstream languages with parallelism-restrictive semantics, I agree we're probably close to the limits.

Beyond that we know that most apps are amenable to human perf work and reformulation to get very big speedups.

Besides human reformulation expressed with only low-level parallelism primitives like SIMD intrinsics, the field has been using parallelism-geared languages like CUDA, Futhark, ISPC, etc. And there's a lot of untapped potential in data-representation flexibility that even those languages aren't exploiting, but which, for example, Halide can.

Human perf work also involves a lot of trial and error; it's a search-type process whose automation hasn't been explored that much. Some work toward automating it exists in approaches like ATLAS (Automatically Tuned Linear Algebra Software).
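
(A toy sketch of that search idea - not ATLAS's actual code, and the kernel, block sizes and names here are all made up for illustration: benchmark a few candidate parameter values on the target machine and keep the winner. Real autotuners search much larger spaces of tilings, unroll factors and kernel variants, but the measure-don't-guess loop is the same.)

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Time one pass of a blocked reduction with a given block size.
    static double run_once(const std::vector<float>& data, size_t block) {
      volatile float sink = 0.0f;  // keep the work from being optimized away
      auto t0 = std::chrono::steady_clock::now();
      for (size_t base = 0; base < data.size(); base += block) {
        float acc = 0.0f;
        const size_t end = std::min(base + block, data.size());
        for (size_t i = base; i < end; ++i) acc += data[i];
        sink = sink + acc;
      }
      auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
      std::vector<float> data(1 << 22, 1.0f);
      size_t best_block = 0;
      double best_time = 1e30;
      for (size_t block : {256, 1024, 4096, 16384}) {  // candidate parameter values
        const double t = run_once(data, block);
        std::printf("block=%zu: %.4fs\n", block, t);
        if (t < best_time) { best_time = t; best_block = block; }
      }
      std::printf("picked block=%zu\n", best_block);
    }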


> ... do you guys not do JSON? cause we're doing json.

Go ahead, benchmark how much of the runtime of the entire pipeline between browser and server consists of JSON parsing. I bet it is in the low single digits unless you're literally doing "deserialize, change a variable, serialize".

Yes, it is a use case that works, but unless you're keeping gigabytes of JSON around and analyzing it, it isn't making most users' jobs any faster; low-single-digit increases at best.

> It's also a huge efficiency win when used in a constrained-power environment - for your 200W-per-1RU power budget,

Which is not the use case Linus was talking about? He didn't argue it doesn't make sense in those rare cases where you can optimize for it and GPUs aren't an option.


We'll see.

There's now another game in town, exemplified by Apple's switch to ARM chips for its Macs. I'll keep an eye on the AVX2 vs. AVX-512 performance gap on Zen4+, but my working hypothesis is that my SIMD-handcoding time will be better spent on improving ARM support than upgrading AVX2 code to AVX-512 for the foreseeable future.


How about porting to Highway? That gets you AVX-512, NEON and SVE(2) from a single rewrite :)


I'd probably use Highway in a new project; thanks for your work on it! In my main existing project, though, Highway-like code already exists as a side effect of supporting 16-byte vectors and AVX2 simultaneously, and I'd also have to give up the buildable-as-C99 property which has occasionally simplified e.g. FFI development.


:) C99 for FFI makes sense. It's pretty common to have a C-like function as the entry point for a SIMD kernel. That means it's feasible to build only the implementation as C++, right?
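
(Roughly what that split can look like - a minimal sketch in Highway's static-dispatch style, with hypothetical file and function names, and omitting the foreach_target boilerplate you'd add for runtime dispatch. The header stays plain C; only the implementation file needs a C++ compiler:)

    // kernel.h -- plain C declaration, usable from C99 or over an FFI.
    #include <stddef.h>
    #ifdef __cplusplus
    extern "C" {
    #endif
    void vec_add(const float* a, const float* b, float* out, size_t n);
    #ifdef __cplusplus
    }
    #endif

    // kernel.cc -- only this translation unit needs C++ (and Highway).
    #include <cstddef>
    #include "hwy/highway.h"
    namespace hn = hwy::HWY_NAMESPACE;

    extern "C" void vec_add(const float* a, const float* b, float* out, size_t n) {
      const hn::ScalableTag<float> d;   // widest vector type for the compiled target
      const size_t lanes = hn::Lanes(d);
      size_t i = 0;
      for (; i + lanes <= n; i += lanes) {
        hn::StoreU(hn::Add(hn::LoadU(d, a + i), hn::LoadU(d, b + i)), d, out + i);
      }
      for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
    }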



I'm a huge simp for M1 too (and there's SVE there too). Yeah, for client stuff, if you can get people to just buy a macbook and it does their daily tasks, that's the best answer right now. Places need to start thinking about building ARM images anyway, for Ampere and Graviton and other cost-effective server environments if nothing else. If you are that glued at the hip to x86, it's time to look at solving this problem.

Apple's p-cores get the limelight but the e-cores are simply ridiculous for their size... they are 0.69mm^2 vs 1.7mm^2 for Gracemont, excluding cache. Gracemont is Intel 7, so it's a node behind, but real-world scaling is about 1.5-1.6x between 5nm and 6nm, so that works out to about 1.1mm^2 for Blizzard if it were on a 7nm-class node, for equal/better performance to Gracemont at much lower power.

https://www.reddit.com/r/hardware/comments/qlcptr/m1_pro_10c...

Sierra Forest (a bunch of nextmont cores on a server die, like Denverton) looks super interesting and I'd absolutely love to see an Apple equivalent: give me 256 blizzard cores on a chiplet and 512 or 1024 on a package. Or even just an M1 Ultra X-Serve would be fantastic (although the large GPU does go unutilized). But from what I've seen, I don't think Apple wants to get into that market.

(tangent but everyone says "Gracemont is optimized for size not efficiency!" and I don't know what that means in a practical sense. High-density cell libraries are both smaller and more efficient. So if people meant that they were using high-performance libraries that would be both bigger and less efficient (but clock higher). If it's high density it'd be smaller and more efficient but clock lower. Those two things go together. And yes everyone uses a mix of different types of cells, with high-performance cells on the timing hot-path... but "gracemont is optimized for size not efficiency" has become this meme that everyone chants and I don't know what that actually is supposed to mean. If anyone knows what that's supposed to be, please do tell.)

(also, as you can see from the size comparison... despite the "it's optimized for size" meme, Gracemont still isn't really small, not like Blizzard is small. It's using ~50% more transistors to get to the same place, and it's almost half the size of a full Zen3 core with SMT and all the bells and whistles... I really think e-cores are where the music stops for the x86 party. I think i-cache and decoders are fine on the big cores, but as you scale downwards they take up a larger and larger portion of the core area that remains... it's Amdahl's law in action with area: if i-cache and decoding don't scale, then shrinking the core increases the fraction devoted to i-cache/decoding. And if you do shrink them, you pay more penalty for x86-ness in other places, like having to run the decoder. And you have to keep the i-cache running at all times, even when the chip is idling, otherwise you are decoding a lot more. It's just a lot of power overhead for the things you use an e-core for.)



