Why is a large speedup from vectors surprising? Considering that the energy required to schedule/dispatch an instruction on OoO cores dwarfs that of the actual operation (add/mul, etc.), amortizing that overhead over multiple elements (i.e., SIMD) is an obvious win.
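To make the amortization concrete, here's a minimal sketch (mine, not from the thread), assuming AVX2: one vector add carries 8 float additions through the front end for roughly the same fetch/decode/dispatch cost as a single scalar add.

    #include <immintrin.h>   // AVX2 intrinsics
    #include <cstddef>

    // Scalar: one add instruction dispatched per element.
    void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }

    // AVX2: the same front-end/dispatch cost covers 8 elements per iteration.
    void add_avx2(const float* a, const float* b, float* out, std::size_t n) {
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; ++i)   // scalar tail for the remainder
            out[i] = a[i] + b[i];
    }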
My question is whether Intel investing in AVX-512 is wise, given that:
- Most existing code is not aware of AVX anyway;
- Developers are especially wary of AVX-512, since they expect it to be discontinued soon.
Consequently, wouldn't Intel be better off using the silicon dedicated to AVX-512 to speed up instruction patterns that are actually used?
AVX-512 is not going to be discontinued. Intel's reluctance/struggles with offering it on desktop are irritating, but it's here to stay on servers for a long time.
Writing code for a specific SIMD instruction set is non-trivial, but most code will get some benefit just by being compiled for the right ISA. You don't get the really fancy instructions, because the pattern matching in the compiler isn't very intelligent, but quite a lot of stuff is going to benefit "by magic".
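A rough illustration of the "by magic" part (my sketch, assuming GCC or Clang): a plain loop with no intrinsics, which the auto-vectorizer will typically turn into AVX2 or AVX-512 code depending on the -march target.

    #include <cstddef>

    // Elementwise update with no loop-carried dependence: an easy target
    // for the auto-vectorizer. Built with -O3 -march=haswell this becomes
    // an AVX2 loop; with -O3 -march=skylake-avx512, an AVX-512 one.
    void saxpy(float a, const float* __restrict x, float* __restrict y,
               std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

No source changes are needed between the two builds; only the compiler flags differ.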
Even without cutting off people whose CPUs lack some level of AVX, you can have a fast/slow path fairly easily.
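One low-effort way to get that fast/slow path, sketched below assuming GCC (or recent Clang) and its target_clones extension; the exact ISA names accepted may vary by compiler version. The compiler emits one clone per listed target and an ifunc resolver picks the best supported one at load time, so callers just call the function.

    #include <cstddef>

    // One source function, several machine-code clones; the dynamic loader
    // selects the best one supported by the running CPU.
    __attribute__((target_clones("avx512f", "avx2", "default")))
    void scale(float* x, float s, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            x[i] *= s;   // vectorized differently in each clone
    }

If you'd rather dispatch by hand, __builtin_cpu_supports("avx512f") gives you the same runtime check explicitly.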
My point is that vector instructions are fundamentally necessary and thus "what does it signal" evaluates to "nothing surprising".
Sure, REP STOSB/MOVSB make for a very compact memset/memcpy, but their performance varies depending on CPU feature flags, so you're going to want multiple codepaths anyway.
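As an example of the feature-flag check (my sketch, following the CPUID layout documented in Intel's SDM): the ERMS bit is what tells a memcpy/memset implementation whether the rep movsb path is worth preferring over a vector loop.

    #include <cpuid.h>   // GCC/Clang helper for the CPUID instruction

    // ERMS ("Enhanced REP MOVSB/STOSB") is CPUID leaf 7, subleaf 0, EBX bit 9.
    // A memcpy/memset implementation can branch on this once at startup to
    // pick the rep movsb path or a SIMD path.
    static bool has_erms() {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return false;
        return (ebx >> 9) & 1u;
    }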
And vector instructions are vastly more flexible than just those two.
Also, I have not met developers who expect AVX-512 to be discontinued (the regrettable ADL situation notwithstanding; that's not a server CPU). AMD is actually adding AVX-512.
Anyone using software that benefits from vector instructions. That includes a variety of compression, search, and image processing algorithms. Your JPEG decompression library might be using SSE2 or Neon. All high-end processors have included some form of vector instructions for 20+ years now. Even the processor in my old eBook reader has ARM Neon instructions.
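To give a flavor of the "search" case (a rough sketch of mine, not any particular library's code): with SSE2 you can test 16 bytes per comparison in a memchr-style scan.

    #include <emmintrin.h>   // SSE2
    #include <cstddef>

    // memchr-style scan: compare 16 bytes at a time, then locate the first
    // matching lane from the comparison bitmask. Remainder handled scalar.
    const char* find_byte(const char* p, std::size_t n, char c) {
        const __m128i needle = _mm_set1_epi8(c);
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i chunk = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(p + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
            if (mask)
                return p + i + __builtin_ctz(mask);
        }
        for (; i < n; ++i)
            if (p[i] == c) return p + i;
        return nullptr;
    }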
Why would it be irrelevant? Even the paucity of availability isn't really a problem - the big winners here are server users in data centers, not desktops or laptops. How much string parsing and munging is happening right now just ingesting big datasets? If running a specially optimized function set on part of your fleet reduces utilization, that's a direct cost saving you realize. And if AMD then widens that support base, expanding usage as you scale up becomes even more attractive.
Given that Intel's AVX extensions could cause silent failures on servers (very high workloads for prolonged periods, compared to end-user computers), I'm not sure it would be a big win for servers either: https://arxiv.org/pdf/2102.11245.pdf
I'm downvoting you because the assertion you're implying (that use of AVX increases soft failure rates more than non-AVX instructions would) is not supported by the source you cite.
Indeed, I'd summarise that source as "At Facebook sometimes weird stuff happens. We postulate it's not because of all the buggy code written by Software Engineers like us, it must be hardware. As well as lots of speculation about hypothetical widespread problems that would show we're actually not writing buggy software, here's a single concrete example where it was hardware".
If anything I'd say that Core 59 is one of those exceptions that prove the rule. This is such a rare phenomenon that when it does happen you can do the work to pin it down and say yup, this CPU is busted - if it was really commonplace you'd constantly trip over these bugs and get nowhere. There probably isn't really, as that paper claims, a "systemic issue across generations" except that those generations are all running Facebook's buggy code.
One interesting anecdote is that HPC planning for exascale included significant concern about machine failures and (silent) data corruption. When running at large enough scale, even seemingly small failure rates translate into "oh, there goes another one".