Even manual vectorization is a pain... writing ASM, really?
Rust has unstable portable SIMD and a few third-party crates; C++ has the same; C# has stable portable SIMD plus a small out-of-the-box BLAS-like library that covers the most common tasks (SoftMax, Magnitude, etc. over spans of floats) so you don't have to write them by hand; hell, it even exercises PackedSIMD when run in a browser. And now Java is getting Panama vectors some time in the future (though the question of codegen quality remains open given the planned changes to the unsafe API).
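For a sense of what these portable APIs look like in practice, here is a minimal dot-product sketch against Java's incubating Panama API (the dot-product use case is my own illustration, not from the thread, and it needs --add-modules jdk.incubator.vector to compile):

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class Dot {
        // SPECIES_PREFERRED resolves at runtime to the widest shape the
        // hardware supports (128/256/512-bit), which is what makes the
        // code "portable".
        static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

        static float dot(float[] a, float[] b) {
            FloatVector acc = FloatVector.zero(S);
            int i = 0;
            for (; i < S.loopBound(a.length); i += S.length()) {
                FloatVector va = FloatVector.fromArray(S, a, i);
                FloatVector vb = FloatVector.fromArray(S, b, i);
                acc = va.fma(vb, acc); // acc += a * b, lane-wise
            }
            float sum = acc.reduceLanes(VectorOperators.ADD);
            for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
            return sum;
        }
    }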
Go among these is uniquely disadvantaged. And if that's not enough, you may want to visit the 1BRC challenge discussions and see that Go struggles to get anywhere close to the 2s mark while both C# and C++ blaze past it.
Panama vectors are extremely disappointing. ByteVector.rearrange in particular takes something like 10ns and is the only available way to express vpshufb, an instruction that takes 1 cycle. Operations like andnot don't simply use the andnot instruction. Converting a 32-wide vector that the type system thinks is a mask into a vector costs a blend when it should cost zero instructions. Fixed rearranges like packus are missing. Arithmetic operations that aren't simple lane-wise operations, such as maddubs, are missing. aesenc is missing. Non-temporal stores and non-temporal prefetches are missing (there is a non-temporal load instruction, but apparently it behaves no differently from a normal load, so if you want to move data into L1d while skipping the other caches you have to use the prefetch).
Sure, in a few weeks I will post on the mailing list about how lots of the stuff one wants to do with vectors is many times slower because of these issues, and we'll see whether or not they end up adding ByteVector.multiplySignedWithUnsignedGivingShortsAndAddPairsOfAdjacentShorts so that people can write decimal parsers.
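To make the rearrange complaint concrete, here is roughly what the vpshufb-shaped operation looks like through the portable API; the byte-reversal use case is just my illustration, not code from the thread:

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorShuffle;
    import jdk.incubator.vector.VectorSpecies;

    public class ShuffleDemo {
        static final VectorSpecies<Byte> S = ByteVector.SPECIES_128;

        // Reverse the 16 bytes of a block. With a constant control this
        // is a single vpshufb on x86; through the portable API it goes
        // via a VectorShuffle object, which (per the complaint above)
        // costs far more than 1 cycle.
        static void reverse16(byte[] src, byte[] dst) {
            int[] idx = new int[S.length()];
            for (int i = 0; i < idx.length; i++) idx[i] = idx.length - 1 - i;
            VectorShuffle<Byte> control = VectorShuffle.fromArray(S, idx, 0);
            ByteVector.fromArray(S, src, 0).rearrange(control).intoArray(dst, 0);
        }
    }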
Is it really worth the trouble if you're not building on top of something like LLVM, which already has a vectorizer? We're still waiting for the mythical sufficiently smart vectorizer; even the better ones are extremely brittle, and any serious high-performance work still uses explicit SIMD rather than trying to coax the vectorizer into cooperating.
I'd rather see new languages focus on better explicit SIMD abstractions a la Intel's ISPC than on writing yet another magic vectorizer that only works in trivial cases.
Any of the polyhedral frameworks is reasonably good at splitting loop nests into parallelizable ones.
Then it's just a codegen problem.
But yes, ultimately, the user needs to be aware of how the language works, what is parallelizable and what isn't, and of the cost of the operations that they ask their computer to execute.
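To make the loop-splitting point concrete, here is a hand-worked sketch (in Java, my own illustration) of the classic transformation involved, loop distribution:

    // Original nest: the b[i] statement has no loop-carried dependence;
    // the c[i] statement depends on the previous iteration.
    static void original(float[] a, float[] b, float[] c, int n) {
        for (int i = 1; i < n; i++) {
            b[i] = a[i] * 2f;
            c[i] = c[i - 1] + b[i];
        }
    }

    // After distribution: legal because every b[i] is written before any
    // read of it, and the c chain keeps its order. The first loop is now
    // trivially parallelizable/vectorizable; the second stays a
    // sequential scan.
    static void distributed(float[] a, float[] b, float[] c, int n) {
        for (int i = 1; i < n; i++) b[i] = a[i] * 2f;
        for (int i = 1; i < n; i++) c[i] = c[i - 1] + b[i];
    }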
I never do this kind of work, so I can’t say. But if I did, I imagine I’d want more control. I mean, perf improvements are welcome in all code, but if I need a piece of code to have a specific optimization I’d rather opt in through language constructs, so that the compiler (or other tooling) can tell me when it breaks. A well-designed API with adapters to and from regular code would be better, no?
For instance, imagine I’ve auto-optimized something and I check the asm (manually, mind you) and all is good. Then someone changes the algorithm slightly, or another engineer adds a layer of indirection for some unrelated purpose, or the compiler updates its code paths and misses some cases that were previously supported. And the optimization goes away silently.
Do you have the same view of other compiler optimizations? Would you prefer if the compiler never unrolled a loop so that you can write it out manually when you need it?
No, I wasn’t saying (or at least didn’t mean) that the compiler shouldn’t optimize automatically. I meant that ensuring certain paths stay optimized can be important when you need it, and that language constructs would be a good way to achieve that.