Even manual vectorization is pain...writing ASM, really?
Rust has unstable portable SIMD and a few third-party crates, C++ has that as well, C# has stable portable SIMD and a small out of box BLAS-like library to help with most common tasks (like SoftMax, Magnitude and etc. on top of spans of floats over writing manually), hell it even exercises PackedSIMD when ran in a browser. And now Java is getting Panama vectors some time in the future (though the question of codegen quality stands open given planned changes to unsafe API).
Go among these is uniquely disadvantaged. And if that's not enough, you may want to visit 1Brc's challenge discussions and see that Go struggles to get anywhere close to 2s mark with both C# and C++ blazing past it:
Panama vectors are extremely disappointing. ByteVector.rearrange in particular takes like 10ns and is the only available way to implement vpshufb, an instruction that takes 1 cycle. Operations like andnot don't just use the andnot instruction. Converting a 32-wide vector that the type system thinks is a mask into a vector uses a blend instead of using 0 instructions. Fixed rearranges like packus are missing. Arithmetic operations that are not simple lane-wise operations like maddubs are missing. aesenc is missing. Non-temporal stores and non-temporal prefetches are missing (there is a non-temporal load instruction but apparently it doesn't do anything differently from a normal load, so if you want to move data to L1d skipping other caches you have to use the prefetch).
Sure, in a few weeks I will post on the mailing list about how lots of stuff one wants to do with vectors is many times slower because of these issues, and we'll see if they end up adding ByteVector.multiplySignedWithUnsignedGivingShortsAndAddPairsOfAdjacentShorts so that people can write decimal parsers or not.
Rust has unstable portable SIMD and a few third-party crates, C++ has that as well, C# has stable portable SIMD and a small out of box BLAS-like library to help with most common tasks (like SoftMax, Magnitude and etc. on top of spans of floats over writing manually), hell it even exercises PackedSIMD when ran in a browser. And now Java is getting Panama vectors some time in the future (though the question of codegen quality stands open given planned changes to unsafe API).
Go among these is uniquely disadvantaged. And if that's not enough, you may want to visit 1Brc's challenge discussions and see that Go struggles to get anywhere close to 2s mark with both C# and C++ blazing past it:
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
https://github.com/gunnarmorling/1brc/discussions/67