Even manual vectorization is a pain... writing ASM, really?
Rust has unstable portable SIMD and a few third-party crates; C++ has the same; C# has stable portable SIMD plus a small out-of-the-box BLAS-like library that covers the most common tasks (SoftMax, Magnitude, etc. over spans of floats) so you don't have to write them by hand; hell, it even exercises PackedSIMD when run in a browser. And now Java is getting Panama vectors some time in the future (though the question of codegen quality remains open given the planned changes to the unsafe API).
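For a sense of what these portable APIs look like in practice, here is a minimal dot-product sketch against Java's incubating Panama API (the dot-product use case is my own illustration, not from the thread, and it needs --add-modules jdk.incubator.vector to compile):

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class Dot {
        // SPECIES_PREFERRED resolves at runtime to the widest shape the
        // hardware supports (128/256/512-bit), which is what makes the
        // code "portable".
        static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

        static float dot(float[] a, float[] b) {
            FloatVector acc = FloatVector.zero(S);
            int i = 0;
            for (; i < S.loopBound(a.length); i += S.length()) {
                FloatVector va = FloatVector.fromArray(S, a, i);
                FloatVector vb = FloatVector.fromArray(S, b, i);
                acc = va.fma(vb, acc); // acc += a * b, lane-wise
            }
            float sum = acc.reduceLanes(VectorOperators.ADD);
            for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
            return sum;
        }
    }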
Go among these is uniquely disadvantaged. And if that's not enough, you may want to visit the 1BRC challenge discussions and see that Go struggles to get anywhere close to the 2s mark while both C# and C++ blaze past it.
Panama vectors are extremely disappointing. ByteVector.rearrange in particular takes something like 10ns and is the only available way to express vpshufb, an instruction that takes 1 cycle. Operations like andnot don't simply use the andnot instruction. Converting a 32-wide vector that the type system thinks is a mask into a vector costs a blend when it should cost zero instructions. Fixed rearranges like packus are missing. Arithmetic operations that aren't simple lane-wise operations, such as maddubs, are missing. aesenc is missing. Non-temporal stores and non-temporal prefetches are missing (there is a non-temporal load instruction, but apparently it behaves no differently from a normal load, so if you want to move data into L1d while skipping the other caches you have to use the prefetch).
Sure, in a few weeks I will post on the mailing list about how lots of the stuff one wants to do with vectors is many times slower because of these issues, and we'll see whether or not they end up adding ByteVector.multiplySignedWithUnsignedGivingShortsAndAddPairsOfAdjacentShorts so that people can write decimal parsers.
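To make the rearrange complaint concrete, here is roughly what the vpshufb-shaped operation looks like through the portable API; the byte-reversal use case is just my illustration, not code from the thread:

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorShuffle;
    import jdk.incubator.vector.VectorSpecies;

    public class ShuffleDemo {
        static final VectorSpecies<Byte> S = ByteVector.SPECIES_128;

        // Reverse the 16 bytes of a block. With a constant control this
        // is a single vpshufb on x86; through the portable API it goes
        // via a VectorShuffle object, which (per the complaint above)
        // costs far more than 1 cycle.
        static void reverse16(byte[] src, byte[] dst) {
            int[] idx = new int[S.length()];
            for (int i = 0; i < idx.length; i++) idx[i] = idx.length - 1 - i;
            VectorShuffle<Byte> control = VectorShuffle.fromArray(S, idx, 0);
            ByteVector.fromArray(S, src, 0).rearrange(control).intoArray(dst, 0);
        }
    }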
Is it really worth the trouble if you're not building on top of something like LLVM, which already has a vectorizer? We're still waiting for the mythical sufficiently smart vectorizer; even the better ones are extremely brittle, and any serious high-performance work still uses explicit SIMD rather than trying to coax the vectorizer into cooperating.
I'd rather see new languages focus on better explicit SIMD abstractions a la Intel's ISPC than on writing yet another magic vectorizer that only works in trivial cases.
Any of the polyhedral frameworks is reasonably good at splitting loop nests into parallelizable ones.
Then it's just a codegen problem.
But yes, ultimately, the user needs to be aware of how the language works, what is parallelizable and what isn't, and of the cost of the operations that they ask their computer to execute.
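To make the loop-splitting point concrete, here is a hand-worked sketch (in Java, my own illustration) of the classic transformation involved, loop distribution:

    // Original nest: the b[i] statement has no loop-carried dependence;
    // the c[i] statement depends on the previous iteration.
    static void original(float[] a, float[] b, float[] c, int n) {
        for (int i = 1; i < n; i++) {
            b[i] = a[i] * 2f;
            c[i] = c[i - 1] + b[i];
        }
    }

    // After distribution: legal because every b[i] is written before any
    // read of it, and the c chain keeps its order. The first loop is now
    // trivially parallelizable/vectorizable; the second stays a
    // sequential scan.
    static void distributed(float[] a, float[] b, float[] c, int n) {
        for (int i = 1; i < n; i++) b[i] = a[i] * 2f;
        for (int i = 1; i < n; i++) c[i] = c[i - 1] + b[i];
    }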
I never do this kind of work, so I can’t say. But if I did, I imagine I’d want more control. I mean, perf improvements are welcome in all code, but if I need a piece of code to have a specific optimization I’d rather opt in through language constructs, so that the compiler (or other tooling) can tell me when it breaks. A well-designed API with adapters to and from regular code would be better, no?
For instance, imagine I’ve auto-optimized something and I check the asm (manually, mind you) and all is good. Then someone changes the algorithm slightly, or another engineer adds a layer of indirection for some unrelated purpose, or the compiler updates its code paths and misses some cases that were previously supported. And the optimization goes away silently.
Do you have the same view of other compiler optimizations? Would you prefer if the compiler never unrolled a loop so that you can write it out manually when you need it?
No, I wasn’t saying (or at least didn’t mean) that the compiler shouldn’t optimize automatically. I meant that ensuring certain paths stay optimized can be important when you need it, and that language constructs would be a good way to achieve that.