From slow to SIMD: A Go optimization story (sourcegraph.com)
254 points by rbanffy on Jan 23, 2024 | 104 comments



> why is it significant that we slice like a[i:i+4:i+4] rather than just a[i:i+4]?

Well, I had never seen that "full slice" expression syntax before. It turns out that it's important because it controls the capacity of the new slice: the capacity becomes (i+4) - i = 4.

So by using the full slice expression you get a slice of length 4 and capacity 4. Without doing this, the capacity would run to the end of the original slice's backing array, i.e. cap(a) - i.

I suppose that by controlling the capacity you eliminate the bounds check.
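
For anyone who wants to see the difference concretely, here is a minimal sketch of what the two forms do to len and cap (my own example of the spec'd behavior, not code from the article):

    package main

    import "fmt"

    func main() {
        a := make([]float32, 8)

        s1 := a[2:6]   // len 4, cap 6: capacity extends to the end of a
        s2 := a[2:6:6] // len 4, cap 4: the third index limits the capacity

        fmt.Println(len(s1), cap(s1)) // 4 6
        fmt.Println(len(s2), cap(s2)) // 4 4

        // Appending to s1 silently writes into a[6]; appending to s2 must reallocate.
        _ = append(s1, 1)
        _ = append(s2, 1)
        fmt.Println(a[6]) // 1
    }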


In my testing [1] that doesn't eliminate bounds checks. Instead, it avoids computing the otherwise unused value `cap(a[i:i+4]) = len(a) - i`, if my reading is correct.

[1] https://go.godbolt.org/z/63n6hTGGq (original) vs. https://go.godbolt.org/z/YYPrzjxP5 (capacity not limited)

> Well I had never seen that "full slice" expression syntax before.

Go's notion of capacity is somewhat pragmatic but at the same time confusing as well. I learned the hard way that the excess capacity is always available for the sake of optimization:

    a := []int{1, 2, 3, 4, 5}
    lo, hi := a[:2], a[2:]
    lo = append(lo, 6, 7, 8)      // Oops, it reuses the spare capacity lo[2:5], which aliases hi!
    fmt.Printf("%v %v\n", lo, hi) // Prints `[1 2 6 7 8] [6 7 8]`
While I do understand the rationale, it is too unintuitive because there is no indication of the excess capacity in this code. I would prefer `a[x:y]` to be a shorthand for `a[x:y:y]` instead. The `a[x:y:len(a)]` case is of course useful though, so maybe a different shorthand like `a[x:y:$]` can be added.


I think slices are wonderfully intuitive... if you've worked with C.

Slices encapsulate a ubiquitous C pattern, where you pass to a function a pointer (to an initial array element) and a length or capacity (sometimes both). This pattern is directly encapsulated by Go's slices, which can be thought of as something like:

  type Slice struct {
      Ptr *Elem
      Len int
      Cap int
  }
I love Go's slice syntax. It's the right piece of syntactic sugar. It removes tedium and room-for-mistakes from this prevalent C pattern. It lets me work as precisely with memory as I do in C, yet everything is simpler, lighter, and breezier.

For example, I'm making a game in Go. I don't want to be allocating memory throughout the game (which can cause frame drops), so instead I allocate giant arrays of memory on launch. Then I use slices to partition this memory as needed.

Some of these arrays get their elements re-accumulated at some interval (every level, or every frame, etc.). And so it works nicely to first set them to zero length:

  myMem = myMem[:0]
Notice that the myMem slice now has zero length, but still points to its same underlying array. Then I perform the accumulation:

  for ... {
      myMem = append(myMem, elem)
  }
Again, I don't want to be allocating in general, so I care very much that append continues to use myMem's preexisting capacity.
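
Concretely, the pattern looks something like this (a minimal sketch, not the game's actual code):

    package main

    import "fmt"

    func main() {
        // Allocate once, up front.
        buf := make([]int, 0, 1024)

        for frame := 0; frame < 3; frame++ {
            buf = buf[:0] // reset the length, keep the backing array
            for i := 0; i < 10; i++ {
                buf = append(buf, i) // stays within the existing capacity: no allocation
            }
            fmt.Println(frame, len(buf), cap(buf)) // cap stays 1024 on every frame
        }
    }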

All this is to say, I don't see slices as being the way they are for the "sake of optimization." Rather I see them as an ideal tool for working with, referring to, and partitioning memory.


C doesn't have any slicing operator like Go's `a[x:y]`, which is the main problem I want to point out. Slice itself is just a natural construction.


Yes, I'm saying Go's slice operator (and slice type, and related functions) grew out of experience from the way arrays are used in C.

To me, this becomes obvious if you read The Practice of Programming by Kernighan and Pike, the latter of which was a co-designer of Go. If you read the book, which was written well before Go, and pay attention to how it uses C arrays, you can almost feel Go slices emerging. The slice syntax perfectly encapsulates the C array bookkeeping.


I'm not sure how it can be possible. In my experience the notion of three-part slices does exist in C but only implicitly. For example,

    size_t trim_end(const char *p, size_t len) {
        while (len > 0) {
            if (p[len - 1] != ' ') break;
            --len;
        }
        return len;
    }
Conceptually this function accepts a slice `(p, len, cap)` and returns a slice `(p, len2, cap)` where `len2 <= len` and the capacity never changes. But the actual argument doesn't have `cap`, and the return value doesn't have `p`. Everything is implicit, and it's typical for C programmers to fully document and follow such implicit conventions. Go's slice operator can't come out of such implicit practices, in my opinion.

In comparison, your claim would only make sense if the following were somehow normal:

    struct slice { const char *p; size_t len, cap; };
    struct slice trim_end(const struct slice *s) {
        struct slice out = *s;
        while (out.len > 0) {
            if (out.p[out.len - 1] != ' ') break;
            out = subslice(out, 0, out.len - 1);
        }
        return out;
    }
Note that the hypothetical `subslice` call maps perfectly to the Go code `out[0:len(out)-1]`, and my complaint would equally apply: there should be two clearly named variants of `subslice` that do or do not keep the capacity. But I have hardly ever seen such a construction in C.
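
For comparison, a rough Go rendering of that C function (my own sketch, assuming a []byte input): each re-slice keeps the pointer and capacity and only shrinks the length, which is exactly the implicit (p, len, cap) triple above.

    // trimEnd drops trailing spaces; s[:len(s)-1] plays the role of the
    // hypothetical subslice(out, 0, out.len-1) and never touches the capacity.
    func trimEnd(s []byte) []byte {
        for len(s) > 0 && s[len(s)-1] == ' ' {
            s = s[:len(s)-1]
        }
        return s
    }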


I feel like we're talking past each other. I'm not saying that C has a slicing operator, or that it's typical to define one as a function, or that it's typical to define a slice-like struct in C.

I'm saying that if you look at how arrays get used in C, you'll see that you're usually passing around extra numbers with them. So Go added syntax that encapsulates this. And it encapsulates the most general case (hence both a length and a capacity, even though most cases in C only use one number).

Instead of passing (char *, int) in C, you just pass (slice) in Go. And the Go slicing operator gives a nice syntax for selecting ranges of buffers.

But a Go slice is just a pointer to an array underneath, and I think it's best to always think about them that way. And then it's natural that mySlice[:2] would keep the same underlying capacity, and that future appends would mutate that capacity. Defaulting mySlice[:2] to mean mySlice[:2:2] seems less convenient overall to me (though prevents mistakes like in your original hi/lo example, but those come from not thinking of slices in terms of the underlying array).


> Instead of passing (char *, int) in C, you just pass (slice) in Go.

Maybe that's a point of disagreement here. I don't exactly see (char *, int) as a single entity, so it cannot be replaced with a slice (a definitely single entity) in my view.

They do appear together in arguments, but they have to be separate variables when you manipulate them and no C syntax will suggest that they are indeed related. So you have to rely on conventions (say, `name`, `name_len` and `name_cap`) which tend to be fragile. You can't expect such annoyance from Go slices, so I argue they are necessarily different from each other.


It works fine if you replace that by the same thing, which is what most langages (with slices) do.

The problem is that Go does not do that: because its designers could not be arsed to provide a separate vector type, a slice is (ptr, len, cap), and now you can start stomping on out-of-slice memory unless the creator of the slice has used the extended slicing form.


Wow that seems pretty unsafe...

In D, for example, this works as most people would expect:

    import std.stdio;
    void main() {
      int[] a = [1, 2, 3, 4, 5];
      auto lo = a[0 .. 2], hi = a[2 .. $];
      lo ~= [6,7,8]; // new allocation
      writefln("%s %s", lo, hi);  // prints [1, 2, 6, 7, 8] [3, 4, 5]
    }
You simply can't overwrite a backing array from a slice (unless you do unsafe stuff very explicitly).


This, like most issues with slices, stems directly from them pulling double duty as vectors.

Capacity being conserved by slicing is important to the “slice tricks” of removing items, I assume.
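
For example, the classic removal trick only avoids a reallocation because a[:i] keeps the original backing array's capacity (a generic illustration, not code from the thread):

    package main

    import "fmt"

    func main() {
        a := []int{1, 2, 3, 4, 5}
        i := 2
        // Remove a[i] in place: append copies a[i+1:] over a[i:] within the same array.
        a = append(a[:i], a[i+1:]...)
        fmt.Println(a) // [1 2 4 5]
    }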


You’re also never increasing the size of those slices right? So it’s better memory wise by a bit and maybe faster? I last used Go a while ago, but I recall the capacity was the length of the underlying array? Internally it may even reuse the same arrays for slices maybe, since they’re not changing size each loop iteration.

Edit: weird, this was supposed to be an update to a previous comment I made, but this is a different comment now


You’re also never increasing the size of those slices right? So it’s better memory wise by a bit and maybe faster? I last used Go a while ago, but I recall the capacity was the length of the underlying array?

Edit2: (I’m throttled and I can’t post a new comment I guess? Makes me feel like my voice was stolen! I guess I’m not wanted around HN.)

Thanks for the correction, I would delete my comment but I found a bug in HN while updating it so I’ll preserve my stupidity here in duplicate for now.


There is no copying of memory going on; the `a[i:i+4]` or `a[i:i+4:i+4]` slice references the same underlying data as `a`.


I wonder whether avo could have been useful here?[1] I mention it because it came up the last time we were talking about AVX operations in go.[2, 3]

1 = https://github.com/mmcloughlin/avo

2 = https://news.ycombinator.com/item?id=34465297

3 = https://www.reddit.com/r/golang/comments/10hmh07/how_to_use_...


I did consider Avo! I even went as far as to implement a version using Avo since it has a nice dot product example I could use as a starting point. But ultimately, for as small as these functions are, I felt that Avo was an unnecessary extra layer to grok. Additionally, it's x86-only, and I knew in advance I'd want to implement an ARM version as well since we also do some embeddings stuff locally.

If I were to ever take this further and add loop unrolling or something, I'd absolutely reach for Avo


Thanks for sharing, that is a very interesting coda to the story!


I appreciate that there’s so many engineers out there that love to make these kinds of improvements. I need them! I’m a high-level person focused mostly on the product, and we would never have a good product without these smart people.

Keep up the good work!


I learned yesterday about Go's assembler (https://go.dev/doc/asm) after browsing how Arrow is implemented for different languages (my experience is mainly C/C++): https://github.com/apache/arrow/tree/main/go/arrow/math. There are a bunch of .s ("asm") files and I'm still not able to comprehend how these work exactly (I guess it'll take more reading); it seems very peculiar.

The last time I used inline assembly was back in Turbo/Borland Pascal, then a bit in Visual Studio (32-bit) until it got disabled. Then I did very little with gcc and its stricter specification (with the former you had to know how the ABI worked; with the latter too, but it was specced out).

Anyway - I wasn't expecting to find this in "Go" :) But I guess you can always start with .go code then produce assembly (-S) then optimize it, or find/hire someone to do it.


I wonder how well a simple C for loop with -O3 and maybe -march would do here.

From my brief forays into reading (mostly AArch64) assembly, it looks like C compilers can detect these kinds of patterns now and just convert them all to SIMD by themselves, with no work from the programmer. Even at -O2, converting an index-based loop into one based on start and end pointers is not unusual. Go doesn't seem to do this; the assembly output by the Go compiler looks much closer to the actual code than what you get from C.

Rust iterators would also be fun to benchmark here, they're supposed to be as fast as plain old loops, and they're probably optimized to omit bounds checks entirely.


> Rust iterators would also be fun to benchmark here

I started to write this out, and then thought "you know what given how common this is, I bet I could even just google it" and thought that would be more interesting, as it makes it feel more "real world." The first result I got is what I would have written: https://stackoverflow.com/a/30422958/24817

Here's a godbolt with three different outputs: one at -O, one at -O3, and one at -O3 and -march=native.

https://godbolt.org/z/6xf9M1cf3

Eyeballing it, some comments:

Looks like 2 and 3 both provide extremely similar if not identical output.

Adding the native flag ends up generating slightly different codegen; I am not at the level to be able to simply look at that and know how meaningful the difference is.

It does appear to have eliminated the bounds check entirely, and it's using xmm registers.

I am pleasantly surprised at this output because zip in particular can sometimes hinder optimizations, but rustc did a great job here.

----------------------

For fun, I figured "why not also try as direct of the original Go as possible." The only trick here is that Rust doesn't really do the c-style for loop the way Go does, so I tried to translate what I saw as the spirit of the example: compare the two lengths and use the minimum for the loop length.

Here it is: https://godbolt.org/z/cTcddc8Gs

... literally the same. I am very surprised at this outcome. It makes me wonder if LLVM has some sort of idiom recognition for dot product specifically.

EDIT: looks like it does not currently, see the comment at line 28 and 29: https://llvm.org/doxygen/LoopIdiomRecognize_8cpp_source.html


Those aren't vectorized at all, just unrolled. vmulss/vaddss are the scalar forms: they multiply/add a single single-precision float in the vector register.

With clang you get basically the same codegen, although it uses fused multiply adds.

The problem is that you need to enable -ffast-math; otherwise the compiler can't change the order of floating-point operations, and thus can't vectorize the reduction.

With clang that works wonderfully and it gives us a lovely four times unrolled AVX2 fused multiply add loop, but enabling it in rust doesn't seem to work: https://godbolt.org/z/G4Enf59Kb

Edit: from what I can tell this is still an open issue??? https://github.com/rust-lang/rust/issues/21690

Edit: relevant SO post: https://stackoverflow.com/questions/76055058/why-cant-the-ru... Apparently you need to use `#![feature(core_intrinsics)]`, `std::intrinsics::fadd_fast` and `std::intrinsics::fmul_fast`.


Ahh yes, thank you.

Rust doesn't have a -ffast-math flag, though it is interesting that you passed it directly to llvm. I am kinda glad that escape hatch doesn't work, to be honest.

There are currently unstable intrinsics that let you do this, and you seemingly get close to clang codegen with them: https://godbolt.org/z/EEW79Gbxv

The thread tracking this discusses another attempt at a flag to enable this by turning on the CPU feature directly, but that doesn't seem to affect codegen in this case. https://github.com/rust-lang/rust/issues/21690

It would be nice to get these intrinsics stabilized, at least.

EDIT: oops you figured this out while I was writing it, haha.


Even without a -ffast-math flag, the current stable Rust compiler will vectorize loops on integer types.

https://godbolt.org/z/KjErzacfv

Edit: ...and I now realize who I responded to, I'm sure you already know this. :)


Loops on floats are fine; it's just reduction operations that hit the issue, with the associativity assumption for floats leading to UB. You can trick it by making f32x16 types like the wide crate does, or, if you use nightly, simba can do it with const generic expressions.


It was just last week that I was reading a comment that made it seem like you shouldn't really use -ffast-math[0], but here it looks like a not-so-rare reason why you would want to enable it.

What is the correct idiom here? It feels like if this sort of thing really matters to you, you should have the know-how to handroll a couple lines of ASM. I want to say this is rare, but I had a project a couple of years ago where I needed to handroll some vectorized instructions on a Raspberry Pi.

[0] https://news.ycombinator.com/item?id=39013277


Definitely don't take an HN comment as a serious suggestion. Enable fast-math for your code, run your objective evaluation that's suitable for your domain, and if it passes the test, enjoy the added speed.

FWIW I have oodles of numerical C++ code where fast-math doesn't change the output.


For a very long time `-funsafe-math-optimizations` (and thus `-ffast-math`) had been infectious [1], so a responsible library should never have used `-ffast-math` anyway.

You are right in that the final binary is free to turn `-ffast-math` on if you can verify that everything went okay. But almost no one would actually verify that. It's like an advice that you shouldn't write your own crypto code---it's fine if you know what you are doing, but almost no one does, so the advice is technically false but still worthwhile.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522 (GCC), https://github.com/llvm/llvm-project/issues/57589 (LLVM)


That sounds like trying to run a program to check whether it has undefined behavior. How do you make a test that's comprehensive and future-compiler-safe?


The usual technique is to keep a 4-element array of sums (so sum[j] is the sum of all terms of the form a[4*i + j] * b[4*i + j]), and then take the total at the very end. This allows for the use of vectorization even with strict IEEE-compliance.

Generally, I would recommend against -ffast-math mostly because it enables -ffinite-math-only, and that one can really blow up in your face. Most other flags (like -funsafe-math-optimizations) aren't that bad from an accuracy standpoint. Obviously you should not turn them on for code that you have actually tuned to minimize error, but in other cases they barely ever degrade the results.
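
In Go terms that technique looks roughly like this (a sketch of the idea, not the article's code). The four partial sums are independent chains, so the additions within each chain keep their order, and the compiler or CPU is free to overlap or vectorize the chains:

    // dot4 keeps four independent accumulators and combines them at the end.
    // The second loop handles lengths that aren't a multiple of 4.
    func dot4(a, b []float32) float32 {
        var s0, s1, s2, s3 float32
        i := 0
        for ; i+4 <= len(a) && i+4 <= len(b); i += 4 {
            s0 += a[i] * b[i]
            s1 += a[i+1] * b[i+1]
            s2 += a[i+2] * b[i+2]
            s3 += a[i+3] * b[i+3]
        }
        for ; i < len(a) && i < len(b); i++ {
            s0 += a[i] * b[i]
        }
        return (s0 + s1) + (s2 + s3)
    }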


The right path forward for Rust here in my opinion is to do the same thing as is done for math operations like saturating: stabilize a function or method that performs the operation with this semantic, and then build thin wrappers on top to make using them more convenient.


Imo the best solution would be special "fast" floating-point types, that have less strict requirements.

I personally almost always use -ffast-math by default in my C programs that care about performance, because I almost never care enough about the loss in accuracy. The only case where I remember needing to disable it was when doing some random number distribution tests where I cared about subnormals, and got confused for a second because they didn't seem to exist (-ffast-math disables them on x86).


That, or a scoped optimization directive. GCC does allow `__attribute__((optimize("-ffast-math")))` as a function-wide attribute, but Clang doesn't seem to have an equivalent, and the standard syntax `[[gcc::optimize("-ffast-math")]]` doesn't seem to work either. In any case, such optimization should be visible from the code in my opinion.


The problem is that it takes a single piece of code compiled with -ffast-math to break everything; it's simply not worth it.


GP seems to be saying that you can flag individual functions in GCC, thereby avoiding this issue: only flagged functions would be compiled with fast math semantics.


This only works if it's a leaf function whose result you throw away. If you feed the result of your -ffast-math function into other working code, you risk breaking it.


`-ffast-math` is fully local, aside from GCC's unexpected `crtfastmath.o` linkage, which is global.

Functions with `-ffast-math` enabled still return fp values via the usual registers and in the usual formats. If some function `f` is expected to return -1.0 to 1.0 for particular inputs, `-ffast-math` can only make it return 1.001 or NaN instead. If another function without `-ffast-math` expects and doesn't verify f's return value, it will surely misbehave, but only because the original analysis of f no longer holds.

`-ffast-math` the compiler option is bad because this effect is not evident from the code. Anything visible in the code should be okay.


> you should have the know how to handroll a couple lines of ASM

For what architecture? What if this code is in a library that your users might want to run on Intel (both 32- and 64-bit), ARM, RISC-V and s390x? Even if you learn assembly for all of these, how are you going to get access to an s390x IBM mainframe to test your code? What if a new architecture[1] gets popular in the next couple of years, and you won't have access to a CPU to test on?

Leaving this work to a compiler or architecture-independent functions / macros that use intrinsics under the hood frees you from having to think about all of that. As long as whatever the user is running on has decent compiler support, your code is going to work and be fast, even years later.

[1] https://en.wikipedia.org/wiki/Loongson


You can write it like this to get the compiler to generate SIMD: https://godbolt.org/z/ohvoEb7er

It's certainly not perfect though (in particular the final reduction/remainder handling).

Unfortunately Rust doesn't have a proper optimizing float type. I really wish there was a type FastF32 or something similar which may be optimized using the usual transformation rules of algebra (e.g. associative property, distributive property, x + y - y = x, etc).

There is fadd_fast and co, but those are UB on NaN/infinite input.


Usually one wants a subset of the thing that -ffast-math does, e.g. -fassociative-math. And only within some limited scope.


I played around with the example a bit; the minimum for vectorization seems to be -fassociative-math -fno-signed-zeros. The gcc docs say -fassociative-math requires -fno-signed-zeros and -fno-trapping-math though.

I suppose -fassociative-math -fno-signed-zeros -fno-trapping-math -freciprocal-math will get you most of the way there, and maybe an -ffinite-math-only when appropriate.


fast-math changes even libraries not compiled with it, so buyer beware.


I had a go at writing this, both the initial float version and the later integer-based version. Note that I haven't actually tested these to make sure the output is correct, so I might have messed up somewhere, and that I targeted Rocket Lake as the CPU.

For the float-based version[0] I had to break out the unstable portable_simd to get it to vectorize. Most of the function ends up being setting up everything, but then actually doing the calculation is simple, and basically the same as non-SIMD section. I've never used the portable SIMD stuff before, and it was quite pleasant to use.

For the integer-based version, I started with the simple naive approach[1], and that vectorized to a pretty good degree on stable. However, it doesn't use the dot-product instruction. For that, I think we need to use nightly and go a bit manual[2]. Unsurprisingly, it mostly ends up looking like the float version as a fair chunk is just setup. I didn't bother here, but it should probably be using feature detection to make sure the instruction exists.

[0] https://godbolt.org/z/Gdv8azorW [1] https://godbolt.org/z/d8jv3ofYo [2] https://godbolt.org/z/4oYEnKTbf


It depends.

You need 2~3 accumulators to saturate instruction-level parallelism with a parallel sum reduction. But the compiler won't do it because it only creates those when the operation is associative, i.e. (a+b)+c = a+(b+c), which is true for integers but not for floats.

There is an escape hatch in -ffast-math.

I have extensive benches on this here: https://github.com/mratsim/laser/blob/master/benchmarks%2Ffp...


In my experience, compilers rarely know how to make use of ILP even in what you would expect to be the "simple" cases. Handwriting the SIMD, at least in my case, almost always proved to be several times faster than the auto-vectorized code generated by the compiler.


They do reorder instructions. I think the SIMD part has more to do with loop analysis than ILP.

It's quite telling that there is a #pragma omp simd to hint to a compiler to rewrite the loop.

Now I wonder what the state of polyhedral compilers is. It's been many years. And given the AI/LLM hype, they could really shine.


> I think the SIMD part has more to do with loop analysis than ILP.

If you know how to rewrite the algorithm in such a way that your SIMD makes close-to-ideal utilization of the CPU ports, then it is practically impossible to beat. And I haven't seen a compiler (GCC, clang) doing such a thing, or at least not in the instances I had written. I've measured substantial improvements from such and similar use of CPU-level microarchitectural details. So I don't think it's only the loop analysis; I do think it's practically an impossible task for the compiler. Perhaps with the AI ...


You don't necessarily need `-ffast-math` if you are willing to tweak your code a bit, since you can just introduce accumulators by yourself. Such optimizer-friendly code is not hard to write if you know basic principles.


That's what I'm saying though. Either you do the accumulators yourself, or you need a compiler escape hatch.


I mean, you don't need SIMD to put accumulators.


The compiler escape hatch is -ffast-math


This is the type of optimization we used to see in embedded hardware constrained environments.

Nowadays, even on most embedded platforms, resources are so abundant and applications are so routinely HORIZONTALLY scaled that it feels like nobody bothers with VERTICAL scaling.

The thing about multiplication is that it is commutative; an 8x improvement in the performance of a single node means you need 8x fewer nodes! What a fun way to cut down your AWS bill and send your SaaS company to profitability.


I don't know the truth about it, but what people say is that most unprofitable SaaS startups are the way they are because they're making no money at all, not because their costs are higher than their revenues of roughly twice one developer's compensation.


I think there are plenty of SaaS companies out there that are subsidizing what their customers are paying them with VC cash. I've been at 2 SaaS companies where infrastructure optimization was starting to be a real concern. If your unit economics are in the red and additional customers cost you money, you need to optimize somehow.


For other languages (including nodejs/bun/rust/python, etc.) you can have a look at SimSIMD, which I have contributed to this year (I made precompiled binaries for nodejs/bun part of the build process, for x86_64 and arm64 on Mac and Linux, and x86 and x86_64 on Windows).

[0] https://github.com/ashvardanian/SimSIMD


I think SimSIMD doesn't have a Rust port yet. Given it's a relatively small and self-contained library, a source-to-source translation might be a better choice for Rust!


Yeah, it is all in the C header file. I thought it did Rust, but instead it has something for Go! OP should check it out.


This is a task where Halide (https://halide-lang.org/) could really shine! It decouples logic from scheduling (unrolling, vectorizing, tiling, caching intermediates, etc.), so every step the author describes in the article is a tunable in Halide. Halide doesn't appear to have bindings for Go, so calling C++ from Go might be the only viable option.

Edit: the author has valid points against FFI here https://news.ycombinator.com/item?id=39110692


The overhead of cgo seems like it would not be an issue if you could pass in an array of vectors and move the outer loop into the procedure instead of calling the procedure for each vector. This may only be viable if there is a low-overhead way to pass arrays back and forth between Go and C. I'm no expert, but a quick Google makes me think this is possible.

I get that hand rolling assembly is fun but as a theoretical future maintainer I would much prefer C to ASM.
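
To make the shape concrete, here is a hedged sketch of what batching the outer loop into a single cgo call might look like; `dot_batch` and `DotBatch` are hypothetical names, not an existing API:

    package vec

    /*
    #include <stddef.h>

    // Hypothetical helper: one cgo call computes n dot products of dimension dim.
    static void dot_batch(const float *a, const float *b, float *out,
                          size_t n, size_t dim) {
        for (size_t i = 0; i < n; i++) {
            float sum = 0.0f;
            for (size_t j = 0; j < dim; j++) {
                sum += a[i*dim+j] * b[i*dim+j];
            }
            out[i] = sum;
        }
    }
    */
    import "C"

    import "unsafe"

    // DotBatch computes len(out) dot products in one cgo call, amortizing the
    // call overhead. a and b hold len(out)*dim floats laid out contiguously.
    func DotBatch(a, b, out []float32, dim int) {
        C.dot_batch(
            (*C.float)(unsafe.Pointer(&a[0])),
            (*C.float)(unsafe.Pointer(&b[0])),
            (*C.float)(unsafe.Pointer(&out[0])),
            C.size_t(len(out)), C.size_t(dim),
        )
    }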


Definitely possible! However, I prefer to avoid cgo wherever possible for more than just the overhead.

In my experience:

- It complicates builds by requiring a C toolchain

- It makes single, static binaries more difficult

- It makes portability more difficult (not that assembly is portable though)

- It causes difficult-to-debug issues (I recently ran into an issue where MacOS signing changed, causing all my cgo binaries to be killed on startup)

- Debuggers don't work across Cgo boundaries (they do with go ASM!)

I think Dave Cheney said it best: https://dave.cheney.net/2016/01/18/cgo-is-not-go


The whole "cgo is not Go" bashing is a kind of joke, maybe if Go 2 ever comes to light, one of the first breaking changes should be to remove cgo.

There, beautiful Go code from that point onwards.


Dear Lord, no! I depend on cgo for my game!

What's more, I have no complaints about cgo. It hasn't been a performance problem at all (my game runs smoothly at 240hz). And the interface for binding to C is quite simple and nice to work with (I made my own bindings to SDL2, OpenGL, and Steamworks).

Cgo did make cross-compiling trickier to set up, but I managed to do it with a fair amount of elbow grease (Windows and MacOS builds on my Linux desktop).


That is my point with my sarcastic remark, cgo should be embraced for the workflows it offers, instead of the whole "cgo is not Go" drama.


The author used:

    sum := float32(0)
Over Go's zero-value default initialization e.g.

    var sum float32
A nit stylistically but wondering if there was a good reason to do so?


One advantage is that it makes the value explicit, so anyone reading the code, regardless of their familiarity with Go, will know it will be zero.


I do the same because I find the var keyword ugly.


Sure, I understand that from the POV of making your code explicit. Personally I have always tended towards, for example:

    var x string
when initializing empty (zero-value) vars, versus:

    x := "hello"
when initializing variables that should hold an initial value.

To me as a Go programmer at least, this is more obvious and intuitive as to the intent of the declaration.


> In our unrolled code, the dependencies between multiply instructions are removed, enabling the CPU to take more advantage of pipelining. This increases our throughput by 37% compared to our naive implementation.

Can this 37% truly be attributed to loop unrolling? One really notable difference is the loop only compares against len(a) instead of both len(a) and len(b) in the unrolled version. I don't know enough about Go to know whether the compiler can optimize the comparison away, but in some other languages it would be significant.

> Also, getting my hands dirty with some assembly sounds fun, so that's what I'm going to do.

I'd have loved to see the comparison between the asm that the compiler was generating and the bespoke asm written here. I'd bet that simply gussying up the generated asm results in a pretty sizable improvement (but obviously less impressive than by switching to SIMD).


I would have tried parallelism just out of curiosity: split and spread the computation over multiple cores. If you have n cores you could get close to a factor-of-n increase, minus the cost of spreading the data and combining the results. That's an easy optimization right out of the box with Go (no assembly required).
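
For instance, something along these lines (a sketch, not the article's code), splitting a batch of dot products across goroutines with one chunk per worker:

    package search

    import (
        "runtime"
        "sync"
    )

    // dot is the single-pair kernel (this could just as well be the SIMD version).
    func dot(a, b []float32) float32 {
        var sum float32
        for i := 0; i < len(a) && i < len(b); i++ {
            sum += a[i] * b[i]
        }
        return sum
    }

    // dotAll scores one query against many vectors, one goroutine per chunk.
    func dotAll(query []float32, vecs [][]float32) []float32 {
        out := make([]float32, len(vecs))
        workers := runtime.GOMAXPROCS(0)
        chunk := (len(vecs) + workers - 1) / workers

        var wg sync.WaitGroup
        for lo := 0; lo < len(vecs); lo += chunk {
            hi := lo + chunk
            if hi > len(vecs) {
                hi = len(vecs)
            }
            wg.Add(1)
            go func(lo, hi int) {
                defer wg.Done()
                for i := lo; i < hi; i++ {
                    out[i] = dot(query, vecs[i]) // each goroutine writes a disjoint range of out
                }
            }(lo, hi)
        }
        wg.Wait()
        return out
    }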


Oh, we do also heavily parallelize. This blog post is just focusing on the single-core perf.


Great stuff. We used to do this sort of opt for dot, axpy, gemm, etc. in the BLAS and LAPACK libs a long time ago.


I did a similar optimization via https://github.com/viterin/vek as the SIMD version. Some somewhat unscientific calculations showed a 10x improvement staying in float32: https://github.com/stillmatic/gollum/blob/07a9aa35d2517af8cf... (comparable to the 9x improvement in article using SIMD + int8)

TBH my takeaway was that it was more useful to use smaller vectors as a representation


Isn't Go quite the suboptimal language for performing linear algebra operations?


Not only for linear algebra operations; had it not been for Docker's and K8s's successes...


From the mouth of this post's author: yes.


Off-topic, but I noticed that this website uses the SF Mono font in their code blocks; is that fine? And no, I don't work for Apple (or anyone, for that matter), I just checked because I'm searching for a (preferably FOSS) free font that looks like SF Mono.


I think SF Mono is the best as well. The closest fonts to it that I know of are Commit Mono, Source Code Pro and Consolas.


Isn't there a BLAS equivalent in go that could help here?


Copying in a response to a similar question on Reddit:

I should really add some discussion around BLAS in particular, which has a good implementation[0] of the float32 dot product that outperforms any of the float32 implementations in the blog post. I'm getting ~1.9m vecs/s on my benchmarking rig.

However, that BLAS became unusable for us as soon as we switched to quantized vectors because there is no int8 implementation of the dot product in BLAS (though I'd love to be proven wrong)

[0]: https://pkg.go.dev/gonum.org/v1/gonum@v0.14.0/blas/blas32#Do...


OpenBLAS with INTERFACE64=1 ??


AFAICT, that still doesn't support 8-bit integer dot product? (To clarify, I was using int8 to mean 8-bit integer, not 8-byte)


Is the best way to write AVX2/512 code really just smushing some assembly into your Go/C/C++?


In C or C++ you can use the functions in immintrin.h instead.


I am quite sure aCC and xlC don't have them.


C++ users can enjoy Highway [1].

[1] https://github.com/google/highway/


a) why not use:

    switch {
    case len(a) < len(b):
        b = b[:len(a)]
    case len(b) < len(a):
        a = a[:len(b)]
    }

b) do you know the usual length of the slices? If so, can't you autogenerate (or manually write) a switch to do the math directly? How would that perform?


TIL that the Go compiler still does not have autovectorization.


Even manual vectorization is a pain... writing ASM, really?

Rust has unstable portable SIMD and a few third-party crates; C++ has that as well; C# has stable portable SIMD and a small out-of-the-box BLAS-like library to help with the most common tasks (like SoftMax, Magnitude, etc. on top of spans of floats, rather than writing them manually); hell, it even exercises PackedSIMD when run in a browser. And now Java is getting Panama vectors some time in the future (though the question of codegen quality remains open given planned changes to the unsafe API).

Go among these is uniquely disadvantaged. And if that's not enough, you may want to visit the 1BRC challenge discussions and see that Go struggles to get anywhere close to the 2s mark, with both C# and C++ blazing past it:

https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...

https://github.com/gunnarmorling/1brc/discussions/67


Panama vectors are extremely disappointing. ByteVector.rearrange in particular takes like 10ns and is the only available way to implement vpshufb, an instruction that takes 1 cycle. Operations like andnot don't just use the andnot instruction. Converting a 32-wide vector that the type system thinks is a mask into a vector uses a blend instead of using 0 instructions. Fixed rearranges like packus are missing. Arithmetic operations that are not simple lane-wise operations like maddubs are missing. aesenc is missing. Non-temporal stores and non-temporal prefetches are missing (there is a non-temporal load instruction but apparently it doesn't do anything differently from a normal load, so if you want to move data to L1d skipping other caches you have to use the prefetch).


Panama vectors are still in preview anyway.


Sure, in a few weeks I will post on the mailing list about how lots of stuff one wants to do with vectors is many times slower because of these issues, and we'll see if they end up adding ByteVector.multiplySignedWithUnsignedGivingShortsAndAddPairsOfAdjacentShorts so that people can write decimal parsers or not.


5s for simple and readable code: https://gist.github.com/corlinp/176a97c58099bca36bcd5679e68f...

Have you seen the 2s code from C#?


These numbers aren't comparable. This Go solution is likely much more than 2.5x slower if you run them on the same hardware.


Is it really worth the trouble if you're not building on top of something like LLVM, which already has a vectorizer? We're still waiting for the mythical sufficiently-smart vectorizer; even the better ones are still extremely brittle, and any serious high-performance work still does explicit SIMD rather than trying to coax the vectorizer into cooperating.

I'd rather see new languages focus on making better explicit SIMD abstractions à la Intel's ISPC, rather than writing yet another magic vectorizer that only actually works in trivial cases.



Java as well; unfortunately it will be kept around as a preview until Valhalla arrives (if ever).


Any of the polyhedral frameworks is reasonably good at splitting loop nests into parallelizable ones.

Then it's just a codegen problem.

But yes, ultimately, the user needs to be aware of how the language works, what is parallelizable and what isn't, and of the cost of the operations that they ask their computer to execute.


Maybe it's a hidden blessing. I just ran into a nasty MSVC auto-vectorization bug. Apparently it's hard to get right.

https://developercommunity.visualstudio.com/t/Bad-codegen-du...


I never do this kind of work so I can’t say. But if I did, I’d imagine I want more control. I mean, perf improvements are welcome to all code, but if I need a piece of code to have a specific optimization I’d rather opt-in through language constructs, so that the compiler (or other tooling) can tell me when it breaks. A well designed API with adapters from and to regular code would be better, no?

For instance, imagine I have auto-perf something and I check (manually mind you) the asm and all is good. Then someone changes the algorithm slightly, or another engineer adds a layer of indirection for some unrelated purpose, or maybe the compiler updates its code paths which misses some cases that were previously supported. And the optimization goes away silently.


Do you have the same view of other compiler optimizations? Would you prefer if the compiler never unrolled a loop so that you can write it out manually when you need it?


No I wasn’t saying (or at least didn’t mean) the compiler shouldn’t optimize automatically. I meant that ensuring certain paths are optimizable could be important when you need it. And also that language constructs would be a good way to achieve that.


They still don't have a register-based calling convention on architectures other than x86-64, right? Or is that information out of date?


That information is out of date: ARM64, PPC, RISC-V https://go.dev/src/cmd/compile/abi-internal


Last time I checked, it didn't even unroll loops.



