
So I think I managed to do all of the above with varying degrees of success: https://godbolt.org/z/WY99vxs76

* parse_fract: I got that one down to 23 instructions on icx, although gcc took 81 and clang 91

Since both of the other two return an index, I decided to keep it simple and use a 128-iteration inner loop that accumulates the index into an 8-bit integer, so I don't have to widen. 128 instead of 256, because I needed a sentinel value.
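For readers who want to see the shape of that loop, here is a minimal sketch. It is not the code from the godbolt link: `find_first`, `is_target`, `BLOCK` and `NOT_FOUND` are hypothetical names, and the predicate is a placeholder. The point is that every value in the inner loop fits in 8 bits, and 0xFF stays free as the "nothing found" sentinel because indices only go up to 127.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 128        /* indices 0..127 fit in 8 bits with room to spare */
#define NOT_FOUND 0xFF   /* sentinel: never a valid in-block index */

/* Hypothetical predicate, stands in for whatever is being searched for. */
static inline int is_target(uint8_t c) { return c == '<'; }

/* Returns the index of the first matching byte, or n if there is none. */
size_t find_first(const uint8_t *s, size_t n) {
    size_t base = 0;
    for (; base + BLOCK <= n; base += BLOCK) {
        uint8_t best = NOT_FOUND;
        /* No early exit and only 8-bit values: a min-reduction over lanes. */
        for (uint8_t i = 0; i < BLOCK; i++) {
            uint8_t cand = is_target(s[base + i]) ? i : NOT_FOUND;
            best = cand < best ? cand : best;
        }
        if (best != NOT_FOUND)
            return base + best;
    }
    /* Scalar tail for the last partial block. */
    for (; base < n; base++)
        if (is_target(s[base]))
            return base;
    return n;
}
```

With a 256-iteration block every 8-bit value would be a valid index, so the sentinel would force a wider accumulator.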

* find_unmatched: obviously the compiler couldn't figure out the clmul trick. icx: 0.86 instr/byte, gcc: 0.625 instr/byte, clang failed to vectorize the +-scan.

* find_special: The LUT didn't end up working that well, so I'm doing the four comparisons separately. icx: 0.45 instr/byte, gcc: 0.30 instr/byte, clang: 0.25 instr/byte

(I used znver5 as the target for gcc and clang, but znver4 for icx)

These were more painful to do than need be. Somebody should try it with ISPC and see how that compares.

I didn't know about the inclusive scan support in OpenMP before writing this. It's almost good, but the implementations are slightly buggy, and it seems to be designed with threading, not SIMD, in mind. That is, you have to write the scan into an array: SIMD doesn't need that, but multi-threading needs the buffer to do a scan-tree reduction.
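For reference, the OpenMP 5.0 inclusive scan looks roughly like this. This is a sketch, not the code from the link: `depth_scan` and the bracket-depth example are made up for illustration. Without `-fopenmp` or `-fopenmp-simd` the pragmas are ignored and it runs as a plain scalar loop.

```c
#include <stddef.h>

/* Writes the running nesting depth into depth[]: +1 for '(' and -1 for ')'.
   Hypothetical example; the output array is what the scan construct is
   built around. */
void depth_scan(const char *s, int *depth, size_t n) {
    int d = 0;
    #pragma omp simd reduction(inscan, +:d)
    for (size_t i = 0; i < n; i++) {
        /* Input phase: contribute this element to the running sum. */
        d += (s[i] == '(') - (s[i] == ')');
        #pragma omp scan inclusive(d)
        /* Scan phase: d now holds the inclusive prefix sum up to i. */
        depth[i] = d;
    }
}
```

The `scan` directive splits the loop body into an input phase and a scan phase, and the scanned value is consumed in the second phase, here by storing it to `depth[i]`.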

The other problem is early exit loops, which should totally be permissible. icc also had support for early_exit, but icx doesn't support it anymore. Wouldn't you "just" need to do an or reduction on the condition mask and break if one bit was set?
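Written out by hand, the OR-reduction idea would look something like the sketch below. `find_byte` and `LANES` are hypothetical names, and this is only the manual workaround, not something a compiler or OpenMP does for you: the inner loop has no exit, and the break happens once per block on the reduced value.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16  /* assumed block width, e.g. one 128-bit vector of bytes */

/* Returns the index of the first byte equal to needle, or n if none. */
size_t find_byte(const uint8_t *s, size_t n, uint8_t needle) {
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        uint8_t any = 0;
        /* Exit-free inner loop: OR-reduce the per-lane condition. */
        for (size_t j = 0; j < LANES; j++)
            any |= (uint8_t)(s[i + j] == needle);
        if (any)
            break;  /* some lane matched; leave i at the block start */
    }
    /* Scalar scan: pins down the exact position, and handles the tail. */
    for (; i < n; i++)
        if (s[i] == needle)
            return i;
    return n;
}
```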

Thanks for the suggestions. Sounds like you are working on some kind of parser?



So I looked at these briefly without AVX512, because only a tiny fraction of people have anything like that and the claim was that this would be a great way of making portable SIMD :-) Also, obviously you cannot use -ffast-math in real code.

parse_fract seems really, really inefficient. Even on plain SSE2, you can do with unpack + sub + muladd + shuffle + add and then some scalar cleanup (plus the loads, of course). icx looks to be, what, 40 SIMD instructions?

find_unmatched is just scalar code cosplaying SIMD; 150 instructions and most of them do one byte at a time.

find_special seemingly generates _thousands_ of instructions! What you need, for each 16-byte loop, is a pshufb + compare + pmovmskb (0x80.pl is down now, but IIRC it's explained there). It helps that all of the values in question differ in one of the nibbles.
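To spell out the pshufb + compare + pmovmskb idea, here is a scalar model of what the three instructions compute per 16-byte block. The character set ('<', '&', '\r' and NUL) is an assumption for illustration, as are the names `lut` and `special_mask16`. What matters is that the four low nibbles (0xC, 0x6, 0xD, 0x0) are all distinct.

```c
#include <stdint.h>

/* 16-entry table indexed by the low nibble. Each used slot holds the one
   special byte with that low nibble; unused slots hold 0, which can never
   equal a byte whose low nibble is nonzero. */
static const uint8_t lut[16] = {
    [0x0] = 0x00,  /* NUL  */
    [0x6] = 0x26,  /* '&'  */
    [0xC] = 0x3C,  /* '<'  */
    [0xD] = 0x0D,  /* '\r' */
};

/* Bit j of the result is set iff block[j] is one of the special bytes. */
uint16_t special_mask16(const uint8_t block[16]) {
    uint16_t mask = 0;
    for (int j = 0; j < 16; j++) {
        uint8_t c = block[j];
        /* pshufb: a lane with the high bit set in its index yields 0,
           otherwise the low nibble selects a table entry. */
        uint8_t looked = (c & 0x80) ? 0 : lut[c & 0x0F];
        /* pcmpeqb + pmovmskb: compare with the input, collect one bit
           per lane. */
        mask |= (uint16_t)((looked == c) << j);
    }
    return mask;
}
```

In the real thing the loop body is `_mm_shuffle_epi8`, `_mm_cmpeq_epi8` and `_mm_movemask_epi8`, and the first match in the block is the lowest set bit of the mask.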

I am not that convinced by this being a usable universal technique :-) Even after switching back to supporting AVX512, the generated code is a huge mess. Seemingly these three functions together clock in at 6+ kB, which is ~10% of your entire icache.

> Sounds like you are working on some kind of parser?

All of these are for HTML/CSS, yes.


Yeah, I wouldn't use this type of thing for these problems; this was a huge mess. But I think code designed for autovec is reasonable as a scalar base implementation for a larger set of problems than people think.

I've seen the problem before, where people explicitly vectorized (sse) something, but the code could be autovectorized (avx) and outperformed the explicitly vectorized one, because the compiler could take advantage of newer extensions.

You really should be able to use an ISPC-like subset in your C code. OpenMP goes in the right direction, but it's not designed with SIMD as the highest priority.


Yes, I'd say that the strongest part of autovectorization is that you can get more-or-less automatic support for wider/newer instruction sets than what you had when you wrote the code; older SIMD, like all other code, tends to rot. Of course, this is predicated on either having function multiversioning (works, but very rare) or being able to frequently raise the minimum target for your binary.


You also get the automatic support for newer instructions (and multiversioning) with a wrapper library such as our Highway :)



