Exactly. I think of them as Markov chains in grammar space, or in abstract-syntax-tree space, instead of in n-gram chain-of-words space. The attention mechanism likely plays a role in identifying the parent in the grammar tree, or other kinds of back references: pronouns in natural language, or variable references in programming languages.
There was a time when people estimated n-gram probabilities with feed-forward neural networks [1,2]. We've since improved on that with the (multilayer) attention mechanism, which allows better factoring over individual tokens; it also allows a much larger n.
These techniques are limited to structures that can be checked with bounded history or bounded memory, i.e. with a grammar or an FSA. What about more complex structures that don't factor easily?
YouTube pretty aggressively blocks automated usage now. If you connect via a VPN, it won't show any videos until you log in. Considering that ad income and real engagement seem unchanged, it's possibly just that they started blocking bots better, which isn't something you'd strictly expect a public announcement about.
I also find that they block incognito mode a lot, especially on Android when you use a hacked client like ReVanced (less so in the browser, but I use Firefox and have seen incognito mode cause YouTube to block viewing until you log in).
Modern compilers all autovectorize really well. Usually writing plain canonical loops with plain C arrays is a good way to write portable optimal SIMD code.
The usual workflow I use is to translate the vector notation (RIP Cilk Plus array syntax) in my paper notes to plain C loops. The compiler's optimization report (-qopt-report for icx; gcc has -fopt-info-vec and -fopt-info-vec-missed) gives feedback on which optimizations it considered and why it did not apply them. In more complex scenarios it can help to add `#pragma omp simd` or similar pragmas to override the default C semantics.
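To make that concrete, here is a minimal sketch of the kind of canonical loop this workflow produces (the saxpy name and signature are just illustrative; `#pragma omp simd` needs -fopenmp or -fopenmp-simd on gcc/clang):

```c
/* A canonical loop the autovectorizer handles well. The pragma asserts
 * that iterations are independent, overriding aliasing assumptions that
 * would otherwise block vectorization; `restrict` does similar work.
 * Check the result with e.g. -O3 -fopt-info-vec on gcc. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```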
> Modern compilers all autovectorize really well. Usually writing plain canonical loops with plain C arrays is a good way to write portable optimal SIMD code.
I don't think I've seen this happen once at work (always using recent Clang) in the last… three years? And I've written a fair amount of SIMD code.
Autovectorization works fairly well when all of your code is very vertical (no interference between the elements) _and_ the compiler knows that your N is divisible by some pretty number (or that code bloat is acceptable enough that it can unroll and deal with the tail). And in the occasional very standard horizontal case (e.g., summing a bunch of bytes with N % 16 == 0). Otherwise, it's just a sad, sad story IME. For most of the algorithms in the presentation at hand, the compiler will have no chance of autovectorizing them.
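For reference, this is the shape of the "standard horizontal case" meant above, a plain associative reduction (checksum/buf are illustrative names, not from the thread):

```c
#include <stddef.h>
#include <stdint.h>

/* A byte-sum reduction: vertical per element, one horizontal combine.
 * Integer addition is associative, so gcc/clang vectorize this at -O3
 * without -ffast-math (on x86, typically via psadbw-style widening). */
uint32_t checksum(const uint8_t *buf, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}
```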
I think you can coax most of them into autovectorizing via OpenMP and a bit of restructuring.
Can you give 1-3 examples of these types of problems where you think autovectorization is lacking? I want to give it a try.
Let me give you the last couple of routines that I was involved in vectorizing, no cherry-picking:
- Skip to the first unmatched } (i.e., if there are two { in there, you are looking for the third }), ignoring anything inside "", and handling that any character may be escaped using a backslash. (I've simplified away a bunch of complicating cases, or it would just sound unreasonable. The optimal path depends on whether you have e.g. PCLMUL.)
- Skip to the next <, \r, \0 or & (note that the optimal path on SSSE3+ uses pshufb, or similarly VTBL on NEON) and return its position. A typical case is that it's ~10 bytes away, but it could be kilobytes.
- Parse the decimal part of a double, given that only accuracy up to 1e-7 is needed. (The optimal solution differs somewhat between NEON and SSE2.)
The second one is the only one where I believe autovectorization could reasonably have done a fair job, since it's so standard (but in practice, it didn't).
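For the record, a plain-C statement of that second routine looks something like this (name and signature are my guesses, not the poster's code); the data-dependent early exit is the part autovectorizers tend to choke on, as discussed further down:

```c
#include <stddef.h>

/* Scan for the next '<', '\r', '\0' or '&' and return its position,
 * or n if none is found. Straightforward, but the early exit makes
 * this hard for autovectorizers. */
size_t find_special(const char *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        char c = p[i];
        if (c == '<' || c == '\r' || c == '\0' || c == '&')
            return i;
    }
    return n;
}
```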
* parse_fract: I got that one down to 23 instructions on icx, although gcc took 81 and clang 91
Since both of the other two return an index, I decided to keep it simple and use a 128-iteration inner loop that accumulates the index into an 8-bit integer, so I don't have to widen. 128 instead of 256, because I needed a sentinel value.
* find_unmatched: obviously the compiler couldn't figure out the clmul trick. icx: 0.86 instr/byte, gcc: 0.625 instr/byte; clang failed to vectorize the +-scan.
* find_special: The LUT didn't end up working that well, so I'm doing the four comparisons separately. icx: 0.45 instr/byte, gcc: 0.30 instr/byte, clang: 0.25 instr/byte
(I used znver5 as the target for gcc and clang, but znver4 for icx)
These were more painful to do than they needed to be; somebody should try it with ISPC and see how that compares.
I didn't know about the inclusive scan support in OpenMP before writing this. It's almost good, but the implementations are slightly buggy, and it seems to be designed with threading, not SIMD, in mind: you have to write the scan results into an array, which SIMD doesn't need, whereas multi-threading needs the buffer to do a scan-tree reduction.
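For anyone else who hadn't seen it, the OpenMP 5.0 inclusive-scan syntax looks like this (prefix_sum/out are illustrative names); note how the result has to be materialized through out[i], which is the SIMD-unfriendly part:

```c
/* OpenMP 5.0 inclusive scan: statements before the scan directive
 * update the reduction variable, statements after it may read the
 * accumulated prefix value. Requires -fopenmp or -fopenmp-simd. */
void prefix_sum(const int *a, int *out, int n) {
    int sum = 0;
    #pragma omp simd reduction(inscan, +:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i];
        #pragma omp scan inclusive(sum)
        out[i] = sum;
    }
}
```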
The other problem is early-exit loops, which should totally be permissible. icc had support for early_exit, but icx doesn't support it anymore. Wouldn't you "just" need to do an or-reduction on the condition mask and break if any bit was set?
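That or-reduce-per-chunk idea can at least be hand-rolled today, along these lines (find_byte and the chunk size of 64 are illustrative; this is a sketch, not a claim about what icc's early_exit generated):

```c
/* Strip-mine the search: the fixed-count inner loop has no early exit,
 * so the compiler may vectorize it as an or-reduction; the break then
 * happens once per 64-byte chunk, and the tail loop pins down the
 * exact position. */
static long find_byte(const unsigned char *p, long n, unsigned char c) {
    long i = 0;
    for (; i + 64 <= n; i += 64) {
        int any = 0;
        for (int j = 0; j < 64; j++)
            any |= (p[i + j] == c);
        if (any)
            break;
    }
    for (; i < n; i++)   /* re-scan the hit chunk, plus the tail */
        if (p[i] == c)
            return i;
    return -1;
}
```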
Thanks for the suggestions. Sounds like you are working on some kind of parser?
So I looked at these briefly without AVX512, because only a tiny fraction of people have anything like that and the claim was that this would be a great way of making portable SIMD :-) Also, obviously you cannot use -ffast-math in real code.
parse_fract seems really, really inefficient. Even on plain SSE2, you can do it with unpack + sub + muladd + shuffle + add and then some scalar cleanup (plus the loads, of course). icx looks to be, what, 40 SIMD instructions?
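For concreteness, the core of that SSE2 chain could look roughly like this for a fixed run of 8 digits (the function name, the fixed width, and the absence of validation are all my assumptions; real code has to handle shorter runs):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Turn 8 ASCII digits at p into an integer: unpack bytes to 16-bit,
 * subtract '0', then two pmaddwd rounds fold digit pairs together
 * (x10, then x100), and a scalar multiply-add finishes the job. */
static uint32_t parse8digits_sse2(const char *p) {
    __m128i raw  = _mm_loadl_epi64((const __m128i *)p);
    __m128i zero = _mm_setzero_si128();
    __m128i d    = _mm_unpacklo_epi8(raw, zero);
    d = _mm_sub_epi16(d, _mm_set1_epi16('0'));
    __m128i p1 = _mm_madd_epi16(d, _mm_set_epi16(1, 10, 1, 10, 1, 10, 1, 10));
    __m128i p2 = _mm_madd_epi16(_mm_packs_epi32(p1, zero),
                                _mm_set_epi16(0, 0, 0, 0, 1, 100, 1, 100));
    uint32_t hi = (uint32_t)_mm_cvtsi128_si32(p2);
    uint32_t lo = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(p2, 4));
    return hi * 10000 + lo;  /* scalar cleanup */
}
```

Scale by 1e-8 (or by however many digits were actually present) to get the fractional value.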
find_unmatched is just scalar code cosplaying SIMD; 150 instructions and most of them do one byte at a time.
find_special seemingly generates _thousands_ of instructions! What you need, per 16-byte loop iteration, is a pshufb + compare + pmovmskb (0x80.pl is down now, but IIRC it's explained there). It helps that all of the values in question differ in one of the nibbles.
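Since '<' (0x3C), '\r' (0x0D), '\0' (0x00) and '&' (0x26) all have distinct low nibbles, the trick can look like this (my reconstruction of the technique, not the 0x80.pl code; find_special_ssse3 is an illustrative name):

```c
#include <tmmintrin.h>  /* SSSE3 */
#include <stddef.h>
#include <stdint.h>

/* Low-nibble LUT: entries 0x0, 0x6, 0xC, 0xD hold the target bytes,
 * everything else holds 0xFF, which never equals a byte < 0x80.
 * For bytes >= 0x80, pshufb returns 0, which never equals them either. */
static size_t find_special_ssse3(const uint8_t *p, size_t n) {
    const __m128i lut = _mm_setr_epi8(
        0x00, -1, -1, -1, -1, -1, 0x26, -1,
        -1, -1, -1, -1, 0x3C, 0x0D, -1, -1);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        __m128i t = _mm_shuffle_epi8(lut, v);             /* pshufb  */
        int m = _mm_movemask_epi8(_mm_cmpeq_epi8(t, v));  /* pmovmskb */
        if (m)
            return i + (size_t)__builtin_ctz((unsigned)m);
    }
    for (; i < n; i++)  /* scalar tail */
        if (p[i] == '<' || p[i] == '\r' || p[i] == '\0' || p[i] == '&')
            return i;
    return n;
}
```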
I am not that convinced that this is a usable universal technique :-) Even after switching back to supporting AVX512, the generated code is a huge mess. Seemingly these three functions together clock in at 6+ kB, which is ~10% of your entire icache.
> Sounds like you are working on some kind of parser?
Yeah, I wouldn't use this type of thing for these problems; it was a huge mess. But I think code designed for autovectorization is reasonable as a scalar base implementation for a larger set of problems than people think.
I've seen the problem before where people explicitly vectorized something with SSE, but an autovectorized (AVX) build outperformed the explicitly vectorized one, because the compiler could take advantage of newer extensions.
You really should be able to use an ISPC-like subset in your C code. OpenMP goes in the right direction, but it's not designed with SIMD as the highest priority.
Yes, I'd say that the strongest part of autovectorization is that you can get more-or-less automatic support for wider/newer instruction sets than what you had when you wrote the code; older SIMD, like all other code, tends to rot. Of course, this is predicated on either having function multiversioning (works, but very rare) or being able to frequently raise the minimum target for your binary.
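One way to get that multiversioning today with GCC/Clang is the target_clones attribute (the function and the target list here are illustrative): the compiler emits one clone per listed target plus a baseline, and an ifunc resolver picks the best variant at load time, so the same plain loop can pick up AVX-512 on machines that have it:

```c
/* Build one clone per target; glibc's ifunc mechanism selects the
 * best variant at load time. Requires gcc/clang on an ELF platform. */
__attribute__((target_clones("default", "avx2", "avx512f")))
void scale(float *restrict dst, const float *restrict src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}
```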