Exactly. I think of them as Markov chains in grammar space, or in abstract-syntax-tree space, instead of in n-gram chain-of-words space. The attention mechanism likely plays a role in identifying the parent in the grammar tree, or other kinds of back references: pronouns in natural language, or variable references in programming languages.
There was a time when people estimated n-gram probabilities with feed-forward neural networks [1,2]. We've since improved on that with the (multilayer) attention mechanism, which allows better factoring over individual tokens; it also allows a much larger n.
These techniques are limited to structures that can be checked with bounded history or bounded memory, i.e. with a grammar or an FSA. What about more complex structures that don't factor easily?
YouTube pretty aggressively blocks automated usage now. If you connect via a VPN, it won't show any videos until you log in. Considering that ad income and real engagement seem unchanged, it's possibly just that they started blocking bots better, which isn't something you'd strictly expect a public announcement about.
I also find that they block incognito mode a lot, especially on Android when you use a hacked client like ReVanced (less so in the browser, but I use Firefox and have seen incognito mode cause YouTube to block viewing until you log in).
Modern compilers all autovectorize really well. Usually writing plain canonical loops with plain C arrays is a good way to write portable optimal SIMD code.
The usual workflow I use is to translate the vector notation (RIP Cilk Plus array syntax) in my paper notes to plain C loops. The compiler's optimization report (-qopt-report for icx; gcc has -fopt-info-vec and -fopt-info-vec-missed) gives feedback on which optimizations it considered and why it did not apply them. In more complex scenarios it can help to add `#pragma omp simd` or similar pragmas to override the default C semantics.
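To make that concrete, here is a minimal sketch of the kind of canonical loop this workflow produces (the saxpy name and signature are just illustrative; `#pragma omp simd` needs -fopenmp or -fopenmp-simd on gcc/clang):

```c
/* A canonical loop the autovectorizer handles well. The pragma asserts
 * that iterations are independent, overriding aliasing assumptions that
 * would otherwise block vectorization; `restrict` does similar work.
 * Check the result with e.g. -O3 -fopt-info-vec on gcc. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```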
> Modern compilers all autovectorize really well. Usually writing plain canonical loops with plain C arrays is a good way to write portable optimal SIMD code.
I don't think I've seen this happen once at work (always using recent Clang) in the last… three years? And I've written a fair amount of SIMD code.
Autovectorization works fairly well when all of your code is very vertical (no interference between the elements) _and_ the compiler knows that your N is divisible by some pretty number (or that code bloat is acceptable enough that it can unroll and deal with the tail). And in the occasional very standard horizontal case (e.g., summing a bunch of bytes with N % 16 == 0). Otherwise, it's just a sad, sad story IME. For most of the algorithms in the presentation at hand, the compiler will have no chance of autovectorizing them.
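For reference, this is the shape of the "standard horizontal case" meant above, a plain associative reduction (checksum/buf are illustrative names, not from the thread):

```c
#include <stddef.h>
#include <stdint.h>

/* A byte-sum reduction: vertical per element, one horizontal combine.
 * Integer addition is associative, so gcc/clang vectorize this at -O3
 * without -ffast-math (on x86, typically via psadbw-style widening). */
uint32_t checksum(const uint8_t *buf, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}
```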
I think you can coax most of them into autovectorizing via OpenMP and a bit of restructuring.
Can you give 1-3 examples of these types of problems where you think autovectorization is lacking? I want to give it a try.
Let me give you the last couple of routines that I was involved in vectorizing, no cherry-picking:
- Skip to the first unmatched } (i.e., if there are two { in there, you are looking for the third }), ignoring anything inside "", and handling that any character may be escaped using a backslash. (I've simplified away a bunch of complicating cases, or it would just sound unreasonable. The optimal path depends on whether you have e.g. PCLMUL.)
- Skip to the next <, \r, \0 or & (note that the optimal path on SSSE3+ uses pshufb, or similarly VTBL on NEON) and return its position. A typical case is that it's ~10 bytes away, but it could be kilobytes.
- Parse the decimal part of a double, given that only accuracy up to 1e-7 is needed. (The optimal solution differs somewhat between NEON and SSE2.)
The second one is the only one where I believe autovectorization could reasonably have done a fair job, since it's so standard (but in practice, it didn't).
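For the record, a plain-C statement of that second routine looks something like this (name and signature are my guesses, not the poster's code); the data-dependent early exit is the part autovectorizers tend to choke on, as discussed further down:

```c
#include <stddef.h>

/* Scan for the next '<', '\r', '\0' or '&' and return its position,
 * or n if none is found. Straightforward, but the early exit makes
 * this hard for autovectorizers. */
size_t find_special(const char *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        char c = p[i];
        if (c == '<' || c == '\r' || c == '\0' || c == '&')
            return i;
    }
    return n;
}
```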
* parse_fract: I got that one down to 23 instructions on icx, although gcc took 81 and clang 91
Since both of the other two return an index, I decided to keep it simple and use a 128-iteration inner loop that accumulates the index into an 8-bit integer, so I don't have to widen. 128 instead of 256, because I needed a sentinel value.
* find_unmatched: obviously the compiler couldn't figure out the clmul trick. icx: 0.86 instr/byte, gcc: 0.625 instr/byte; clang failed to vectorize the +-scan.
* find_special: The LUT didn't end up working that well, so I'm doing the four comparisons separately. icx: 0.45 instr/byte, gcc: 0.30 instr/byte, clang: 0.25 instr/byte
(I used znver5 as the target for gcc and clang, but znver4 for icx)
These were more painful to do than they needed to be; somebody should try it with ISPC and see how that compares.
I didn't know about the inclusive scan support in OpenMP before writing this. It's almost good, but the implementations are slightly buggy, and it seems to be designed with threading, not SIMD, in mind: you have to write the scan results into an array, which SIMD doesn't need, whereas multi-threading needs the buffer to do a scan-tree reduction.
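For anyone else who hadn't seen it, the OpenMP 5.0 inclusive-scan syntax looks like this (prefix_sum/out are illustrative names); note how the result has to be materialized through out[i], which is the SIMD-unfriendly part:

```c
/* OpenMP 5.0 inclusive scan: statements before the scan directive
 * update the reduction variable, statements after it may read the
 * accumulated prefix value. Requires -fopenmp or -fopenmp-simd. */
void prefix_sum(const int *a, int *out, int n) {
    int sum = 0;
    #pragma omp simd reduction(inscan, +:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i];
        #pragma omp scan inclusive(sum)
        out[i] = sum;
    }
}
```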
The other problem is early-exit loops, which should totally be permissible. icc had support for early_exit, but icx doesn't support it anymore. Wouldn't you "just" need to do an or-reduction on the condition mask and break if any bit was set?
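That or-reduce-per-chunk idea can at least be hand-rolled today, along these lines (find_byte and the chunk size of 64 are illustrative; this is a sketch, not a claim about what icc's early_exit generated):

```c
/* Strip-mine the search: the fixed-count inner loop has no early exit,
 * so the compiler may vectorize it as an or-reduction; the break then
 * happens once per 64-byte chunk, and the tail loop pins down the
 * exact position. */
static long find_byte(const unsigned char *p, long n, unsigned char c) {
    long i = 0;
    for (; i + 64 <= n; i += 64) {
        int any = 0;
        for (int j = 0; j < 64; j++)
            any |= (p[i + j] == c);
        if (any)
            break;
    }
    for (; i < n; i++)   /* re-scan the hit chunk, plus the tail */
        if (p[i] == c)
            return i;
    return -1;
}
```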
Thanks for the suggestions. Sounds like you are working on some kind of parser?
So I looked at these briefly without AVX512, because only a tiny fraction of people have anything like that and the claim was that this would be a great way of making portable SIMD :-) Also, obviously you cannot use -ffast-math in real code.
parse_fract seems really, really inefficient. Even on plain SSE2, you can do it with unpack + sub + muladd + shuffle + add and then some scalar cleanup (plus the loads, of course). icx looks to be, what, 40 SIMD instructions?
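For concreteness, the core of that SSE2 chain could look roughly like this for a fixed run of 8 digits (the function name, the fixed width, and the absence of validation are all my assumptions; real code has to handle shorter runs):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Turn 8 ASCII digits at p into an integer: unpack bytes to 16-bit,
 * subtract '0', then two pmaddwd rounds fold digit pairs together
 * (x10, then x100), and a scalar multiply-add finishes the job. */
static uint32_t parse8digits_sse2(const char *p) {
    __m128i raw  = _mm_loadl_epi64((const __m128i *)p);
    __m128i zero = _mm_setzero_si128();
    __m128i d    = _mm_unpacklo_epi8(raw, zero);
    d = _mm_sub_epi16(d, _mm_set1_epi16('0'));
    __m128i p1 = _mm_madd_epi16(d, _mm_set_epi16(1, 10, 1, 10, 1, 10, 1, 10));
    __m128i p2 = _mm_madd_epi16(_mm_packs_epi32(p1, zero),
                                _mm_set_epi16(0, 0, 0, 0, 1, 100, 1, 100));
    uint32_t hi = (uint32_t)_mm_cvtsi128_si32(p2);
    uint32_t lo = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(p2, 4));
    return hi * 10000 + lo;  /* scalar cleanup */
}
```

Scale by 1e-8 (or by however many digits were actually present) to get the fractional value.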
find_unmatched is just scalar code cosplaying SIMD; 150 instructions and most of them do one byte at a time.
find_special seemingly generates _thousands_ of instructions! What you need, per 16-byte loop iteration, is a pshufb + compare + pmovmskb (0x80.pl is down now, but IIRC it's explained there). It helps that all of the values in question differ in one of the nibbles.
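Since '<' (0x3C), '\r' (0x0D), '\0' (0x00) and '&' (0x26) all have distinct low nibbles, the trick can look like this (my reconstruction of the technique, not the 0x80.pl code; find_special_ssse3 is an illustrative name):

```c
#include <tmmintrin.h>  /* SSSE3 */
#include <stddef.h>
#include <stdint.h>

/* Low-nibble LUT: entries 0x0, 0x6, 0xC, 0xD hold the target bytes,
 * everything else holds 0xFF, which never equals a byte < 0x80.
 * For bytes >= 0x80, pshufb returns 0, which never equals them either. */
static size_t find_special_ssse3(const uint8_t *p, size_t n) {
    const __m128i lut = _mm_setr_epi8(
        0x00, -1, -1, -1, -1, -1, 0x26, -1,
        -1, -1, -1, -1, 0x3C, 0x0D, -1, -1);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        __m128i t = _mm_shuffle_epi8(lut, v);             /* pshufb  */
        int m = _mm_movemask_epi8(_mm_cmpeq_epi8(t, v));  /* pmovmskb */
        if (m)
            return i + (size_t)__builtin_ctz((unsigned)m);
    }
    for (; i < n; i++)  /* scalar tail */
        if (p[i] == '<' || p[i] == '\r' || p[i] == '\0' || p[i] == '&')
            return i;
    return n;
}
```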
I am not that convinced that this is a usable universal technique :-) Even after switching back to supporting AVX512, the generated code is a huge mess. Seemingly these three functions together clock in at 6+ kB, which is ~10% of your entire icache.
> Sounds like you are working on some kind of parser?
Yeah, I wouldn't use this type of thing for these problems; it was a huge mess. But I think code designed for autovectorization is reasonable as a scalar base implementation for a larger set of problems than people think.
I've seen the problem before where people explicitly vectorized something with SSE, but an autovectorized (AVX) build outperformed the explicitly vectorized one, because the compiler could take advantage of newer extensions.
You really should be able to use an ISPC-like subset in your C code. OpenMP goes in the right direction, but it's not designed with SIMD as the highest priority.
Yes, I'd say that the strongest part of autovectorization is that you can get more-or-less automatic support for wider/newer instruction sets than what you had when you wrote the code; older SIMD, like all other code, tends to rot. Of course, this is predicated on either having function multiversioning (works, but very rare) or being able to frequently raise the minimum target for your binary.
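One way to get that multiversioning today with GCC/Clang is the target_clones attribute (the function and the target list here are illustrative): the compiler emits one clone per listed target plus a baseline, and an ifunc resolver picks the best variant at load time, so the same plain loop can pick up AVX-512 on machines that have it:

```c
/* Build one clone per target; glibc's ifunc mechanism selects the
 * best variant at load time. Requires gcc/clang on an ELF platform. */
__attribute__((target_clones("default", "avx2", "avx512f")))
void scale(float *restrict dst, const float *restrict src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}
```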