More

glangdale · 2025-06-14T00:55:31 1749862531

Yes, that's it. Vectorized SIMD annihilates this problem, a space I've been working in since 2006 and it wasn't all that new even then. A close second would be a heavily optimized (pipelined and less branchy) table or bitvector lookup. Doing anything that involves lots of control flow, like the grandparent post, will be slow as a wet week with our without bit manipulation tricks due to the inherently unpredictable nature of the branches (subject to our input).

glangdale · 2025-06-14T00:20:27 1749860427

To paraphrase Arthur Dent, "this must be a strange new usage of the word 'fastest' with which I am not previously familiar".

The fastest way to detect a vowel in a string on any reasonable architecture (Intel or AMD equipped with SIMD of some kind) is using 3-4 instructions which will process 16/32/64 (depends on SIMD length) bytes at once. Obviously access to these will require using a Python library that exposes SIMD.

Leaving SIMD aside, a flat byte array of size 256 will outperform a bitmap since it's always faster to look up bytes in an array than bits in a bitmap, and the size is trivial.

windward · 2025-06-26T13:52:36 1750945956

I don't think the size is trivial, it's 4 times the cache length on x64

glangdale · 2025-07-01T08:03:33 1751357013

I'm not sure what you mean here. A single cache line is 64B, and this table would thus occupy 4 cache lines, but typical x86 cache sizes are 32K or for more recent cores 48K. Whether consuming 1/512th or 1/768th of your level 1 cache is excessive is a value judgement, but most people wouldn't think so.

glangdale · 2025-01-15T00:57:52 1736902672

100% this. Our kids were required to bring laptops to school for no particularly good reason, then allowed to zombie out on them in the library during lunch and free periods. Infuriating.

dchichkov · 2025-01-15T01:15:19 1736903719

I understand that it is mostly regulated at the state level. I'm not sure about other states, but The Computer Science Standards for California Public Schools (Kindergarten through Grade Twelve) also tend to be followed by private schools. So they can claim their programs meet state requirements.

This brings computers into the classroom, and once they’re available, it is a slippery slope. It is easier for teachers to have students use semi-gamified "educational" apps rather than engage themselves.

Example for K-2 - https://www.cde.ca.gov/be/st/ss/documents/csstandards.pdf:

  K-2.CS.1 Select and operate computing devices that perform a variety of tasks accurately and quickly based on user needs and preferences.

  K-2.CS.2 Explain the functions of common hardware and software components of computing systems.

  K-2.CS.3 Describe basic hardware and software problems using accurate terminology.

  K-2.NI.4 Model and describe how people connect to other people, places, information and ideas through a network.

  ...

  K–2 K-2.AP.12 Create programs with sequences of commands and simple loops, to express ideas or address a problem

  K-2.IC.20 Describe approaches and rationales for keeping login information private, and for logging off of devices appropriately

glangdale · 2025-01-15T02:22:32 1736907752

Yes, we have similar metrics in NSW (Australia). Agreed on the dynamics. There are also a lot if fairly feral edutech entrepreneurs playing special interest capture here - they obviously care more about selling their dubious education novelties than any one group cares about keeping them out. So our kids' schools are littered with semi-functioning "smart whiteboards" and a host of broken edutech apps.

glangdale · 2025-01-15T00:53:47 1736902427

I'm sure you are a great person and all that, but in my experience, this particular recipe has produced absolute legions of smug, arrogant people who are nowhere near as smart as they think they are. Many of these people were dangerously unprepared for a world where they weren't the smartest person in the room in a not-very-smart room.

glangdale · 2025-01-13T21:59:23 1736805563

It's magnificently Paul Graham that he wrote some incredibly long essay called "The Origins Of Wokeness" without ever discussing, the origins of wokeness. Whatever you think about the current situation of "wokeness", the fact that pg manages to never once mention the origin of the term, going back to Marcus Garvey and Leadbelly, speaks to pg's monumental intellectual incuriosity.

glangdale · 2024-12-22T07:17:14 1734851834

Not OP, but we use these tables all the time in Hyperscan (for string and character class matching) and it's a long-standing technique to use it for things like SIMD population count (obsoleted now if you have the right AVX-512 ISA, ofc).

glangdale · on June 13, 2024

8 instructions seems very solid - guessing AND/PSHUFB for low nibble, SHIFT/AND/PSHUFB for high nibble, OR to combine plus load/store?

If you have AVX-512, GFNI is faster for this task, but obviously many situations where you can't use it.

Const-me · on June 13, 2024

> guessing AND/PSHUFB for low nibble, SHIFT/AND/PSHUFB for high nibble, OR to combine

Yeah, that’s exactly what I did in my C++ code with intrinsics.

About ISA extensions, I’m lucky to work on a professional CAM/CAE software. We have specified AVX2 in the system requirements, I’m guaranteed to have the support on our customer’s computers. However, very few of them have AVX512 CPUs so we are ignoring that thing so far.

glangdale · on March 25, 2024

This would be, speaking as someone who has dealt with his share of state machines, quite confusing. The uncomfortable thing about regular expression implementation as automata is either (a) non-determinism (i.e. "being in many states at once" a la Glushkov or Thompson NFAs, not "non-determinism meaning something different might happen on any given execution") or (b) state explosion (in a DFA, to represent non-determinism).

This has a huge impact on trying to hook actions, call-backs, etc (your "arbitrary procedure") to NFA states as you're frequently making many overlapping entries to these states, many of which go nowhere. Trying to figure out which entries correspond to which other entries isn't easy.

There are very stylized automata that are used in parsing that do interesting things beyond the finite automata space (like pushing things onto a stack), but they don't correspond to regex per se - instead they are generated from a grammar.

rhelz · on March 27, 2024

re: nondeterministic implementations: Detecting and resolving nondeterminism is something every method of implementing state machines has to consider. Even when implementing a state machine by hand coding switch statements, you have to worry about whether or not you have inadvertently specified an indeterminism.

> interesting things beyond the finite automata space

State machines certainly can be glorified into LR parsers, or even into Turing machines by adding various bells and whistles. So sure, this idea shouldn't be limited to just regular expressions--we've got good methods of specifying even very complicated state handling.

I hope you saw the excellent comment on this thread where somebody talked about how Ken Thompson implemented paxos using yacc for state management. I've been around the block with state management too, but using yacc for state management is something that never occurred to me, and frankly, opens up a whole new world of possibilities.

BTW, it's also an answer to your point about the pitfalls of nondeterminism: yacc is a great example of how to specify a state machine while detecting, reporting, and resolving nondeterminism.

glangdale · on Jan 20, 2023

I always thought it was an odd turn of phrase; at most, you might have your speed detected by aircraft. "Enforced" always conjured up a picture of speed demons being strafed from above until they slowed down; maybe it's just me?

philwelch · on Jan 20, 2023

Yeah, there’s a very old meme that’s a photo of one of those signs with an attack helicopter lurking behind it.

glangdale · on Dec 5, 2021

I started to build "simdcsv" - based on similar principles to "simdjson", but succumbed to lassitude. There's still a fork kicking around.

Doing the CSV version of the quote convention is actually wildly easier than the JSON version, as you can just treat a record like "foo""bar",hi,1,"hatstand,teakettle" as having 'left the quotes and rentered the quotes" at the double-quote spot when you're busy looking for ','. This isn't, of course, much help for normalization, but for the bit where you're hoping to simply find which ',' characters are separators, it's fine to pretend that there's a gap in your "quoted stuff bitfield" that happens at the "" in "foo""bar", as of course that gap isn't going to land on a ',' anyhow. So it's much cheaper than doing a tedious shenanigan to handle some crafty user who has hit you with 100 \ characters in a row (as opposed to 99 or 101), a la JSON.

IMO the cost of doing the CLMUL all the time vs taking a conditional to handle the 'I have no quotes today' case is pretty low. It makes for shorter and more easily understood (in performance terms) code. I wasn't allowed to take this to extremes in simdjson - we did handle the "common case" for UTF-8 validation (although I wonder if that's a very culturally determined notion of "common case").