Re2c (re2c.org)
110 points by todsacerdoti on Feb 22, 2024 | 41 comments


Some years ago, I heard of several projects using the ragel state machine compiler for parsing network protocols http://www.colm.net/open-source/ragel/. It sounded great, but I found it was too low-level for me: it seemed that to use it effectively, I would have to attain oneness with the state machine, whereas one of the great advantages of regular expressions is that I don’t have to think (much) about how they are implemented.

On the other hand, for a small project I wanted to wire up a C library (adns) to a simple parser, but adns lacks bindings to higher-level languages. So I thought, why not go retro and use the classic Unix swiss army knife, lex [•]. And it was actually a lot nicer than I expected, and the results were great.

More recently I used re2c for a C/C++ lexer. Its user interface is fairly similar to lex, but cleaned up for modern tastes – in fact, re2c is easier and more flexible than lex, I think. And like ragel, re2c produces a directly-coded state machine, instead of a table-driven state machine. So the resulting code is (generally speaking) bigger but faster than lex.
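
To give a flavor, here's a minimal sketch (token names are my own invention): you write ordinary C with a special comment block, and re2c replaces the block with a directly-coded state machine.

    // Minimal re2c sketch -- compile with: re2c lexer.re -o lexer.c
    // The /*!re2c ... */ block is replaced by a directly-coded DFA.
    #include <stdio.h>

    enum { TOK_ERR, TOK_NUM, TOK_IDENT };

    static int lex(const char *YYCURSOR)
    {
        const char *YYMARKER;
        /*!re2c
            re2c:define:YYCTYPE = char;
            re2c:yyfill:enable  = 0;   // input is NUL-terminated

            [0-9]+                  { return TOK_NUM; }
            [a-zA-Z_][a-zA-Z_0-9]*  { return TOK_IDENT; }
            *                       { return TOK_ERR; }
        */
    }

    int main(void)
    {
        printf("%d\n", lex("hello"));  // prints 2 (TOK_IDENT)
        return 0;
    }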

I think re2c is a great tool, and its developers are helpful and responsive.

[•] perl was referred to as a unix swiss army chainsaw, because where lex is regexes + C, perl is regexes + scripting, and perl allows you to go from idea to huge mess much faster


Hm, as an re2c user, I always thought that Ragel vs. re2c was a push vs. pull issue (though re2c has a --storable-state flag to generate the push style -- I haven't used it)

But now that I look at the example on the front page: http://www.colm.net/open-source/ragel/

    action dgt      { printf("DGT: %c\n", fc); }
    action dec      { printf("DEC: .\n"); }

    number = (
        [0-9]+ $dgt ( '.' @dec [0-9]+ $dgt )?
I don't like how the dgt and dec actions are "inline" in the regex.

This seems to be breaking the key abstraction of regexes, which in my mind is "NON-DETERMINISM for free".

The better mechanism is submatch extraction, which re2c supports, and about which its authors even wrote a paper: https://arxiv.org/abs/1907.08837 (also see https://re2c.org/)

(I don't use re2c submatch extraction in https://www.oilshell.org, but I just used it in a brand new program, and it works great. Very easy and natural.)
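
To give a flavor of it, here's a sketch of how tag-based submatch looks in re2c (hypothetical rule; the setup follows the re2c tags docs as I understand them -- double-check against https://re2c.org before copying):

    // Sketch of re2c submatch extraction via tags: @maj and @min mark
    // positions in the regex, and re2c assigns the like-named pointer
    // variables when the rule matches.
    // Build with: re2c --tags ver.re -o ver.c
    #include <stdio.h>

    static int parse_version(const char *YYCURSOR)
    {
        const char *YYMARKER;
        const char *maj, *min;  // filled in by the generated code
        /*!re2c
            re2c:define:YYCTYPE = char;
            re2c:yyfill:enable  = 0;

            @maj [0-9]+ "." @min [0-9]+ {
                printf("major=%.*s minor=%s\n", (int)(min - maj - 1), maj, min);
                return 0;
            }
            * { return -1; }
        */
    }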

So I think Ragel is doing something Perl-ish, recasting regular languages as imperative. But that can be asymptotically slower, and it breaks reasoning by sets, forcing you to reason by stepping through code.


Check out Kleenex, which also allows you to inline actions in the (regular) grammar, but which promises to only execute them once it has resolved all non-determinism for the input seen so far: https://kleenexlang.org/

The compiler translates regular grammars with embedded semantic actions to fast C code.

Full disclosure: I am one of the authors.


A beautiful thing. If my C program is working with strings in any capacity larger than some copying around, this is what I use. No reason to muck around with strsep, strpbrk, etc. when you can use re2c and a few gotos to implement pretty much any matching pattern safely. Especially useful for dealing with text protocols. Making hand-written recursive descent parsers is also a breeze with this as a building block.


One fun project in this vein is to DIY something similar to this. To simplify things initially, you can use NFAs, along with an existing library to parse the regex syntax.

The aha moment comes when you see how regex syntax compiles down to various configurations of automata. Couple that with the fact that automata are made to be composed together well, and the result is beautiful in a way that you rarely see in production code.

Here's my stab at it in Rust: https://github.com/mattgreen/lexer/tree/master/src/nfa
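
In the same spirit, here's a tiny C sketch of the fragment-composition idea (all names invented): each operator builds a small NFA piece, and the pieces snap together.

    // Toy sketch of Thompson-style NFA composition. Each state matches
    // one character, or is a SPLIT (epsilon fork) or MATCH.
    #include <stdlib.h>

    enum { SPLIT = 256, MATCH = 257 };

    typedef struct State {
        int c;                    // character to match, or SPLIT/MATCH
        struct State *out, *out1; // successor states
    } State;

    static State *mkstate(int c, State *out, State *out1) {
        State *s = malloc(sizeof *s);
        s->c = c; s->out = out; s->out1 = out1;
        return s;
    }

    // "ab": concatenation is just chaining the out pointers.
    static State *concat_ab(State *next) {
        return mkstate('a', mkstate('b', next, NULL), NULL);
    }

    // "a|b": a SPLIT state forks into both alternatives.
    static State *alt_ab(State *next) {
        return mkstate(SPLIT, mkstate('a', next, NULL),
                              mkstate('b', next, NULL));
    }

    // "a*": a SPLIT whose match branch loops back to it.
    static State *star_a(State *next) {
        State *split = mkstate(SPLIT, NULL, next);
        split->out = mkstate('a', split, NULL);
        return split;
    }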


Concurrent, parallel execution of an NFA directly in Elixir:

https://github.com/mike-french/myrex

It is concurrent in all senses: a single match is split into many concurrent traversals of the network; multiple inputs can be matched concurrently within the same network; and string generators can also run concurrently in the network.

That's possible because all state is in the traversal messages, not in the process nodes, and the whole thing runs asynchronously (non-blocking) in parallel, automatically using all cores on the machine.

> you see how regex syntax compiles down to various configurations of automata

That is Thompson's Construction [1]. The myrex README contains a long description of how regex structures map to small process networks, and how they glue together. The final process network is almost a direct 1-1 representation of the NFA.

[1] Russ Cox has a nice explanation https://swtch.com/~rsc/regexp/regexp1.html


> re2c is a free and open-source lexer generator for C/C++, Go and Rust with a focus on generating fast code.


I often wonder if the existence of lexer and parser generator tools like this is indicative that common general purpose languages are lacking the expressiveness to implement things like this directly.

I feel like parsing is common enough that this could be a language feature or a core library making use of higher-level language functionality.


re2c is historically a regular expression compiler for C (hence the name); C (and, until recently, C++) didn't include a regular expression engine.

I used an earlier version of re2c for one project since it was far easier to get started with than lex/flex (which, iirc, also wasn't object-local/thread-safe in the past).

Perl, JavaScript, Java, C#, etc. all include regexp engines these days, so you can quickly build an abstraction that provides roughly what re2c gives.

One huge benefit re2c-style generators still have, however, is that they're precompiled and quite efficient compared to generic regexp libraries, if performance matters.

If you want something like re2c with the same performance, the target language would first need macro or precompilation facilities (iirc there are C++ regexp systems now that are compiled at compile time and should have performance similar to re2c's).


C on Unix and POSIX has had regular expressions in the standard library since the 1980s https://pubs.opengroup.org/onlinepubs/007908799/xsh/regex.h....
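
A minimal sketch of that API:

    // POSIX <regex.h> usage: regcomp/regexec/regfree.
    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        regex_t re;
        regmatch_t m[2];  // m[0] = whole match, m[1] = first group

        if (regcomp(&re, "([0-9]+)", REG_EXTENDED) != 0)
            return 1;
        if (regexec(&re, "abc 42 def", 2, m, 0) == 0)
            printf("group 1 at [%d, %d)\n", (int)m[1].rm_so, (int)m[1].rm_eo);
        regfree(&re);
        return 0;
    }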


Still not in the language; it would be interesting to know why someone pushed this into POSIX and didn't propose it for the language itself.

Today Windows is the biggest non-POSIX system remaining afaik (apart from Android and consoles), but back in the 80s and 90s there were MacOS, STOS, AmigaOS, etc., so saying it was in Unix/POSIX has little connection to the language at large.


> If you want something like re2c with the same performance, the target language would first need macro or precompilation facilities.

Yeah, I think you could build something like that in Zig with comptime.


C# has compile-time precompiled regular expressions now, too. The enabling C# feature is source generators.


They are extremely fast too (way faster than re2): https://github.com/BurntSushi/rebar?tab=readme-ov-file#summa...


NOTE: re2 is not the same as re2c here.

re2 is Google's safe regexp engine. It works as a state machine and isn't susceptible to the exponential-runtime attacks that can affect backtracking engines; it trades some performance for guaranteed non-fatal worst-case behavior.

The C# compiled regexes are backtracking (fast by default), JIT-compiled searchers, so they're extremely fast (on the level of source-compiled matchers).


Note that .NET has a non-backtracking engine now too. It's included in the rebar measurements.


Yeah, noticed; just pointing out that it's not 2x faster (still faster though, and that's impressive!).

I implemented a basic DFA-based engine like re2 once as an exercise, and there was a performance cost (and building the DFA cost a bit too), so I wonder if they JIT the non-backtracking variant as well?


Not sure. But rust/regex (of which I am the author) is above pcre2/jit in rebar's measurements. Hyperscan is too. All of rust/regex, Hyperscan and RE2 use finite automata, no JIT and no backtracking.

There is more to the perf of a regex engine than whether it uses a JIT or not. :)

Many of the curated rebar benchmarks (in the README) do not just list numbers; they also try to contextualize those numbers. (I am only one person, though, with limited expertise, so they tend to bias toward what I know. But still, it's miles better than what you'll see in most benchmarks IMO.)

And of course, make sure to read the BIAS document.


Note that rebar doesn't include C#'s compile-time regexes in the benchmark. The only reason it doesn't is that nobody has implemented it yet. In rebar's model, it would require writing a C# program that compiles another C# program. It's pretty annoying. It's the same reason why CTRE isn't in rebar either. There's nothing preventing it other than the work to make it happen.


This is correct, but both source-generated and runtime-compiled regexes should be in the same ballpark performance-wise (if somewhat favoring the former), so it's worth a mention :)


Oh yeah absolutely. Sorry, I only meant to give a very narrow clarification. :-) I just didn't want folks thinking that rebar was providing measurements for C# compile time regexes. You're absolutely right that that might mean compile time regexes are even faster than what rebar reports.


After using re2c, I actually think it should have been built into C. Why?

- Because Ken Thompson co-created Unix and designed C's predecessor B, and he ALSO brought regular languages to computing. IIRC he basically wrote the first version of grep, and compilers like re2c use the same automata-based methods as the original grep.

https://www.oilshell.org/archive/Thompson-1968.pdf

- Because C is really bad at string processing. Regular languages would be a huge improvement; looking through the implementation of most shells should convince you of that.

Thompson also wrote the very first Unix shell, of course! https://www.oilshell.org/blog/2021/08/history-trivia.html


Different tasks are best served by different modes of evaluation, not just a library. This is why the Lisp community advocates a many-DSL design approach.


Indeed, it would be great as a language feature.


See Raku's grammars [1], probably the most direct language support for parsing in a mainstream language (errr... or a lang that _wants_ to be mainstream :-).

--

1: https://docs.raku.org/language/grammars


I suspect that lexer generators mostly exist because that's what people who want to implement compilers are being taught. But there's nothing difficult about writing a lexer. It can be done in less than 100 LoC, easily half that if your language has good string ergonomics and support for pattern matching.
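
For illustration, the core of a hand-rolled lexer in C is just a loop over character classes (token names made up):

    #include <ctype.h>

    typedef enum { TOK_EOF, TOK_NUM, TOK_IDENT, TOK_ERR } TokKind;
    typedef struct { TokKind kind; const char *start; int len; } Token;

    Token next_token(const char **p) {
        while (isspace((unsigned char)**p)) (*p)++;
        const char *s = *p;
        if (**p == '\0')
            return (Token){TOK_EOF, s, 0};
        if (isdigit((unsigned char)**p)) {
            while (isdigit((unsigned char)**p)) (*p)++;
            return (Token){TOK_NUM, s, (int)(*p - s)};
        }
        if (isalpha((unsigned char)**p) || **p == '_') {
            while (isalnum((unsigned char)**p) || **p == '_') (*p)++;
            return (Token){TOK_IDENT, s, (int)(*p - s)};
        }
        (*p)++;  // unknown character
        return (Token){TOK_ERR, s, 1};
    }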


Wouldn't the hand-written one be more work to write and measurably slower than the re2c-generated one?


It's really not much work, and it removes another source of dependency/build bloat. Also, if you care about speed, hand-rolling your lexer will always be the fastest (and even naive implementations can rival/outcompete generators).


Not only that, but a manually written lexer has much more flexibility. It lets you easily implement things that lexer generators struggle with: significant indentation, nested comments, interpolation inside string literals. Sometimes you can't do these at all with generators; sometimes you can, but the code becomes a mess. That's why many popular programming language implementations use a hand-written lexer and parser.
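
Nested comments are a nice example: they're not a regular language, but in a hand-written lexer they're just a depth counter. A sketch, assuming (* ... *) delimiters:

    // Skip a nested comment; caller has already consumed the opening "(*".
    const char *skip_nested_comment(const char *p) {
        int depth = 1;
        while (*p && depth > 0) {
            if (p[0] == '(' && p[1] == '*')      { depth++; p += 2; }
            else if (p[0] == '*' && p[1] == ')') { depth--; p += 2; }
            else p++;
        }
        return p;  // just past the closing "*)" (or at NUL if unterminated)
    }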


re2c makes it easy to drop into hand-written code where necessary. For example, in C, backslash-newline makes a huge mess of everything unless you handle it underneath the lexer, which works nicely with re2c. In C++, raw string literals are not even context-free, let alone regular, so they have to be parsed with hand-written code. That too is straightforward with re2c.
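
A sketch of the backslash-newline trick (hypothetical refill function, not re2c's own API): splice continuations in the function that refills the lexer's buffer, so the re2c rules never see them at all.

    #include <stdio.h>

    // Fill buf from `in`, deleting every backslash-newline pair on the way.
    static size_t refill(FILE *in, char *buf, size_t cap) {
        size_t n = 0;
        int c;
        while (n < cap && (c = fgetc(in)) != EOF) {
            if (c == '\\') {
                int c2 = fgetc(in);
                if (c2 == '\n') continue;   // line splice: drop both chars
                buf[n++] = (char)c;
                if (c2 == EOF) break;
                ungetc(c2, in);
                continue;
            }
            buf[n++] = (char)c;
        }
        return n;
    }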


The problem with hand-written lexers/parsers (and some generated parsers, notably PEG these days) is that you have no idea if you can actually trust it.

There are so many weird edge cases for lexing and parsing that I would never dare to do it without machine verification.


Ran across some re2c-generated functions in some software I was reverse engineering recently. It probably speaks more to the compiler/source used, but the generated code was incredibly verbose (400-odd nested error-handling branches).

On the upside, it generated functions too long to decompile, so that’s a handy feature I guess?


I use it for a Scheme lexer with full(?) Unicode support -- just for science, I turned off the 'case-ranges' flag and it produces a lex function that is > 27,000 LoC.


I like to think this is pronounced "reducey"


re2c works like a charm. I'm generally confident that I couldn't write faster scanners by hand if I tried.


It sucks that re2c simply can't parse indentation-based formats like YAML or Python. Had to resort to nom to be able to do it.


This is sort of a category error... re2c is a lexer generator, and YAML and Python are recursive/nested formats.

You can definitely use re2c to lex them, including indentation, but it's not the whole solution. You need a parser too.

I use it for everything possible in https://www.oilshell.org, and it's amazing. It really reduces the amount of fiddly C code you need to parse languages, and it drops in anywhere.

Parser generators usually have some downside, but there's not much downside to the lexer generators IMO. It just saves you work. And the regex syntax is better than Perl/Python because literals are consistently quoted.


Note: rather than "lexer generator", "regular language to state machine compiler" is a better description.

Lexers can be built with re2c, but that's not even the whole story. This is good because it means it's "policy free": you can use it anywhere.


Re2c only performs the tokenizing part, not the parsing part. It should be able to run a regex that recognizes N spaces and produces an n-space token; it's then up to the parser to use that to determine the indentation of a statement.
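
A sketch of that division of labor (names hypothetical): the lexer reports the leading-whitespace width of each line, and a small stack on the parser side turns widths into INDENT/DEDENT events, Python-style.

    enum { MAX_DEPTH = 64 };
    static int indent_stack[MAX_DEPTH] = {0};
    static int top = 0;

    // Returns +1 for one INDENT, -n for n DEDENTs, 0 for no change.
    // (A real implementation would also reject widths that don't land
    // exactly on a value already on the stack.)
    int track_indent(int width) {
        if (width > indent_stack[top]) {
            indent_stack[++top] = width;
            return 1;
        }
        int dedents = 0;
        while (top > 0 && width < indent_stack[top]) {
            top--;
            dedents++;
        }
        return -dedents;
    }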


This is what the Ninja build system uses, by the way. Incredibly fast.


Just why is it called re2c for Go and Rust, where the actual tool is re2go or re2rust, but you still use re2c in the syntax?

It's of course not beyond the intellectual capabilities of users, but it just looks kind of silly. It could have been rebranded, IMO, when the new languages were introduced.



