If that is the only incompatibility, it would be easy to make a patch that check...

burntsushi · on March 9, 2021

It's not even close to the only incompatibility. :-) That's a nominal one. If that were really the only thing, then sure, I would provide a way to make ripgrep work in a POSIX compatible manner.

There are lots of incompatibilities. The regex engine itself is probably one of the hardest to fix. The incompatibilities range from surface level syntactic differences all the way down to how match regions themselves are determined, or even the feature sets themselves (BREs for example allow the use of backreferences).
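
To make the "match regions" point concrete, here is a minimal sketch using Python's `re` module as a stand-in. It is not ripgrep's engine and not a POSIX engine. Python's `re` picks the first alternative that matches at the leftmost position (leftmost-first), while POSIX specifies the longest match at the leftmost position (leftmost-longest). The helper names are hypothetical, and the leftmost-longest emulation only handles plain literal alternatives. The last helper uses a backreference, the kind of feature BREs allow.

```python
import re


def leftmost_first(pattern, text):
    """Return the text matched by Python's re (leftmost-first), or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


def leftmost_longest_literals(literals, text):
    """Emulate POSIX leftmost-longest for an alternation of plain literals.

    Assumption: every alternative is a literal string, not a regex.
    """
    best = None  # ((start, -length), literal)
    for lit in literals:
        start = text.find(lit)
        if start == -1:
            continue
        # Smaller start wins; on a tie, the longer literal wins.
        key = (start, -len(lit))
        if best is None or key < best[0]:
            best = (key, lit)
    return best[1] if best else None


def has_doubled_word(text):
    """Use a backreference (\\1) to detect a word repeated twice in a row."""
    return re.search(r"\b(\w+) \1\b", text) is not None
```

With the input `samwise`, the pattern `sam|samwise` yields `sam` under leftmost-first, while the leftmost-longest emulation over `["sam", "samwise"]` yields `samwise`.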

Then of course there's locale support. ripgrep takes a more "modern" approach: it ignores locale support and instead just provides what level 1 of UTS#18 specifies. (Unicode aware case insensitive matches, Unicode aware character classes, lots of Unicode properties available via \p{..}, and so on.)
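
As a rough illustration of what "Unicode aware" means here, the sketch below again uses Python's `re` module as a stand-in. It is an assumption-laden analogy, not ripgrep's behavior: Python's `re` does not support `\p{..}`, and the `matches` helper is hypothetical. It only shows that the same pattern gives different answers depending on whether matching is Unicode aware or restricted to ASCII.

```python
import re


def matches(pattern, text, flags=0):
    """Return True if the whole text matches the pattern under the given flags."""
    return re.fullmatch(pattern, text, flags) is not None


# Greek small delta vs. capital delta: equal only under Unicode-aware case folding.
unicode_case = matches("\u03b4", "\u0394", re.IGNORECASE)
ascii_case = matches("\u03b4", "\u0394", re.IGNORECASE | re.ASCII)

# "naive" spelled with U+00EF: \w covers it only when classes are Unicode aware.
unicode_word = matches(r"\w+", "na\u00efve")
ascii_word = matches(r"\w+", "na\u00efve", re.ASCII)
```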

Sebb767 · on March 9, 2021

Pity! I did look; only "-E" and "-s" diverge from the POSIX standard parameter-wise. But making significant changes to the pattern engine is probably not worth it.

Thanks anyway, I enjoy rg quite a lot :)

nicoburns · on March 9, 2021

It's worth noting that the implementation of ripgrep has been split up into a whole bunch of modular components. So it wouldn't be out of the question for someone to piece those together into a GNU-compatible grep implementation.

orra · on March 9, 2021

True, though is there any point? ripgrep's homegrown regex engine only supports true regular expressions.

To give backreference support, ripgrep can optionally use PCRE. But PCRE already comes with its own drop in grep replacement...

burntsushi · on March 9, 2021

To the extent that you want to get a POSIX compatible regex engine working with ripgrep, you could patch it to use a POSIX compatible regex engine. The simplest way might be to implement the requisite interfaces by using, say, the regex engine that gets shipped with libc. This might end up being quite slow, but it is very doable.
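
A minimal sketch of the "implement the requisite interfaces" idea, in Python rather than ripgrep's actual Rust code. The `Matcher` protocol, both matcher classes, and the `grep` function are hypothetical names for illustration and do not correspond to ripgrep's real interfaces. The search loop depends only on the interface, so a backend wrapping a POSIX engine (for example, libc's) could in principle be dropped in where the stand-in backends are used here.

```python
import re
from typing import Iterable, Iterator, Optional, Protocol, Tuple


class Matcher(Protocol):
    """Hypothetical interface: the only thing the search loop knows about."""

    def find(self, line: str) -> Optional[Tuple[int, int]]:
        ...


class PythonReMatcher:
    """Stand-in backend using Python's re (not POSIX compatible)."""

    def __init__(self, pattern: str) -> None:
        self._regex = re.compile(pattern)

    def find(self, line: str) -> Optional[Tuple[int, int]]:
        m = self._regex.search(line)
        return m.span() if m else None


class LiteralMatcher:
    """A second backend, to show the loop does not care which engine it gets."""

    def __init__(self, literal: str) -> None:
        self._literal = literal

    def find(self, line: str) -> Optional[Tuple[int, int]]:
        start = line.find(self._literal)
        return (start, start + len(self._literal)) if start != -1 else None


def grep(matcher: Matcher, lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line the matcher finds a match in."""
    for number, line in enumerate(lines, start=1):
        if matcher.find(line) is not None:
            yield number, line
```

Swapping engines then means writing one more class with a `find` method; the cost, as noted above, is whatever speed the chosen engine gives up.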

But still, that only solves the incompatibilities with the regex engine. There are many others. The extent to which ripgrep is compatible with POSIX grep is that I used flag names similar to GNU grep where I could. I have never taken a fine toothed comb over the POSIX grep spec and tried to emulate the parts that I thought were reasonable. Since POSIX wasn't and won't be a part of ripgrep's core design, it's likely there are many other things that are incompatible.

A POSIX grep can theoretically be built with a pretty small amount of code. Check out busybox's grep implementation, for example.

While building a POSIX grep in Rust sounds like fun, I do think you'd have a difficult time with adoption. GNU grep isn't a great source of critical CVEs, it works pretty well as-is and is actively maintained. So there just isn't a lot of reason to. Driving adoption is much easier when you can offer something new to users, and in order to do that, you either need to break with POSIX or make it completely opt-in. (I do think a good reason to build a POSIX grep in Rust is if you want to provide a complete user-land for an OS in Rust, perhaps if only for development purposes.)

Sebb767 · on March 9, 2021

Well, the reasons I see for being POSIX-compatible would be:

1. Distributions could adopt rg as default and ship with it only, adding features at nearly no cost

2. The performance advantage over "traditional" grep

Number 1 is basically how bash became the default; since it is a superset of sh (or close enough at least), distributions could offer the feature set at no disadvantage. Shipping it by default would allow scripts on that distribution to take advantage of rg and, arguably, improve the situation for most users at no cost.

If one builds two programs in one with a switch, you're effectively shipping optional software, but in a single binary, which makes point 1 pretty moot. If you then also fall back on another engine, point 2 is moot as well - so the only point where this would actually be useful is if rg could become a good enough superset of grep that it would provide sufficient advantages (most greps _already_ provide a larger superset of POSIX, though). Everything else would just add unnecessary complexity, in my opinion.

But it would have been nice :)

burntsushi · on March 9, 2021

Ah I see. Yeah, that's a good point. But it's a very very steep hill to climb. In theory it would be nice though. There's just a ton of work to do to hit POSIX compatibility and simultaneously be as good as GNU grep at other things. For example, the simplest way to get the regex engine to be POSIX compatible would be to use an existing POSIX compatible regex engine, like the one found in libc. But that regex engine is generally regarded as quite slow AIUI, and is presumably why GNU grep bundles its own entire regex engine just to speed things up in a lot of common cases. So to climb this hill, you'd either need to follow in GNU grep's footsteps _or_ build a faster POSIX compatible regex engine. Either way, you're committing yourself to writing a regex engine.

emidoots · on March 9, 2021

I didn't look closely, but Oniguruma is pretty dang fast and has drop-in POSIX syntax + ABI compatibility as a compile-time option. Could maybe use that.

burntsushi · on March 10, 2021

The regex engine I maintain includes benchmarks against onig. It's been a couple years since I looked closely, but last I checked, onig was not particularly fast. Compare https://github.com/rust-lang/regex/blob/master/bench/log/07/... vs https://github.com/rust-lang/regex/blob/master/bench/log/07/...

emidoots · on March 10, 2021

Ahh, very interesting, thanks for sharing! Do you have any thoughts around why that is? I presume that's due to Oniguruma supporting a much broader feature set, and that something like fancy-regex's approach of mixing a backtracking VM with an NFA implementation for simple queries would be needed for better perf? (I am aware you played a role in that) [1]

I have been playing around recently with regex parsing by building parsers from parser combinators at runtime. No clue how it will perform in practice yet (structuring parser generators at runtime is challenging in general in low-level languages), but maybe that could pan out and lead to an interesting way to support broader sets of regex syntaxes like POSIX in a relatively straightforward and performant way.

[1] https://github.com/fancy-regex/fancy-regex#theory

burntsushi · on March 10, 2021

No idea. I've never done an analysis of onig. Different "feature sets" tends to be what people jump to first, but it's rarely correct in my experience. For example, PCRE2 has a large feature set, but it is quite fast. Especially its JIT.

The regex crate does a lot of literal optimizations to speed up searches. More than most regex engines in my experience.
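
One simple flavor of literal optimization can be sketched as a prefilter: if every match must contain some literal substring, a fast substring check can rule out most lines before the regex engine runs at all. This is a hedged illustration in Python, not the regex crate's implementation. The function name is hypothetical, and the required literal is passed in by hand here, whereas a real engine would derive it from the pattern.

```python
import re


def grep_with_prefilter(pattern, required_literal, lines):
    """Return (matching_lines, regex_calls).

    Assumption: the caller guarantees that every match of `pattern`
    contains `required_literal`, so lines without it can be skipped.
    """
    regex = re.compile(pattern)
    matched = []
    regex_calls = 0
    for line in lines:
        # Cheap substring scan first; skip the regex engine when it fails.
        if required_literal not in line:
            continue
        regex_calls += 1
        if regex.search(line):
            matched.append(line)
    return matched, regex_calls
```

The second return value counts how often the regex engine actually ran, which makes the saving visible on inputs where most lines lack the literal.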