Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> If the string to be matched against contains 20,000 space characters in a row, but not at the end, then the Regex engine will start at the first space, check that it belongs to the \s character class, move to the second space, make the same check, etc. After the 20,000th space, there is a different character, but the Regex engine expected a space or the end of the string. Realizing it cannot match like this it backtracks, and tries matching \s+$ starting from the second space, checking 19,999 characters. The match fails again, and it backtracks to start at the third space, etc.

That's not how backtracking works. A regex engine will only backtrack to try and make the rest of the regex match, i.e. it will take characters of the RHS of the string, not try and start "from the second character off the start of the string". I mean, if the engine tried matching from the second space, what would be matching the first space? Something has to.

Which meant, that even if the regex engine was incredibly stupid and could not figure out that a greedy block of \s was never going to contain a non-\s, it would only have to check 20,001 times, not 199000 (or whatever it was).

I can't reproduce this "bug" in either Perl or Python. The time taken to match a 30,000 block of space either followed by $ or XX$ was basically identical for \s+$.

There does appear to be normal backtracking going on, roughly doubling the search time for large strings terminating in non-\s. This is expected, as it has to check 20,000 during the first gobble, then 20,000 as it backtracks from the right 20,000 times.

    $ time perl -e '(" " x 100000000 . "X") =~ /\s+$/ && print "MATCH"'

    real	0m0.604s
    user	0m0.509s
    sys	0m0.094s

    $ time perl -e '(" " x 100000000) =~ /\s+$/ && print "MATCH"'
    MATCH
    real	0m0.286s
    user	0m0.197s
    sys	0m0.089s


> I mean, if the engine tried matching from the second space, what would be matching the first space? Something has to.

Some regex engines provide an API call that puts an implicit `.STAR?` at the beginning of the regex so that the semantics of the match are "match anywhere" as opposed to "match only from the start of the string." (This is in fact the difference between Python's `match` and `search` methods.) Assuming the OP was using this type of method, then this difference exactly explains why you can't reproduce it. I can:

    >>> import re
    >>> haystack = (' ' * 20000) + 'a'
    >>> re.match('\s+$', haystack) <-- is wicked fast
    >>> re.search('\s+$', haystack) <-- chugs a bit
indeed, this chugs a bit too:

    >>> re.match('.*?\s+$', haystack)
So technically, the OP left out this little detail, but it's a pretty common thing to find in regex libraries. In fact, Rust's library treats all searches as if they were re.search and provides no re.match function, instead preferring to require an explicit `^` to remove the implicit `.STAR?` prefix.


For anyone else who wants to time the examples without copying & pasting each line:

    python3 -m timeit -n 1 -r 3 -s "import re ; haystack = (' ' * 20000) + 'a'" -c "re.match('\s+$', haystack)"
1 loops, best of 3: 467 usec per loop

    python3 -m timeit -n 1 -r 3 -s "import re ; haystack = (' ' * 20000) + 'a'" -c "re.search('\s+$', haystack)"
1 loops, best of 3: 4.23 sec per loop

Options

    -n  how many times to execute statement
    -r  how many times to repeat the timer
    -s  setup code (run once)
    -c  command to run n times


I guess this must have been the scenario. Perl also defaults to re.search but does not exhibit the pathological case, maybe because it knows it can't find a $ in a block of \s.


By .? you mean .* ?


You are wrong and shouldn't speak authoritatively about things you know little about. Most regex engines, including the one in Perl, do backtrack as described. As a special case, Perl will match backwards from a $ (or in look-behind), in specific cases, but not in the case described, where the $ only appears in one part of an alternation. Similarly, as the other poster noted, most regular expression engines default to 'search' as the primitive, rather than matching only from the start.

So, yes, in a sequence of 20,000 characters not appearing at the end of the string, the typical regex engine will backtrack in a manner exposing O(n^2) performance in this case.

If you're going to test something, use the actual regexes they talk about, and perhaps the language/runtime they talk about.


Perl does a pretty good job at optimizing such things:

    $ time perl -Mre=debug -e '(" " x 100000000 . "X") =~ /\s+$/ && print "MATCH"'
    Compiling REx "\s+$"
    Final program:
       1: PLUS (3)
       2:   POSIXD[\s] (0)
       3: EOL (4)
       4: END (0)
    floating ""$ at 1..9223372036854775807 (checking floating) stclass POSIXD[\s] plus minlen 1 
    Matching REx "\s+$" against "                                                            "...
    Intuit: trying to determine minimum start position...
      Found floating substr ""$ at offset 100000001...
      (multiline anchor test skipped)
      looking for class: start_shift: 1 check_at: 100000001 rx_origin: 0 endpos: 100000001
      Does not contradict STCLASS...
    Intuit: Successfully guessed: match at offset 0
    Matching stclass POSIXD[\s] against "                                                            "... (100000001 bytes)
       0 <> <          >         |  1:PLUS(3)
                                      POSIXD[\s] can match 100000000 times out of 2147483647...
    100000000 <           > <X>       |  3:  EOL(4)
                                        failed...
                                      failed...
    Contradicts stclass... [regexec_flags]
    Match failed
    Freeing REx: "\s+$"
    
    real    0m0.427s
    user    0m0.312s
    sys     0m0.076s
Basically, the regex egine sees a $ anchor, moves to the end of the string, and finds that it can't ever match there.

It makes it quite hard to accidentally trigger such bad regex behavior. See for example https://perlgeek.de/blog-en/perl-tips/in-search-of-an-expone... (disclaimer: my own blog)


> I can't reproduce this "bug" in either Perl or Python.

You're doing it wrong then.

  #!/usr/bin/perl

  use Benchmark;

  sub trim {
      my $s=shift;
      $s =~ s/^[\s\u200c]+|[\s\u200c]+$//g;
      return $s;
  }

  timethis(100, sub { trim("x" . " " x 20000 . "x"); });
With Perl 5.18 I get 2.65/s on a fast iMac.


Is there a difference between greedy and non-greedy atoms?


In the order of searching and hence the match you can get, yes. In performance in the case of a non-match, no.


The regex a?a?a?a?a?aaaaa against the string aaaaa will complete in linear time if you use non-greedy ?s but with greedy matching it has exponential complexity. So, there is a difference.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: