> If the string to be matched against contains 20,000 space characters in a row,...

burntsushi · on July 20, 2016

> I mean, if the engine tried matching from the second space, what would be matching the first space? Something has to.

Some regex engines provide an API call that puts an implicit `.STAR?` at the beginning of the regex so that the semantics of the match are "match anywhere" as opposed to "match only from the start of the string." (This is in fact the difference between Python's `match` and `search` methods.) Assuming the OP was using this type of method, then this difference exactly explains why you can't reproduce it. I can:

    >>> import re
    >>> haystack = (' ' * 20000) + 'a'
    >>> re.match('\s+$', haystack) <-- is wicked fast
    >>> re.search('\s+$', haystack) <-- chugs a bit

indeed, this chugs a bit too:

    >>> re.match('.*?\s+$', haystack)

So technically, the OP left out this little detail, but it's a pretty common thing to find in regex libraries. In fact, Rust's library treats all searches as if they were re.search and provides no re.match function, instead preferring to require an explicit `^` to remove the implicit `.STAR?` prefix.

tedmiston · on July 20, 2016

For anyone else who wants to time the examples without copying & pasting each line:

    python3 -m timeit -n 1 -r 3 -s "import re ; haystack = (' ' * 20000) + 'a'" -c "re.match('\s+$', haystack)"

1 loops, best of 3: 467 usec per loop

    python3 -m timeit -n 1 -r 3 -s "import re ; haystack = (' ' * 20000) + 'a'" -c "re.search('\s+$', haystack)"

1 loops, best of 3: 4.23 sec per loop

Options

    -n  how many times to execute statement
    -r  how many times to repeat the timer
    -s  setup code (run once)
    -c  command to run n times

smegel · on July 20, 2016

I guess this must have been the scenario. Perl also defaults to re.search but does not exhibit the pathological case, maybe because it knows it can't find a $ in a block of \s.

wfunction · on July 20, 2016

By .? you mean .* ?

ghusbands · on July 20, 2016

You are wrong and shouldn't speak authoritatively about things you know little about. Most regex engines, including the one in Perl, do backtrack as described. As a special case, Perl will match backwards from a $ (or in look-behind), in specific cases, but not in the case described, where the $ only appears in one part of an alternation. Similarly, as the other poster noted, most regular expression engines default to 'search' as the primitive, rather than matching only from the start.

So, yes, in a sequence of 20,000 characters not appearing at the end of the string, the typical regex engine will backtrack in a manner exposing O(n^2) performance in this case.

If you're going to test something, use the actual regexes they talk about, and perhaps the language/runtime they talk about.

perlgeek · on July 21, 2016

Perl does a pretty good job at optimizing such things:

    $ time perl -Mre=debug -e '(" " x 100000000 . "X") =~ /\s+$/ && print "MATCH"'
    Compiling REx "\s+$"
    Final program:
       1: PLUS (3)
       2:   POSIXD[\s] (0)
       3: EOL (4)
       4: END (0)
    floating ""$ at 1..9223372036854775807 (checking floating) stclass POSIXD[\s] plus minlen 1 
    Matching REx "\s+$" against "                                                            "...
    Intuit: trying to determine minimum start position...
      Found floating substr ""$ at offset 100000001...
      (multiline anchor test skipped)
      looking for class: start_shift: 1 check_at: 100000001 rx_origin: 0 endpos: 100000001
      Does not contradict STCLASS...
    Intuit: Successfully guessed: match at offset 0
    Matching stclass POSIXD[\s] against "                                                            "... (100000001 bytes)
       0 <> <          >         |  1:PLUS(3)
                                      POSIXD[\s] can match 100000000 times out of 2147483647...
    100000000 <           > <X>       |  3:  EOL(4)
                                        failed...
                                      failed...
    Contradicts stclass... [regexec_flags]
    Match failed
    Freeing REx: "\s+$"
    
    real    0m0.427s
    user    0m0.312s
    sys     0m0.076s

Basically, the regex egine sees a $ anchor, moves to the end of the string, and finds that it can't ever match there.

It makes it quite hard to accidentally trigger such bad regex behavior. See for example https://perlgeek.de/blog-en/perl-tips/in-search-of-an-expone... (disclaimer: my own blog)

lazyjones · on July 21, 2016

> I can't reproduce this "bug" in either Perl or Python.

You're doing it wrong then.

  #!/usr/bin/perl

  use Benchmark;

  sub trim {
      my $s=shift;
      $s =~ s/^[\s\u200c]+|[\s\u200c]+$//g;
      return $s;
  }

  timethis(100, sub { trim("x" . " " x 20000 . "x"); });

With Perl 5.18 I get 2.65/s on a fast iMac.

rhizome · on July 20, 2016

Is there a difference between greedy and non-greedy atoms?

ghusbands · on July 20, 2016

In the order of searching and hence the match you can get, yes. In performance in the case of a non-match, no.

cyphar · on July 21, 2016

The regex a?a?a?a?a?aaaaa against the string aaaaa will complete in linear time if you use non-greedy ?s but with greedy matching it has exponential complexity. So, there is a difference.