Stack Overflow Outage Postmortem (stackstatus.net)
862 points by gbrayut on July 20, 2016 | 316 comments



Perfect. Awesome bug. Awesome Post Mortem. This was just fun to read.

While this might have been caused by mistake - these types of bugs can be (and are) abused by hackers.

https://www.owasp.org/index.php/Regular_expression_Denial_of... https://en.wikipedia.org/wiki/ReDoS

The post also links to this video: https://vimeo.com/112065252
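
A quick way to reproduce the pathological case from the post, assuming CPython's backtracking re engine (the sizes and the trailing 'x' are just illustrative): the runtime roughly quadruples every time the input doubles, which is the quadratic blow-up the postmortem describes.

    import re
    import timeit

    # Quadratic behavior of a backtracking engine on the trailing-whitespace regex:
    # lots of spaces followed by a single non-space, so \s+$ can never match.
    pattern = re.compile(r'\s+$')
    for n in (5000, 10000, 20000):
        s = ' ' * n + 'x'
        t = timeit.timeit(lambda: pattern.search(s), number=1)
        print(n, round(t, 2))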


Well, in this case a post contained 20K whitespace characters, so I wouldn't jump to the conclusion that it was a mistake rather than intentional.


Yeah, I'm trying to figure out how you even get 20,000 spaces into a Stack Exchange post, and how it would render in your browser.


I can tell you how.

A shitty Belkin KVM in certain configurations can allow this to happen.

There's a bug which keeps generating chr(32) characters when you activate the keyboard shortcut (scroll lock twice), and try to switch to another machine.

It will keep pumping out those spaces on whatever field was selected at the time, so if you take your time before you switch back, you are going to be in for a lot of fun.

Haven't read the link yet, but wanted to share this sooner than later.


On a related note: is there an easy way to tell Vim, while in insert mode, not to accept more than a single trailing space at the end of a line?

Of course I've got checks in place that warn about trailing spaces, and my Git hooks outright refuse a commit with trailing spaces. But it would be nice to catch that at the insert level.

You know, it may be the cat who's typing.


Not exactly at insert mode, but this may be helpful:

    autocmd BufWritePre * :%s/\s\+$//e
Before you save a file, it removes all trailing whitespace (ironically using the same regex as in the article -- hasn't given me trouble yet though)


Nice find! Another one I heard about was that Microsoft's Windows keyboards, when used on a Mac, will sometimes insert a Control-P into the middle of your typing if you press the Windows key. Most apps just ignore it, but apparently not all!


Bottle of tequila on the spacebar


A bottle of Tres Commas?


Eh, just type a single space.

Then copy the entire post and paste it a couple dozen times.


Only 35 keystrokes are required to get 20K+ spaces.[1]

[1] https://stackoverflow.com/questions/4606984/maximum-number-o...


I'm thinking perhaps you didn't get the Silicon Valley reference?


Never got around to pirating it. My bad.


Welcome to Reddit News.


I don't read anything programming-related on Reddit anymore, but are they also now too cool to watch TV?


It was in a multiline code block, so it just had a tonne of horizontal scroll.

See the edit: http://stackoverflow.com/revisions/38484433/2


If he had used tabs instead of spaces this wouldn't have been a problem.


Or at the very least, trim whitespace on save :)


Still hasn't killed as many characters as GRRM


Good find! Where did you see this? Or was it in the comments via Nick Craver?


The article said:

    > The malformed post contained roughly 20,000 consecutive
    > characters of whitespace on a comment line that started
    > with -- play happy sound for player to enjoy. For us,
    > the sound was not happy.
So I just searched:

    https://www.google.co.uk/search?q="play+happy+sound+for+player+to+enjoy"+site%3Astackoverflow.com


"deleted 20507 characters in body"

Hahaha


Cat laying its head on the spacebar.

Mine does that on the Control key all the time.



Browsers typically collapse whitespace, so it probably would render as a single space.


I frequently complete eBay feedback 'comments' fields[0] with a variety of Unicode spaces and Firefox at least doesn't seem to collapse them.

[0] eBay insists on something being entered and when it was a routine transaction with a vendor I seldom have anything useful to say.


Not sure if you're a frequent eBayer, but the common behavior in that community is just "A+++++++++" with some arbitrary number of "+" indicating that everything went fine.


He's no doubt observed that and is using whitespace to rail against the ridiculousness of that convention.


was in code tags, so had a very long horizontal scroll bar :P


Not in a pre....


I wouldn't jump to any conclusion ;)

But I do know that tools often make it easy for people to do incredibly stupid things by accident. Like the 'sudo remove file' command followed by '-rf' in the wrong place. I rimraf a lot; it's very useful. It's a miracle I haven't wiped out any computers doing so...


I think it'd be possible to inject this kind of thing into your code if you're just starting out with vim, aren't cognisant of all commands you invoke, and then copy/paste all code straight into a browser.


Yeah, I agree, when starting out with VIM, my spacing and tab usage was inconsistent at best.


Have to agree. I read a Post Incident Review like this and I'm like "Yep, totally see how that could happen. Thanks for solving it and thanks for letting me know why it happened".


Ha! The same bug happened internally at my company. In that case it was a regex matching a URL taking so much CPU as to cause a DOS of a proxy server. I won't be surprised if it's happened to someone here too.

This is very timely, because minutes ago, I made a link to Russ Cox's articles in my Kernighan awk repo:

https://github.com/andychu/bwk

https://swtch.com/~rsc/regexp/regexp1.html

If you are not familiar with this issue, basically Perl popularized bad computer science... "regexes" are not regular languages.

They say that this particular case triggered quadratic behavior, not exponential, but the point is that there is a linear time algorithm to do this.

The file b.c in the awk repo implements the linear time algorithm:

https://github.com/andychu/bwk/blob/master/b.c

(and rsc's site has some nice sample code too, as well as caveats with regard to capturing and so forth)


The key quote here is:

"Regular expressions are one of computer science's shining examples of how using good theory leads to good programs ..."

"Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."

The alarming thing is that regex are supposed to be compiled before use, and the class of expressions that need quadratic or exponential time is distinct from the class that needs linear time. Why don't these popular, widely-used regex implementations perform any optimizations?


So you have the classic NFA and DFA engines, and these more modern NSH (non-deterministic space heater) engines.

At least they keep you warm. Sometimes.

How can it not be a feature that you can heat more space the more spaces you feed it?


> How can it not be a feature that you can heat more space the more spaces you feed it?

You just made my day. Thank you.


Unfortunately, there is a hint as to how this has happened:

"This strategy is no longer practical: users have come to rely on backreferences for at least occasional use, and backreferences are part of the POSIX standard for regular expressions."

What better excuse is there for a poor implementation than standards compliance? In many ways, using regex with backtracking by default is like programming in Lisp without tail-call optimizations. If the POSIX standard is going to require backreferences, then it should also require a non-backtracking implementation for regular expressions without backreferences, just like the Scheme specification requires implementations to support tail-call optimization.

The comparison is valid because both can turn a linear-time algorithm into a quadratic (or worse) one.


> If the POSIX standard is going to require backreferences, then it should also require a non-backtracking implementation for regular expressions without backreferences

At least in some regexp engines, this is possible. There is a concept of possessive (as opposed to reluctant, or as it's often called, "non-greedy") quantifiers. Given the string "abb", consider the following regexes:

/[ab]+b/ -- matches "abb"

/[ab]+?b/ -- matches "ab"

/[ab]++b/ -- DOES NOT MATCH!! (Because there is no back-tracking.)

More info: http://www.regular-expressions.info/possessive.html

One regexp engine which implements this (see section 4): https://raw.githubusercontent.com/k-takata/Onigmo/master/doc... -- this is the engine used by modern Ruby versions.
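
For what it's worth, the same three cases can be checked from Python with the third-party regex module, which supports possessive quantifiers (Python 3.11+'s built-in re accepts them as well); a small sketch:

    import regex  # third-party module with possessive quantifier support

    print(regex.search(r'[ab]+b',  'abb'))  # matches 'abb' (greedy, with backtracking)
    print(regex.search(r'[ab]+?b', 'abb'))  # matches 'ab' (reluctant)
    print(regex.search(r'[ab]++b', 'abb'))  # None: the possessive group gives nothing back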


Yes, I recall that he talks about the multi-engine approach with respect to both PCRE and RE2 somewhere in these articles:

https://swtch.com/~rsc/regexp/

Sorry I don't have the exact reference handy. Apparently the author of PCRE tried to implement an NFA/DFA engine for some regexes as well. And I think RE2 might also have different strategies for different regexes as well, but I forget the details (it might be related to capturing).


FWIW, the conversion from NFA (non-deterministic, i.e. backtracking) to DFA (deterministic, linear time) can take exponential space. So there's another avenue for DDOS; it's a lot harder to exploit, though, because it requires the attacker to control the input (i.e. regular expression) to the NFA->DFA transformation, rather than merely provide a string that takes a long time for an NFA to recognize.


I'm actually not aware of any popular DFA based engines that suffer from this vulnerability. grep and RE2, for example, build the DFA lazily and cap the size of the DFA such that matching is still linear time even if generating the full DFA would take exponential space. (This is because at most one DFA state is generated for each byte in the input, so technically, you only ever need space for one or two states. In practice, you give a little more room to avoid constantly recomputing states, but it's still bounded.)


Random side note... Recognized your name, we use your TOML go package in a bunch of our tools. Thanks and keep up the great work!


There is a misunderstanding here -- NFA simulation is NOT backtracking. A NFA can be simulated in time LINEAR to the input length (holding the regex length as a constant). A DFA state corresponds to a set of NFA states. It is an additional optimization to convert NFA to DFA, but it just prevents doing repeated calculation -- it doesn't affect the computational complexity. Cox's articles address this and are quite lucid.

The problem is that the Friedl O'Reilly book uses the terms NFA and DFA in a completely wrong and boneheaded way. I was confused by that book for a while.


Only if you insist on materializing a DFA state for every combination of states from the NFA. If you don't mind describing the current state as a set of potential NFA states, the space needed is linear.


Usually there's no need to convert the NFA to a DFA. The NFA can be simulated directly, e.g. by using a bit vector instead of a single integer to track the set of live states. Also, no backtracking is necessary to simulate an NFA.
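
As a sketch of what "simulate the NFA directly" means (the states and transition table below are made up for the ends-in-whitespace example, not taken from any real engine): keep the set of live states, advance all of them one character at a time, and never backtrack.

    # Direct NFA simulation: one pass over the input, tracking the set of live
    # states. This example NFA accepts strings that end in one or more
    # whitespace characters (roughly ".*\s+$").
    def simulate(nfa, start, accepting, text):
        states = {start}
        for ch in text:
            kind = 'ws' if ch.isspace() else 'other'
            states = {t for s in states for t in nfa.get((s, kind), ())}
        return bool(states & accepting)

    # State 0: anything seen so far; state 1: inside a trailing whitespace run.
    NFA = {
        (0, 'ws'):    {0, 1},  # whitespace may start the trailing run (or not)
        (0, 'other'): {0},
        (1, 'ws'):    {1},     # more whitespace keeps the run alive
        # no (1, 'other') entry: a non-space kills that thread
    }

    print(simulate(NFA, 0, {1}, 'hello   '))          # True
    print(simulate(NFA, 0, {1}, ' ' * 20000 + 'x'))   # False, in linear time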


The name regular expression seems meant to evoke regular languages, whose significance is that they can be recognized in a very straightforward way (by finite state automata) without pathological gotcha cases like this.

Perhaps we ought to call them irregular expressions - this kind of behavior is the literal antithesis of what regular means in automata/parsing theory.


It's a little unfair to complain that they're slower than 30 year old regex engines when the old regex engines were so feature limited that they were nearly useless.


They're only missing one feature: back references. This feature is not needed for the majority of regular expressions, so the 30 year old engines are actually remarkably useful and don't have these pathologies. And actually it was the introduction of back references (in Perl) that causes you to have to implement backtracking regular expression engines.


In addition to back references, Perl also has forward and back assertions, IF/ELSE, and recursion IIRC. I'm not sure if anyone actually USES those features, but they are there.

Python adopted back references as well as forward and back assertions, but not the other stuff.

Perl really turned regular expressions into something else entirely...


Yes, I forgot about assertions. I've never _really_ used them, but I remember they were useful once or twice when wrangling with a particularly ugly format problem.


The first non-boilerplate-header line in b.c:

    /* lasciate ogne speranza, voi ch'intrate. */
Well, that's encouraging...

("Abandon all hope, you who enter here.")


Admittedly I haven't stepped through all the code, but it's 958 lines and doesn't look horrible. It somewhat resembles the example C programs here:

https://swtch.com/~rsc/regexp/

In particular there don't seem to be crazy optimizations and non-portable stuff like you see in other production regex implementations and in interpreters. It's pure ANSI C.

The functions are all short -- no 1000 line monsters like I've seen in a lot of old C code lately.

The rsc example code is actually a great example of using pointers in C. People rightly avoid lots of raw pointers for application code, but for this use case, they yield compact and elegant code.


There's a simple trick the real-time and high-assurance communities use to catch stuff like this: enforce a hard limit for time each task takes to complete. Some you might give leeway to prevent DDOS'ing your users. Many (most?) things have a sane, hard limit that can apply along with a log entry or notification to ops. Keeps one user or task from DDOSing whole system or forces it to fail fast + noticeably.

Not a lot of developers use this trick anymore, but it's quite powerful. Should get more consideration. Not to mention the real solution of choosing tools or algorithms with predictable, reliable behavior. Those definitely exist for the task they're performing, as other commenters noted.

EDIT to add modern example from high-assurance field that shows how awesome it can be (esp see Github link):

https://leepike.github.io/Copilot/


For the specific example, this is how I dealt with a situation in one service I wrote that takes an arbitrary user-controlled regex (for search).

The program was written in python, and as it turns out the GIL prevents a thread from being able to interrupt another thread that's doing a re.match() call - the entire regex engine is in C so the GIL is held for the entire call.

My solution was to use SIGALRM to set a timer before calling, and have the signal handler raise an exception. This causes the regex call to raise that exception, which I then catch and display an appropriate error.
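
A minimal sketch of that approach (hedged: this relies on the alarm being delivered to the main thread on a Unix system, and the names below are illustrative):

    import re
    import signal

    class RegexTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise RegexTimeout()

    def search_with_timeout(pattern, text, seconds=1):
        # Arm a SIGALRM before the potentially slow call; the handler raises,
        # which surfaces as an exception from the re.search() call.
        old_handler = signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(seconds)
        try:
            return re.search(pattern, text)
        finally:
            signal.alarm(0)                      # cancel any pending alarm
            signal.signal(signal.SIGALRM, old_handler)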


You can only safely abort a "task" if the task has specifically been designed that way, or if the task is a process.


Which is of course part of his point. Anything that processes arbitrary user input should be designed in a way that is abortable in some way. As this particular case shows even the attempts to sanitize input can be vectors for a DOS against them.


It's why you use isolation techniques outside that code as the fall-back. Simplest forms I saw involved while loops counting downward and threads that got interrupted to assess things.


Everything processes user input. Whether a numeric user id or some text.


That's going a bit far. User input is usually handled by a small percentage of modules. Those should guard against malicious or faulty input where possible.

That's not even the issue here, though. The issue is that the input is processed via an algorithm that is easy to hang and/or a format that's hard to parse. On top of that, there was no isolation or monitor to catch the fault. Each of these can be done differently... and are in many programs... to avoid or better mitigate such risks. Developers often don't.


Since those all have a lot more code than is needed to show the basic idea of an NFA-based matcher: https://github.com/darius/sketchbook/blob/master/regex/nfa_s...


Sadly, not many Thompson-style libraries are implemented. I have tried to find one for F#, but the ones I found are just toy projects.


It doesn't actually take that many lines of code to implement a linear time NFA engine. Most of the code is actually in the regex compiler. That is, there are only a few actual "instructions" or node types in a regex engine (alternation, concatenation, etc.). The rest is just compiling the bizarre syntax to those nodes/instructions. (And dealing with Unicode if you need that.)

The whole awk implementation is 958 lines, so it can't be that bad. It supports a fairly featureful regex language (though no capturing AFAICT).


If it is that "simple", why then are there so few? Or is it because I don't know where to look?

Maybe regexes are just good enough and few bother to build something faster?

I have found some starter code at https://t0yv0.blogspot.com/2011/02/home-made-regular-express.... The thing is that I don't know how much else is necessary to have a well-made library...


I'd look at GNU sed if you wanted a very featureful substitution engine that IIRC is also linear time.


Hm it seems to be a copy of the GNU libc regex engine. And coreutils, grep, and awk also package the same engine in user space! That's really annoying.

But yes it looks like it uses a struct re_dfa_t for matching, and some special flags for backreferences? That seems to indicate that they are using linear time maybe with a fallback? Hm.

I think the general turn of events was that libc supported BRE at first, which was implemented using the DFA algorithm. Then Perl implemented a backtracking engine and Perl syntax for REs, and Python/Ruby/etc. wanted to be compatible with Perl, so they also implemented backtracking as well. Because Perl has other features like forward and back assertions that require backtracking.

And then perhaps libc got EREs with backreferences which they bolted onto the existing infrastructure?

Anyway, thanks for the pointer... I may look into these more in the future.

musl libc also uses the TRE regex engine, which apparently uses the DFA algorithm.


"This regular expression has been replaced with a substring function."

This should be the title of a book on software engineering.


I've fixed so many bugs using regex, only to have to fix several bugs later.

My current stance is: avoid regex if at all possible. Turns out, many of the things we use regex for are possible without. Oftentimes, .Substring, .IndexOf, and using LINQ over strings is sufficient.


Regexes are almost always a massive code smell. They should almost never be used, bad idea, bad implementation, hard to grok, hard to debug, hard to test, hard to spot.

Whoever came up with them has surely been given the same honorary place in hell with Jon Postel, who invented the utterly disastrous "be liberal in what you accept", that has plagued all web developers for the last 20 years.


> Regexes are almost always a massive code smell. They should almost never be used

Regular expressions are just a notation for expressing a finite state automaton that recognizes a regular language: if that's the problem that you have, regular expressions are definitely the tool you want to be using. For instance, I recently made myself a small tool to compute the min/max/expected value of a dice roll; the notation for such a thing forms a regular language that can be expressed as follows:

    non_zero_digit ::= '1' | ... | '9'
    digit ::= '0' | non_zero_digit
    integer ::= non_zero_digit digit*
    modifier ::= '+' integer | '-' integer
    roll ::= integer 'd' integer modifier?
    
Converting this grammar to a regular expression is straight-forward and was the correct tool to use.

I agree that regular expressions are often used in contexts where they should not, especially when people start using back-references to recognize non-regular languages, but don't throw out a perfectly good tool because sometimes it is ill-suited.
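
For instance, one straightforward translation of that grammar into a Python regex (the group structure here is just one way to slice it) would be:

    import re

    # integer  -> [1-9][0-9]*
    # modifier -> [+-][1-9][0-9]*
    # roll     -> integer 'd' integer modifier?
    ROLL = re.compile(r'^([1-9][0-9]*)d([1-9][0-9]*)([+-][1-9][0-9]*)?$')

    print(ROLL.match('3d6+2').groups())   # ('3', '6', '+2')
    print(ROLL.match('2d10').groups())    # ('2', '10', None)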


Regexes are perfectly fine. Awful implementations that accept patterns that aren't regular expressions without complaint and provide little to no tools to look at underlying automata or to step through them during execution are the problem.

It's quite an amazing problem to have, honestly, because it really shouldn't be a problem to create a proper implementation. You can learn the theory behind regular expressions in a few hours and know pretty much everything there is to know.


How are they hard to test? Given input x, expect output y. It's one of the easiest things in the world to test.


The point of this post is that even though the regex behaves correctly (input x produces expected output y), you also need to consider performance constraints.

Without fuzzing, it's going to be pretty difficult to come up with enough test cases to thoroughly test a regex.


Wouldn't the same apply to a custom method? Why would code using a combination of indexOf(), substring() and trim() be any less foolproof on arbitrary data?


Yeah, testing what you expect in regexes is super easy. Edge case testing is not though. Just like exactly what happened in the post this discussion is about..


Where do you think you are, right now?


Where x can be any combination of letters of any length?


This is a load of BS. They are quite capable (and fast) tools in the right hands. And they are easily (and SHOULD be) tested, in any test suite.

As proof, I submit some email header parsing code which I rewrote as a well-commented Regex which was something like 300x faster than using the Mail gem: https://github.com/pmarreck/ruby-snippets/blob/master/header...

Once you know what to look for re: catastrophic backtracking, you know how to avoid it. This is called programmer skill.


>"be liberal in what you accept"

... is a part of so-called unix philosophy, the CAUSE of the internet. The fact that many web developers forgot to become a programmer is just a fun fact changing nothing.


I'll write a trivial state machine over using regex any day of the week. They are easier to write and read (especially if complex), aren't a black box in terms of profiling and don't have surprising performance characteristics like regex does. Unless we're talking about a Thompson NFA (which isn't used by many modern stdlibs) or JS, regexes simply do not appear in my belt of solutions.


And when you actually need complex parsing, parser combinators can express it in a much more maintainable way.


"I had a problem, solved it with RegExp, now I have two problems"


Heh... as a Java developer, my favorite version of that joke is where you solve the problem with Java, and now you have an `AbstractProblemFactoryFactory`.


There's quite a nice history to that one:

http://regex.info/blog/2006-09-15/247

It dates back to August 12, 1997!

(Perhaps someone should encourage jwz to have a 20th birthday party for that at DNA?)


    This regular expression has been replaced with a substring function.
God I wish all my bugs were this easy to fix and deploy


i wish people would stop using regular expressions in situations where they can be replaced with a substring function.


Nick explained on Reddit why the regex was used[1]:

> While I can't speak for the original motivation from many moons ago, .Trim() still doesn't trim \u200c. It's useful in most cases, but not the complete strip we need here.

This would have probably been my train of thought (assuming that I consider regex to be a valid solution):

Trim() would have been the correct solution, were it not for that behavior. Substring is therefore the correct solution. Problem is, IndexOf only accepts a char array (not a set of some form, i.e. HashSet). You'd need to write the <Last>IndexOfNonWhitespace methods yourself. Use a regex and make sure that it doesn't backtrack, because it's expressive and regex "is designed to solve this type of problem." The real problem/solution here isn't substring, it's finding where to substring.

I consider regex too dangerous to use in any circumstance, but I can certainly see why someone would find it attractive at first.

[1]: https://www.reddit.com/r/programming/comments/4tt6ce/stack_e...
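
To make the substring idea concrete, here is a hedged sketch in Python of scanning in from both ends against an explicit whitespace set (the set below is illustrative, not Stack Overflow's actual list; note that U+200C is not stripped by str.strip() either):

    # Trim an explicit set of "whitespace" characters, including U+200C,
    # by scanning in from both ends and taking a single slice.
    # O(n), and no backtracking is possible.
    WHITESPACE = set(' \t\r\n\f\v\u00a0\u200c')   # illustrative set

    def trim(s):
        start, end = 0, len(s)
        while start < end and s[start] in WHITESPACE:
            start += 1
        while end > start and s[end - 1] in WHITESPACE:
            end -= 1
        return s[start:end]

    print(repr(trim('\u200c  hello world  \u200c')))   # 'hello world'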


Oh totally. I assumed it was that unicode BS immediately. And anyone would make this mistake easily. That's the point -- gotta have it imprinted in the brain that regexes are for finding things in files, not for your production code. I've used them myself, but I'd like to think that when I type that regex in, I stop and think whether I will be feeding raw user inputs into it.


Many times regexes are more clear and therefore less bug prone than any non regex alternative. They have their use. Even in production.


example please


Compressing multiple forms of non-unicode whitespace to single space. Used for cleaning text from input fields that often contains unwanted characters from copy/paste.

The regexp for this is simply \s+


What exactly is a substring function, and what makes it different than a regex?


My guess is they're searching for the first non-whitespace character, reverse-searching for the last non-whitespace character, and then using String.Substring to return only what's in the middle.

As to why they're not using String.Trim (https://msdn.microsoft.com/en-us/library/t97s7bs3(v=vs.110)....), maybe it's because String.Trim doesn't seem to know about the 200c whitespace character.


From what I understood, trim would work perfectly. It's 200 spaces not a single 200 width character.


You're misreading the regex. \u200c is a single whitespace character. http://www.fileformat.info/info/unicode/char/200c/index.htm


But that's a weird character to put in a comment line! I don't get how this would happen accidentally.


You really can't think this way when accepting input from users.


What difference does it make if it got in there accidentally or on purpose?

Stackoverflow is a programmers' site, so you must expect that a programmer might go, "Hmm, they're trimming whitespace, wonder what happens if I put 20,000 unicode whitespace characters in there instead of normal whitespace"?


Runaway automatic search-and-replace? There's no way to distinguish intent.


Runaway search and replace won't put a single 200 width whitespace character AFAICT


That's the meaning of "runaway" - Notepad++ had a search&replace that went into a somewhat random, long and uninterruptible loop if you were replacing using some types of regex and searching forward in the file - you had to search backwards.


Also, removing spaces from the start and end of a line is a fairly standard operation which is provided by every language's built in library.


Although, many of them do not handle non-ASCII Unicode whitespace characters (which is what the StackOverflow regex was going for).


Here, the application's notion of whitespace was more comprehensive than the standard library's.


I'm surprised that a developer was able to fix StackOverflow without being able to look up the error message on StackOverflow.


Well, StackOverflow devs are able to cheat and load up the site on their local machine if they want to


Anyone can download the underlying data and look at it on their local machine.

https://archive.org/details/stackexchange


I'm not sure I like the idea of being able to connect to a prod db from a dev instance, but whatever floats your boat.


I doubt they'd allow that either - but database exports are a thing or they could just query the database or elasticsearch themselves.


There are public dumps of Stack Overflow's database. https://archive.org/details/stackexchange


Who says they have to? They can just restore a recent backup.


hah! The catch 22


In the past, I have done Load Balancer status checks against a special /status endpoint. I queried all the connected services (i.e. DB, Redis, etc) with a super fast query (i.e. `SELECT version();`). Monitoring CPU/MEM usage for scaling was separate.

Comparing this to checking the home page, what is the best way to set up a health check for your load balancers?
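
For reference, a minimal sketch of that kind of /status endpoint (assuming Flask and SQLAlchemy; the route, query, and DSN are illustrative):

    from flask import Flask, jsonify
    import sqlalchemy

    app = Flask(__name__)
    engine = sqlalchemy.create_engine('postgresql://localhost/app')  # hypothetical DSN

    @app.route('/status')
    def status():
        try:
            with engine.connect() as conn:
                conn.execute(sqlalchemy.text('SELECT version();'))  # cheap round-trip
            return jsonify(ok=True), 200
        except Exception:
            # A non-200 tells the load balancer to pull this node from rotation
            return jsonify(ok=False), 503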


I think you've got the right approach - a vertical slice through the app that checks every layer. You want to know if a user can get useful info from your site, and it tracks (separately!) the common path their query would follow.

The danger is that the endpoint becomes public knowledge and comes under a DDOS attack. Putting an IP address filter on that endpoint is usually enough to stop that.


The concept I try to go for with that status check is 'Can this node connect to everything so it can successfully respond to http requests'. However my approach wouldn't identify an overloaded server, which might be a good thing if we need to scale up - taking down an overloaded server is just going to make the other servers that much more overloaded.

I'm always up for hearing about other ways people solve health checks.


> taking down an overloaded server is just going to make the other servers that much more overloaded

Emphatically agree, but it's important in the first place to design and deploy your infrastructure such that basic increases of scale are accounted for - prevention is the most important piece of the puzzle. Get that right and an overloaded server is symptomatic of something else, in which case taking down access to the unruly resource is first priority.

IMO, the big takeaway here is that they were load balancing simply by only hitting the top level - selectivity is somewhat tedious to build but worth it in the long run.


Just because it can connect to everything doesn't mean it can successfully respond though.


> Putting an IP address filter on that endpoint is usually enough to stop that.

Why is there even direct external IP connectivity to the realserver, sidestepping the loadbalancer?


Not all load balancers function at the same OSI layer. If you are using an IP based load balancer with transparent NAT, you don't really have a choice. Requests will be balanced across systems to that health check, but it's still a DOS if it's intensive enough to serve and someone hammers it hard enough.


Lots of times companies will have a 3rd party firm do their availability monitoring for them.


I'd be VERY careful about including external dependencies in an HTTP health check which results in a web server being removed from service - it's usually an invitation for cascading failures.

1) If you do have a back-end failure, this setup can cloud the root cause during recovery because your downstream web servers are down as well.

2) Transitory back-end failures can cascade, and your health checks can make this worse as things flap around. You need to be VERY careful with your timers and retry logic, and tune them appropriately. eg, how does `/status` handle a single TCP reset from the DB? When does it decide to timeout if the DB doesn't respond? It's hard to get this right - network errors vs protocol errors vs timeouts are often handled differently at different layers of the controller, and the load balancer has its own retries and timeouts.

3) If this health check is being used for an API, recovery can be more difficult because of thundering herds. eg, 5/5 API servers go down because of a temporary DB failure. Queues and requests backup. 1/5 API servers is restored to service a bit earlier than the others, and immediately goes down again because it's not degrading gracefully under heavy load. Wash/rinse/repeat until you figure out how to load shed.

4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken? Ideally, the app will degrade gracefully if certain request can be served from cache, or don't require the failed backend.

5) The failure case where this type of thing might be useful (network partition affecting DB connectivity from a subset of your webservers) isn't very common in the environments I've been involved with.

A far more productive approach if you absolutely need this sort of thing is to implement circuit breakers[1] at the app level, and explicitly make certain tripped breakers a hard failure for the /status endpoint. This has the advantage of not depending solely on synthetic queries, and having very explicit restoration logic.

For load balancer health checks, I prefer ones that hit the controller and expect a 200, and nothing more. (The DB and overall health of the service cluster are monitored through more organic instrumentation.)

I was laughing to myself when I read the Stack Overflow post mortem because I have a healthy pile of healthcheck-too-expensive war stories, and their experience resonated.

[1] http://martinfowler.com/bliki/CircuitBreaker.html
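
A bare-bones sketch of the circuit breaker idea (thresholds and names are illustrative; see Fowler's article above for the full pattern):

    import time

    class CircuitBreaker:
        def __init__(self, max_failures=5, reset_after=30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    raise RuntimeError('circuit open')  # fail fast, skip the backend
                self.opened_at = None                   # half-open: allow one trial call
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()   # trip the breaker
                raise
            else:
                self.failures = 0                       # success closes the breaker
                return result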


You make a good point about health checks. It's easy to conflate single node health with overall service health. The LB check should really be about "can this node serve HTTP traffic", even if the non-health-check requests can only send back 5xx responses because their dependencies are failing.

I've also had the opposite problem where the health check would answer 200 OK, but the real app itself had a corrupted internal state because of a poor design/threading bug. If the health check had been the home page, the node would have been pulled from the LB. While a more in-depth health check would have helped here, I think it's better to alert/monitor/kill any node with a higher than normal error rate and leave the health check simple.


You make a good point, but I don't see it as black and white.

> 4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken?

Yes, if that's the case you should not respond "unable to serve". However, you should probably fail setup check and not start serving traffic if you're a newly provisioned node.

But!

I wouldn't call this cascading failure so much as error propagation. If a critical dependency is down, making the node essentially worthless, it should be removed. This can be caused by things like network partitions and whatnot - as you say in 5. You say "it's not likely", which is sadly a common statement, but when it does happen it can be nasty - and the bigger the environment, the more often it happens.

Cascading failure usually (in my experience anyway) refers to having the actual failover cause more outages. In the case of failing /status I can only see this happening if the architecture of the system is broken anyway - or the load is genuinely too high for the failure that just occurred.

You say: "The DB and overall health of the service cluster are monitored through more organic instrumentation". What is acting on that? What happens when you do get that rare network partition and half of your web nodes cannot reach the database?


> For load balancer health checks, I prefer ones that hit the controller and expect a 200, and nothing more

A very sound strategy, IMO. I subscribe to this KISS practice. If response/status != 200, issue an alert and/or call a deeper health check routine.


This is the right way. It's too easy to bork everything otherwise; yesterday I borked a new build server because I set security up to disallow anonymous access, which lead to the main page returning a 302 redirect to the login page and thereby failed the load balancer health check.

You live and learn.


I remember the day I learned that Python's "re" module uses backtracking for non-extended regexes. My tests covered lots of corner cases in the regex logic, but were too short for me to notice the performance penalty. Luckily I only caused a partial outage in production.

I actually got to talk to Raymond Hettinger (Python core team) about why re uses a potentially exponential-time algorithm for regexes when there is a famous linear-time algorithm, and (I suspect) most programmers would assume linear complexity. As it turns out, there was an attempt to re-write re to fix this, but the re-write never managed to present exactly the same (extremely large) API as the existing module. He advised me that "the standard library is where code goes to die."


I vaguely remember a replacement for Python's re module, but I can't remember the name, and Googling for it is an exercise in frustration.

Edit: it's "regex" - https://pypi.python.org/pypi/regex. I have no idea if it behaves better with backtracking, a cursory glance at the documentation doesn't indicate any changes in that area.


regex module shows the same behavior as re module in this case:

  $ python -mtimeit -s "s=' '*20000; import regex as re" "re.search(r'\s+$', s)" # ends with spaces is fast
  10000 loops, best of 3: 201 usec per loop
  $ python -mtimeit -s "s=' '*20000+'non-space'; import regex as re" "re.search(r'\s+$', s)" # ends with a non-space is slow
  10 loops, best of 3: 3.39 sec per loop


I looked at that module a while back. The goto statements make it very difficult to analyze. It seems to integrate well with the signals module, which could be used to interrupt a large class of ReDoS attempts, but determining whether all potential regexps could be interrupted using signals, with the goto statements, is a difficult task.


I started writing a replacement in Python that doesn't use backtracking. https://github.com/cyphar/redone


I finished (but never optimized it): https://bitbucket.org/devin.jeanpierre/re0


This is why I always do:

  s/^\s+//;
  s/\s+$//;
Instead of:

  s/^\s+|\s+$//;
Weirdly, I've "known" this since I started writing Perl in the mid-'90. Not sure where I originally read it (or was told it). Funny how that works.

I try to write my regexes such that they anchor at the front of the string or the back, or they describe the whole string; never an either-or anchoring type situation like this example.

Spaces at beginning of string (100,000 iterations):

             Rate onestep twostep
  onestep 62500/s      --     -2%
  twostep 63694/s      2%      --

  real	0m3.093s
  user	0m3.066s
  sys	0m0.018s
Spaces at end of string (100,000 iterations):

             Rate twostep onestep
  twostep 55249/s      --     -9%
  onestep 60976/s     10%      --

  real	0m3.453s
  user	0m3.421s
  sys	0m0.022s
Spaces in middle of string (only 500 iterations because I don't want to sit here for four hours):

             Rate onestep twostep
  onestep  7.11/s      --   -100%
  twostep 16667/s 234333%      --

  real	1m10.741s
  user	1m10.207s
  sys	0m0.228s


You can see in my other post that this doesn't always work. For example, Python's regex engine chokes on `\s+$` if you use `re.search`, but works fine if you use `re.match`.

It's likely that Perl has an optimization for `\s+$` but not `^\s+|\s+$` (the former regex only ever matches at the end of the string, which is amenable to optimizations).
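
The difference is easy to see from a quick timing (CPython's re; the numbers are machine-dependent):

    import re
    import timeit

    s = ' ' * 20000 + 'x'
    # search tries \s+$ from every starting position: quadratic
    print(timeit.timeit(lambda: re.search(r'\s+$', s), number=1))
    # match is anchored at position 0 only: linear
    print(timeit.timeit(lambda: re.match(r'\s+$', s), number=1))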


I did see your other post, and upvoted it. This rule of thumb has served me well between different regex dialects and implementations, but it's not surprising that there are some specific cases that are "broken" for lack of a better word.

I haven't done much Python but the documentation for re.search() and re.match() is very clear: use search to find an expression anywhere in a string, use match to find an expression at the beginning of a string. It appears to ignore anchors in both cases? Left undetermined then is how to anchor an expression at the end of a string. You say re.match() works, but this is pretty confusingly described, and it's easy to see how the ambiguity can lead to problems for even the most experienced programmers.


Anchors aren't ignored. For re.match, `^abc$` is equivalent to `abc`, so the anchors are just redundant. (N.B. `^^abc$$` will match the same set of strings as `^abc$`.) For re.search, `^abc$` is equivalent to `re.match(^abc$)` but `abc` is equivalent to `re.match(.*?abc.*?)`.

But yes, this is a very subtle semantic and different regex engines handle it differently. My favorite approach is to always use `.*?theregex.*?` and don't provide any additional methods. Instead, if the user wants to match from the beginning to the end, they just need to use the well known anchors. (Go's regexp package does this, and Rust borrowed that API choice as well.)


Correction: re.match anchors at the start, but not the end, so `^abc$` is equivalent to `abc$` but not `abc`.


I don't understand something: the regex expected a space character, followed by the end of the string. If the last character wasn't a space, this could never match. Why did the engine keep backtracking, even though it's easy to figure out that it could never match the regex?


Most simple regular expression evaluators are basically state machines. They hold a few variables:

a) what part of the regex am I currently trying to match

b) what point in the string am I currently starting at

c) how much of the string has this piece of the regex consumed so far

Then the state machine basically has three transitions:

* if (b+c) terminally matches (a), increment both (a) and (b) and reset (c)

* if (b+c) matches (a), but more could be consumed, increment (c) and check again

* if (b+c) doesn't match (a), increment (b) and reset (c)

So yeah, you could put in short cuts for things like "well this didn't match because the next piece of the regex is matching on a line ending", but keeping it simple is straightforward and generally smiled upon.
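
Specializing that recipe to the \s+$ pattern gives something like the sketch below (purely illustrative, not any real engine); the quadratic cost is visible in the nested loops: every starting position re-tests the $ anchor for every amount of whitespace it could give back.

    def naive_search_trailing_ws(s):
        """Naive backtracking search for \\s+$ following the recipe above."""
        n = len(s)
        for start in range(n):                    # (b) starting position
            if not s[start].isspace():
                continue
            end = start
            while end < n and s[end].isspace():   # (c) greedy \s+ consumption
                end += 1
            while end > start:                    # backtrack, re-testing $ each time
                if end == n:                      # $ only succeeds at end of string
                    return (start, end)
                end -= 1
        return None

    # ' ' * 20000 + 'x' forces roughly n**2 / 2 anchor tests: the case from the post.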


If you are really using a state machine and every match in the regex is at the end of the haystack, then you can reverse the state machine and use the same algorithm you described in reverse. :-)

(Rust's regex engine does this.)


Perhaps there simply isn't a separate code path for the presence of an end-of-string anchor and the regex is evaluated left-to-right like any other?


Because if the engine only examined each character at most once, then it would not be a backtracking RE engine.


You make a valid point. Either their analysis is dodgy, or the regex engine is very, very poor.


Is this the sort of thing that https://github.com/google/re2 was made to solve?


Yes, see my other post in this thread... Russ Cox is the author of RE2 and one of the articles I linked goes into its implementation.


Also implemented in the go stdlib https://golang.org/pkg/regexp



And it doesn't have any problems with this particular regex: https://play.golang.org/p/7UFkG3qrpS


It's been mathematically proven that engines of the form that Go has will always run in linear time for any regular expression, on any input. It's one of the more famous things about regex in general.


Yeah, the go team seems to be (rightfully) proud of their regexp implementation.


I think this might have been the post they quoted.

http://stackoverflow.com/questions/38484433/in-corona-sdk-ho...


Yeah, Nick "deleted 20507 characters in body" five hours ago: http://stackoverflow.com/posts/38484433/revisions


> deleted 20507 characters in body - Nick Craver

You are very likely right. Though it looks like they permanently deleted those characters because they don't show up on the edit history.


Yep.


Odd that this was the first post by the user.

Mere coincidence?


I imagine that's very common on SO (users with very few, or just 1, post)


A few months ago, a Stack Overflow representative asked me if their presence at a dev conference was justified. My positive answer more or less revolved around the importance SO took in the daily life of programmers everywhere.

If only she was there to witness the effect of a 34 minute downtime on an open space full of mobile/back/front developers.


Google's cached versions of SO answers are generally up to date, no? Nice try, lazy developers!


Alternatively, you can refer to StackPrinter [0] (especially if arriving via a search engine).

[0] http://www.stackprinter.com/


There are plenty of other sites that seem to scrape Stack Overflow questions and answers.


Nice bug. I tried to replicate this and indeed, the time to notice that no match is found grows very fast with the length of the input. Using a substring check is a good fix, but I tried to change the regex to fix this: if, instead of an end anchor, you add an optional non-whitespace character at the end of the pattern, then you only have to check whether the optional part is empty. Testing with very long strings which respectively match and don't match shows that the result is immediate in both cases.

    (defparameter *scanner*
      (ppcre:create-scanner
       '(:sequence
         (:register
          (:greedy-repetition 1 nil :whitespace-char-class))
         (:register
          (:greedy-repetition 0 1 :non-whitespace-char-class)))))

    (let ((length 40000))
      (defparameter *no-match*
        (let ((string (make-string length :initial-element #\space)))
          (setf (char string (1- (length string))) #\+)
          string))
      
      (defparameter *match* (make-string length :initial-element #\space)))

    (defun end-white-match (string)
      (ppcre:do-scans (ms me rs re *scanner* string)
        (when (and ms
                   (= (aref re 1) (aref rs 1)))
          (return (values ms me)))))

    (time (end-white-match *match*))
    0, 40000
    ;; Evaluation took:
    ;;   0.000 seconds of real time
    ;;   0.000000 seconds of total run time (0.000000 user, 0.000000 system)
    ;;   100.00% CPU
    ;;   25,139,832 processor cycles
    ;;   0 bytes consed

    (time (end-white-match *no-match*))
    NIL
    ;; Evaluation took:
    ;;   0.000 seconds of real time
    ;;   0.000000 seconds of total run time (0.000000 user, 0.000000 system)
    ;;   100.00% CPU
    ;;   11,105,364 processor cycles
    ;;   0 bytes consed


So, you're saying that if you enjoy tricky gotchas and puzzle-solving over reliability, then the perl-aping regex implementations have your back?


I am saying that I have been nerd sniped. Actually I tend to stay away from regexes.


Hmm. I wonder why one of the followup mitigations is not to move to a non-backtracking regex engine by default.

Most of what you want to do with a regex can be done with an NFA or DFA based engine. That which can't be done with an NFA or DFA based engine is generally better handled with a parser than a regex.

There are plenty of good DFA based regex matchers out there; RE2, the Rust regex crate, GNU grep, etc. At a glance, it even looks like glibc uses a DFA, though it supports POSIX REs which support backreferences so it must use backtracking at least for REs that contain backreferences.

Predictable hash collisions were a big source of DOS attacks in web scripting languages which use hash tables a lot, until they started rolling out randomized hashing algorithms to prevent easily predictable hash collisions. It seems like it would be best for languages and libraries to move to DFA based regexps, at least for anything that doesn't contain backreferences, to mitigate these kinds of issues from being easy to exploit.


> It took 10 minutes to identify the cause.

I'm impressed they were able to do this so quickly.


System not responsive. Look at the CPU load. Look at the process peaking at 100%. Force dump the stack trace of the process a couple of times. Hmm. All of them are stuck in the regex engine. Look back up the stack trace to see who calls it. Oh, it's in the home page's text cleansing code.

Something like that.


This is exactly what we did to diagnose (source: I was on the call). The only tricky part was figuring out which post it was, since it wasn't in the stacktrace. To do that, we grabbed the 3000 most recent posts and ran the regex against them. By that point we already had the code fix (another dev working on it in parallel), but if we hadn't we also could have gotten back up by just deleting the post.


Kudos for thinking clearly under tremendous pressure. A production server down is always a highly stressful situation.

If time pressure is not a factor and the production server has Visual Studio or WinDbg installed, attaching the debugger to the process can show the data related to the post. But the symbol file and source files might be needed; it's just more hassle to set up. For stressful situations, simple steps and simple tools are more useful. Whatever works.


Any idea if it was a malicious attempt? It kind of sounds like it was.


Nah, https://en.wikipedia.org/wiki/Hanlons_razor

If it was malicious they would have made a bunch of them, not just one. I personally have seen many files with unreal amounts of whitespace at the end.


Yeah, that's kinda what we're thinking. If it were malicious, the question itself would be unlikely to actually look like a real honest question, too. It'd just be a bunch of garbage input.


Yes, we took a stackdump and saw the traces leading to that regex.


How large was the stack dump? I'm impressed you identified it this quickly considering the time it would take to get this to disk.


We have an internal tool where we can dump stack traces almost instantly by attaching to a running process. We can't open source it due to using Microsoft lab code which was never licensed itself. However, clrmd (https://github.com/Microsoft/clrmd and https://www.nuget.org/packages/Microsoft.Diagnostics.Runtime) means an open source version is hopefully on the horizon. As an example, here's the result of me piddling in 2 hours with LinqPad: https://gist.github.com/NickCraver/d6292c0c7f93767686e8c5c89... Process attachment is in the clrmd roadmap, but isn't stable yet.


Awesome. Thanks for the response, love your work Nick.


ADPlus + WinDBG + SOS possibly.


I'm guessing they have the right tools to identify the problem. Wish they went into that a little bit more.

Probably went like this: 'web server crashed. so did another. what page did they crash on? ok, let's take a look at the post on that page. what in the....'


Agreed that's impressive debugging for this issue. But...

> 10 minutes to roll out the fix

That seems very slow to me. 30% of their down time was because their deploy process is slow.


http://nickcraver.com/blog/2016/05/03/stack-overflow-how-we-...

FWIW here's a write up on their process

Also I imagine that 10 minutes included dev and testing, not just the deployment part of "rolling out"


Seemed to say that 14 minutes were spent writing the code to fix it which I assumed meant testing it, but ya not entirely clear.

Maybe "deploy" means the "Deploy" section of this article: http://highscalability.com/blog/2014/7/21/stackoverflow-upda...

Seems to target only being able to "deploy 5 times a day". I guess maybe the build time is the limiting factor.


Correct. 10 minutes was from checkin to all servers built out, including a dev and staging tier.


10 minutes roll out to production is insanely fast. Roll out usually goes through build, test, staging, and to farms of production servers, with smoke tests in each stage along the way.


Rollout to thousands of servers on WordPress.com is typically less than 60 seconds. We optimize for reverting fast.

It's just interesting to me the implications of what folks optimize for and that this is considered fast. We have very minimal deploy testing and optimize to be able to revert quickly when there are problems, because performance issues like this are very hard to predict. Probably means we create many smaller short hiccups though (that generally are not a full site crash).


I guess it was just:

- login to server

- make dump of all threads stack traces

- see that something like Regexp.match present in all stacks

- find function that called this regexp.


Time to pop this old chestnut out:

https://blog.fastmail.com/2014/12/14/on-duty/

"At one stage, we decided to try to avoid having to be woken for some types of failure by using Heartbeat, a high availability solution for Linux, on our frontend servers. The thing is, our servers are actually really reliable, and we found that heartbeat failed more often than our systems - so the end result was reduced reliability! It's counter-intuitive, but automated high-availability often isn't."

One of these days we'll finish our new system and I'll blog about that, which is that the automated systems are allowed to take ONE corrective action without paging, at which point they flag that the system is in compromised state. Any further test failures trigger an immediate wake of the on-call.


I wondered about this for some time.

Simple regexes (as in formal language theory) are matched in O(n) time by a finite automaton.

Extended regexes like PCRE are more powerful, but most of the time they are implemented by backtracking engines, where a really bad regex pattern might go exponential, and even a simple pattern like the one in the postmortem can go O(n^2).

Do implementations optimize simple regex patterns to O(n) matching? I even wrote an x86 JIT regex compiler for fun some time ago. Compilation time was really bad, but matching was O(n).


There are a few implementations that are linear, but compilation time is then exponential instead.


Which is still a big win because

* regex pattern is controlled by site, while regex input is external

* regex pattern is compiled once, while it is being run for every input


As perlfaq4[1] shows:

    > You can do that with a pair of substitutions:

    >         s/^\s+//;
    >         s/\s+$//;
It then notes, in an understated manner:

    > You can also write that as a single substitution,
    > although it turns out the combined statement is 
    > slower than the separate ones. That might not 
    > matter to you, though:

    >        s/^\s+|\s+$//g;
[1]: http://perldoc.perl.org/perlfaq4.html#How-do-I-strip-blank-s...


In light of the given issue, it might be a good idea to update the example code to use \s++ instead...


Experienced something similar myself. Was even thinking about creating a regular expression library which allows only "safe" and fast expressions.

The trick would be to allow only expressions that can be translated easily to a state automaton.

Good regex: "Phone number [0-9]* "

Bad regex: ";Name=.*;" as . can also match ";" and it can lead to bad backtracking. You should rewrite this regex to ";Name=[^;]*;"

RE2 is probably the best implementation so far, but because it tries so hard to preserve backward compatibility with all regular expressions, it is not that fast in the average case: https://swtch.com/~rsc/regexp/regexp1.html


"the entire site became unavailable since the load balancer took the servers out of rotation." I don't care about the regexp, this is bad SRE, you can't just take servers out of rotation without some compensation action.

Never mind that it looks like all web servers where taken out of rotation, even one server down could cause a cascading effect (more traffic directed to the healthy ones that end up dying, in a traffic-based failure). One action for example after n servers have gone down, (besides getting up other m servers) is to put (at least some) servers in a more basic mode (read only/static, some features disabled), not guaranteed but that could have prevented this and other type of down times.


I have the same concern - including with our own system. As far as I know, all LBs are designed with this faulty logic. That's the case with our in-house F5 BigIPs, and I believe also for AWS's ELBs.

Sure, it makes sense to cut malfunctioning servers out of the production pool when discovered. But at the point where you've dumped out every production server, you've just created the worst possible situation - you're absolutely guaranteed not to be able to serve any traffic. At the point you get to zero functioning servers, a better response would be to restore all servers to the pool and just hope. That would be bad, but less so.

What we've done on our internal systems is to set things up so that when the last server is dropped, it fails over to a passive pool that is really just the normal production pool with no health checks. But I'm not aware of any LB that supports this directly, we had to glue together some ugly configurations to get this to work.


I was thinking the same while reading that part, and I also find it very scary. Why would you even _let_ the load balancer take off _all_ of your servers?


Yesterday I couldn't use hipchat for a couple hours because it would lock up a cpu and fail to load. After doing some free debugging for them I realized they were locking up trying to extract urls out of some text with a regex. Simplified code: https://gist.github.com/shanemhansen/c4e5580f7d4c6265769b0df...

Pasting that content into hipchat will probably lock up your browser and webview based clients. Beware.

Lesson learned: don't parse user input with a regex.


Google cache saved me during these 34 minutes.


Mac app Dash allows downloading & browsing Stack Overflow offline by language. 834 MB for Python.

https://kapeli.com/dash


Funny, I was coding on a Chromebook with 525M left in "/".


It seems like there should be a way to determine whether a regex can be compiled using the classic O(n) DFA algorithm or with whatever madness PCREs use to support backtracking and so on.

Anybody know if any regex engines attempt this?

Obviously you can still shoot yourself in the foot, but it's somewhat more difficult to do so in a situation like this where the regex in question "looks" cheap.


> It seems like there should be a way to determine whether a regex can be compiled using the classic O(n) DFA algorithm or with whatever madness PCREs use to support backtracking and so on.

Obviously. If the "regex" includes a backreference, it requires backtracking. If it includes only regular operations (I can't call any other nonregular operations that people might expect to mind), it doesn't. This is information that can be trivially surfaced by a parser.


I wonder why regex libraries don't do this then? Isn't /\s+$/ backreference free? Or does including the anchor change that?


It's backreference free. Backtracking to process it is just bad design.

Here's a state machine that recognizes /^.*\s+$/:

1: initial state (non-matching)

2: matching state

all states transition to 2 on \s, and to 1 on \S. Done.

Taking this a little further, we can observe that since the transition table is identical for every state, there's no real need to have states at all -- you get the same effect by just checking the last character.
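A throwaway Python sketch of that last point (an illustration, not anyone's production code):

    # "does this string end in whitespace?" needs only the final character
    def ends_in_whitespace(s: str) -> bool:
        return s[-1:].isspace()   # s[-1:] is "" for an empty string, and "".isspace() is False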


As an alternative, in Java there is a FindBugs plugin, "find-sec-bugs", that flags potentially long-running regexes.

In my current project, it found two of those in Google's ZXing library.


For people who already care enough about this there is Ragel.


Regex was not the main issue. The main issues were:

1. Rendering a page fails/does not terminate if some non-essential subtask (rendering a single code block) fails/does not terminate.

2. They do not try to detect bad data (the way they certainly try to detect bad code)

3. Load balancing based on the rendering time of a single page

Code bugs triggered by bad data will happen again, with or without regular expressions.


> 2. They do not try to detect bad data (the way they certainly try to detect bad code)

It was actually the sanitization function regex that became unresponsive. So they killed the page by trying to stop bad data coming in.

Agreed on your other points though.


Could this have been a deliberate/malicious act? Why else would someone post 20,000 consecutive characters of whitespace on a comment line?

Also, the "homepage" of StackOverflow does not show any 'comments' - it is just the top questions? Why was the page loading any comments in the first place?


We don't think it was intentional. Maybe copy and paste, or something an editor did.

To clarify the "comment" was not a Stack Overflow comment, but rather a comment in a code block inside a question.


I am genuinely curious: how did you fix it? Did you remove the spaces first and then try to use substring/trimming, with proper testing done, or did you just implement it in place? I have faced similar dilemmas in the past, but I usually go with the "put out the fire, then find the correct solution" approach.


Implemented it in place to put the fire out. Pushed to half the web servers, made sure it fixed the problem, then rolled it out to the rest. Coding under fire :P


How did you test this on half? Accessing the single web server directly I'm guessing? Probably not public


May we see this code comment? Sounds interesting.


Imagine if they had only posted 10,000 or 15,000 characters, and it just slowed the site down. How fast would it have been noticed? Hours? Days?


Good monitoring (which I expect the SO guys to have) would have triggered on a spike in 95th or 99th percentile response times, so probably almost as quickly.


Actually we average less than 10% CPU on the web tier... so anything over that warrants investigation. Graph of CPU during today's issue at https://twitter.com/Nick_Craver/status/755793398544601088


On a site like Stack Overflow, I would think they get malicious requests like that all the time. Even if it wasn't malicious (perhaps an otherwise benign script gone amok), I bet something like this happens very often.

It would be interesting to find out if something like this goes into their automated regression tests. :)


more likely select all copy + paste


We had a similar issue arising from regex parsing of our SES routes on our SaaS platform. We had made some changes to our generated SES file which caused it to balloon to 4x in size (tens of thousands of lines). Our only clue that something had gone wrong was suddenly extremely high IIS usage. With some help from Microsoft support, we managed to trace the stack during the high-CPU event to an ISAPI filter and ultimately our 3rd party SES plugin. We managed to fix the problem by being more efficient with our regex generation and reducing the number of rules the plugin was processing, but it was eye-opening how much CPU was being consumed by regex processing.


I like this because it shows how important it is to understand the inner workings of the tools in your toolbox. It could serve as a nice example in a 'Languages and Grammars' course at university, for additional motivation.


They implemented trim with a regex? Neither Java nor .NET do that.

The postmortem here should probably be "why are you reimplementing trim".


Very old code, hard to say what the thinking was at the time. That is why we are doing an audit looking for any unneeded regexes, or regexes that are susceptible to backtracking


Most likely answer: The built-in trim isn't Unicode aware or has buggy behavior when dealing with Unicode.


.NET builtin trim lets you specify which 'char' values to trim [1]. .NET is notoriously tied to UTF-16 (a char is 16 bit), and you have to handle surrogate pairs very carefully, but I can't really imagine that being any better with the built-in System.Text.RegularExpressions.Regex

1: https://msdn.microsoft.com/en-us/library/d4tt83f9(v=vs.110)....


Assuming they're still using ASP.NET 4+, it is very Unicode aware/safe. I don't know why a developer would reinvent Trim(), but I do know it isn't a .NET limitation.


This code was written 5 years ago, and back then the trim function was different.


My guess is developer cleverness. As the article mentions they ended up with a less clever solution.


I like the trim() functions that accept a list (of some form) of 'characters to remove'. It is far more deterministic (better worst-case outcome) to follow this approach than it is to run a regexp.


Easy to reproduce [1]. Just remove the a at the end and the timeout disappears. Anybody know which regex engine they used?

[1] http://regexr.com/3drn3


I had a tough time reproducing it in Perl, although I think I managed to get it.

  % perl -v
  This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
  % ls -l testfile
  125380092 Jul 20 16:41 testfile
  % time ./regextest testfile
  ./regextest testfile  0.36s user 0.03s system 99% cpu 0.398 total
  (Remove the a at the end)
  % time ./regextest testfile
  ./regextest testfile  0.13s user 0.03s system 99% cpu 0.156 total
I had to make a much larger input file than I expected before I was able to really see the difference. Given the O(n^2) nature of the bug, I was expecting it to take much longer.


I could not reproduce the slowness in Perl, but I could in JavaScript (specifically node.js).

There are many, many modules that use /\s+$/, including at least one form validator.


I wanted to figure that out. I found a meta post stating that SO uses C#, so that might narrow it down.


I'm still confused why people would use a backtracking regex engine in cases when they don't need recursive regex extensions (or other questionable extensions like back references). A "correct" (from the CS perspective) regex engine wouldn't have had this or many other problems that people encounter when doing regular expression matching. If they had piped out to sed or awk this wouldn't have happened, since GNU grep, sed and awk use a proper regex engine.


My blog post[1] on how to test for catastrophic backtracking using RegexBuddy.

[1] https://blog.mozilla.org/webdev/2010/11/15/avoiding-catastro...


Backtracking regex matchers are a Bad Idea.

It's true you need them to implement backreferences. But I've never used such a thing. If I were creating a runtime for some new language, I would simply ignore that part of the POSIX standard.


The whole postmortem focuses on a regular expression bug and how that bug was fixed, and completely ignores the fact that if the home page becomes unavailable, the load balancer logic will shut down the entire site.


I still haven't figured out why regex engines don't use state machines where possible (i.e. in the absence of backreferences and such). Is that not an obvious optimization?


ugh. i would've just sat there wondering WTF, then proceeded to initiate daily backup recovery.


Haha, really!

Maybe I'm less experienced (1.5 yrs), but it would have taken me significantly more than the 15 mins (was it?) that it took them to diagnose the problem.

Props to the team, and Thank All 0.33G Gods I was not on call!


    > It took 10 minutes to identify the cause,
Impressive, considering:

    > cause was a malformed post that caused one of our
    > regular expressions to consume high CPU ... called on
    > each home page view ... Since the home page is what our
    > load balancer uses for the health check, the entire site
    > became unavailable since the load balancer took the
    > servers out of rotation.


Not understanding why backtracking happened. Once it hit a non-space, non-end character, move on. Nothing before it can match the regex.


It finds a space, tries to match the rest of the string, fails, rolls back to it and goes forward one, finds a space...

The trick is that there's no guarantee, in general, that a match failing at character N is due to character N, so the regex engine backtracks.
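A rough way to see the cost of that retry-from-every-space loop (Python for illustration; the sizes and timings are only indicative, and re.search has the "match anywhere" semantics discussed further down):

    import re
    import time

    haystack = " " * 20000 + "x"   # long run of spaces that is NOT at the end of the string

    start = time.perf_counter()
    re.search(r"\s+$", haystack)   # fails, but only after retrying from each space in turn
    print(f"search took {time.perf_counter() - start:.2f}s")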


I guess I need to buff up on my automata, but I would have thought that for positional delimiters, there would be optimizations. If the pattern is \s+$, then I can look from the back of the line, see if there is a space, and if so, go backward until I find a non-space character.


That's just special casing the end of line. If it was "\s+x" you'd still have the same problem.


Seems like there is still no better way to deal with these kinds of mistakes than preemptive Erlang-style lightweight processes.


Someone wiser than me said once that if you have a problem and want to fix it with a regex then you now have two problems :-)


By the way, this is the post that broke StackOverflow: http://stackoverflow.com/questions/38484433/join-tiles-in-co...


For me the awesome part is that a cliché that is quite popular on SO takes down SO. And the resolution of replacing the regex with a substring completes the picture. I just cannot stop laughing.

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."


    > 20,000+19,999+19,998+…+3+2+1 = 199,990,000
= 200,010,000, not that anyone's counting :)
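Presumably the post summed 1 through 19,999 rather than 1 through 20,000:

    19,999 × 20,000 / 2 = 199,990,000   (1 + 2 + … + 19,999, the figure in the post)
    20,000 × 20,001 / 2 = 200,010,000   (1 + 2 + … + 20,000)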


In reading this post, I realised this was the first time I'd ever visited the Stack Overflow homepage.


> Add controls to our load balancer to disable the healthcheck – as we believe everything but the home page would have been accessible if it wasn’t for the the health check

Wouldn't regular users, trying to access the homepage have yielded the same effect?


> This regular expression has been replaced with a substring function.

I always cringe when I see regex used for such simple string checks. In fact, Stackoverflow is full of accepted answers that "solve" problems that way.
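For comparison, a scan-based trim is only a few lines. This is just a sketch (the post doesn't show Stack Overflow's actual substring code, and theirs is C#, not Python); the point is that each character is examined at most once, so the worst case stays linear and there is nothing to backtrack over:

    def trim(s: str) -> str:
        # Strip leading and trailing whitespace by scanning inward from both ends;
        # each character is looked at no more than once.
        start, end = 0, len(s)
        while start < end and s[start].isspace():
            start += 1
        while end > start and s[end - 1].isspace():
            end -= 1
        return s[start:end]

Python's built-in str.strip() already does exactly this, of course; the sketch is only to show why there's no pathological input.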


paging jwz. something something two problems.


The Stack status page contains 3 script tags before the HTML tag.

This is what I saw on my Kindle 3 Keyboard:

This page contains the following errors:

error on line 2 at column 36: Extra content at the end of the document

Below is a rendering of the page up to the first error.

var __pbpa = true;


They have limits on everything (comments per second, edits per second, upvotes per day, reputation earned per day, etc.), so it seems like they should have an upper-bound character limit on what they accept too.


There are some very long Stack Overflow answers that contain a wealth of information. If there were a limit, it would likely need to be far greater than 20k characters anyway (the number of whitespace characters that caused the outage).


They do. It's 30K characters. Occasionally you see genuine posts bumping up against it.


For more bugs caused by quadratic complexity:

http://accidentallyquadratic.tumblr.com/


Any more proof needed that caching should become a system-provided service over the next 10-20 years, the same way memory management did in the past 10-20 years?


If it was in a comment, why was the home page loading it?

preemptive caching?


" This regular expression has been replaced with a substring function." I came to rely on Regex so much that I almost feel we'd be the next.


This is great. I just want to add something that might not be well-known: StackOverflow is all hosted from ONE web app server! It handles all the writes.


Productivity plummets worldwide (regex attack vector)


Correct me if I'm wrong, but couldn't this have been fixed by making the match possessive:

^[\s\u200c]++|[\s\u200c]++$

That should stop any runaway backtracking?


200,010,000, not 199,990,000; probably the author looped over a 0-based index. n*(n+1)/2 is even better ^^ Nice postmortem though.


TIL what language Stack Overflow is written in.


what language is that? Couldn't find it in the post.


C#


Yes, one of the best postmortems, especially the technical part.

I'm sure it was simple, but I'm curious to know what the replacement substring code is.


Favourite comment from the Reddit thread on the matter:

"Well, that should stave off the imposer syndrome for another couple of days."

-u/minno


Very interesting bug. People forget sometimes how expensive a regex can be compared to simple pattern matching.


Why isn't the trim applied when the post is created and not every time that it's displayed?


Regex: ruining your life since 1956.


I want to believe that a cat fell asleep on the space bar. Then eventually woke up and posted.


This is great: regex errors always remind me of this classic post by Jeff Atwood (co-founder of Stack Overflow): https://blog.codinghorror.com/regular-expressions-now-you-ha...


Hmmn, if it was one request, how did it cause the other web servers in the farm to go down?


Wow, I didn't notice today! I must not have been coding very much.


But... how did they search for a fix without resorting to Stack Overflow? :)


> If the string to be matched against contains 20,000 space characters in a row, but not at the end, then the Regex engine will start at the first space, check that it belongs to the \s character class, move to the second space, make the same check, etc. After the 20,000th space, there is a different character, but the Regex engine expected a space or the end of the string. Realizing it cannot match like this it backtracks, and tries matching \s+$ starting from the second space, checking 19,999 characters. The match fails again, and it backtracks to start at the third space, etc.

That's not how backtracking works. A regex engine will only backtrack to try and make the rest of the regex match, i.e. it will take characters off the RHS of the string, not try and start "from the second character off the start of the string". I mean, if the engine tried matching from the second space, what would be matching the first space? Something has to.

Which meant that even if the regex engine was incredibly stupid and could not figure out that a greedy block of \s was never going to contain a non-\s, it would only have to check 20,001 times, not 199000 (or whatever it was).

I can't reproduce this "bug" in either Perl or Python. The time taken to match a 30,000 block of space either followed by $ or XX$ was basically identical for \s+$.

There does appear to be normal backtracking going on, roughly doubling the search time for large strings terminating in non-\s. This is expected, as it has to check 20,000 during the first gobble, then 20,000 as it backtracks from the right 20,000 times.

    $ time perl -e '(" " x 100000000 . "X") =~ /\s+$/ && print "MATCH"'

    real	0m0.604s
    user	0m0.509s
    sys	0m0.094s

    $ time perl -e '(" " x 100000000) =~ /\s+$/ && print "MATCH"'
    MATCH
    real	0m0.286s
    user	0m0.197s
    sys	0m0.089s


> I mean, if the engine tried matching from the second space, what would be matching the first space? Something has to.

Some regex engines provide an API call that puts an implicit `.*?` at the beginning of the regex so that the semantics of the match are "match anywhere" as opposed to "match only from the start of the string." (This is in fact the difference between Python's `match` and `search` methods.) Assuming the OP was using this type of method, then this difference exactly explains why you can't reproduce it. I can:

    >>> import re
    >>> haystack = (' ' * 20000) + 'a'
    >>> re.match('\s+$', haystack) <-- is wicked fast
    >>> re.search('\s+$', haystack) <-- chugs a bit
indeed, this chugs a bit too:

    >>> re.match('.*?\s+$', haystack)
So technically, the OP left out this little detail, but it's a pretty common thing to find in regex libraries. In fact, Rust's library treats all searches as if they were re.search and provides no re.match function, instead preferring to require an explicit `^` to remove the implicit `.*?` prefix.


For anyone else who wants to time the examples without copying & pasting each line:

    python3 -m timeit -n 1 -r 3 -s "import re ; haystack = (' ' * 20000) + 'a'" -c "re.match('\s+$', haystack)"
1 loops, best of 3: 467 usec per loop

    python3 -m timeit -n 1 -r 3 -s "import re ; haystack = (' ' * 20000) + 'a'" -c "re.search('\s+$', haystack)"
1 loops, best of 3: 4.23 sec per loop

Options

    -n  how many times to execute statement
    -r  how many times to repeat the timer
    -s  setup code (run once)
    -c  use time.clock for timing (the statement itself is passed as a positional argument)


I guess this must have been the scenario. Perl also defaults to re.search but does not exhibit the pathological case, maybe because it knows it can't find a $ in a block of \s.


By .? you mean .* ?


You are wrong and shouldn't speak authoritatively about things you know little about. Most regex engines, including the one in Perl, do backtrack as described. As a special case, Perl will match backwards from a $ (or in look-behind), in specific cases, but not in the case described, where the $ only appears in one part of an alternation. Similarly, as the other poster noted, most regular expression engines default to 'search' as the primitive, rather than matching only from the start.

So, yes, in a sequence of 20,000 characters not appearing at the end of the string, the typical regex engine will backtrack in a manner exposing O(n^2) performance in this case.

If you're going to test something, use the actual regexes they talk about, and perhaps the language/runtime they talk about.


Perl does a pretty good job at optimizing such things:

    $ time perl -Mre=debug -e '(" " x 100000000 . "X") =~ /\s+$/ && print "MATCH"'
    Compiling REx "\s+$"
    Final program:
       1: PLUS (3)
       2:   POSIXD[\s] (0)
       3: EOL (4)
       4: END (0)
    floating ""$ at 1..9223372036854775807 (checking floating) stclass POSIXD[\s] plus minlen 1 
    Matching REx "\s+$" against "                                                            "...
    Intuit: trying to determine minimum start position...
      Found floating substr ""$ at offset 100000001...
      (multiline anchor test skipped)
      looking for class: start_shift: 1 check_at: 100000001 rx_origin: 0 endpos: 100000001
      Does not contradict STCLASS...
    Intuit: Successfully guessed: match at offset 0
    Matching stclass POSIXD[\s] against "                                                            "... (100000001 bytes)
       0 <> <          >         |  1:PLUS(3)
                                      POSIXD[\s] can match 100000000 times out of 2147483647...
    100000000 <           > <X>       |  3:  EOL(4)
                                        failed...
                                      failed...
    Contradicts stclass... [regexec_flags]
    Match failed
    Freeing REx: "\s+$"
    
    real    0m0.427s
    user    0m0.312s
    sys     0m0.076s
Basically, the regex engine sees a $ anchor, moves to the end of the string, and finds that it can't ever match there.

It makes it quite hard to accidentally trigger such bad regex behavior. See for example https://perlgeek.de/blog-en/perl-tips/in-search-of-an-expone... (disclaimer: my own blog)


> I can't reproduce this "bug" in either Perl or Python.

You're doing it wrong then.

  #!/usr/bin/perl

  use Benchmark;

  sub trim {
      my $s=shift;
      $s =~ s/^[\s\x{200C}]+|[\s\x{200C}]+$//g;  # \x{200C} (ZWNJ) is Perl's spelling of the original pattern's \u200c
      return $s;
  }

  timethis(100, sub { trim("x" . " " x 20000 . "x"); });
With Perl 5.18 I get 2.65/s on a fast iMac.


Is there a difference between greedy and non-greedy atoms?


In the order of searching and hence the match you can get, yes. In performance in the case of a non-match, no.


The regex a?a?a?a?a?aaaaa against the string aaaaa will complete in linear time if you use non-greedy ?s but with greedy matching it has exponential complexity. So, there is a difference.
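A hedged way to see the difference (Python for illustration; n and the exact timings are machine-dependent, but each extra a? roughly doubles the greedy time):

    import re
    import time

    n = 22
    s = "a" * n
    patterns = {
        # lazy optionals try "match empty" first, so the only successful
        # assignment (all optionals empty) is found almost immediately
        "lazy": ("a??" * n) + ("a" * n),
        # greedy optionals grab the a's first, so roughly 2^n combinations
        # are tried before the all-empty assignment finally succeeds
        "greedy": ("a?" * n) + ("a" * n),
    }
    for name, pattern in patterns.items():
        t0 = time.perf_counter()
        matched = re.match(pattern, s) is not None
        print(f"{name:7s} matched={matched} in {time.perf_counter() - t0:.3f}s")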


> So the Regex engine has to perform a “character belongs to a certain character class” check (plus some additional things) 20,000+19,999+19,998+…+3+2+1 = 199,990,000 times, and that takes a while.

199,990,000 isn't really all that many. I'm a little surprised it didn't just cause a momentary blip in performance.

edit: whoops, i guess that's per page load


SO is I/O bound most of the time. If you've set up your system to handle high workloads of I/O bound traffic, then hitting CPU bounds throws a real wrench in your cogs.

To put this another way, SO is one of the most trafficked sites on the internet, so a page that's loaded 10k+ times a second is going to push that number much, much higher. If the CPU can't clear 10k+ requests in the usual time it takes, everything starts to back up.
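Rough arithmetic, with the per-check cost only a guess: at ~5 ns per character-class check, 200,000,000 checks is on the order of a second of CPU per home page render; multiply that by thousands of renders per second and the web tier has no chance.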


> To put this another way, SO is one of the most traffic'd sites on the internet.

I've seen this said several times here, but never bothered to ask.. by what measure is this true?



Alexa puts it at rank 50 globally. Honestly this seems suspiciously high and I wonder if their sampling method is biased. Regardless, it's a lot of traffic.

http://www.alexa.com/siteinfo/stackoverflow.com


They mentioned that the drop in performance caused the load balancer to remove the servers from rotation. They said that the site would have otherwise been pretty functional. So the resolution to the issue (aside from fixing the regex) was to update the configuration of the loadbalancer/healthcheck to avoid this type of issue.


I presume the strip function happens on every request for the page, which, for the Stack Overflow homepage, means 199,990,000 character checks times quite a lot of requests a second.

It's gonna hurt


Per homepage load.


199,990,000 * # of people visiting StackOverflow home page

The result is likely at least 1,000 * 199,990,000 at any given point.


The lesson seems to be "Always run trim() before running regex" and "validate content as much as possible before running regex".


Trim would not have worked: the post started with '-- play happy sound for player to enjoy', had 20,000 characters of whitespace, and then some other character.


Which also wouldn't have been caught by the regex, because it was designed to do the same as Trim. Given the input, all that whitespace was not actually supposed to be removed, and the regex worked; it was just a degenerate case that slowed it down to a crawl. Trim would not slow down on this input.


Trim won't work; otherwise SO would be using trim. The regex did work in all cases, but it was slow under certain conditions. The slowness caused a separate system to shut down the site.

The regex worked fine, 100% within expectations.


No, the regex was an actual trim. And they have indeed switched to that. The edge case that caused the slow performance was a lot of whitespace that did not appear at the start or end of the string. The regex would not remove that, nor would trim, but trim is much more efficient in doing the exact same thing. The regex may have been correct, but I doubt the expectations of the developer were met and it most definitely was the wrong tool for the job.


Or just reject this input. The server isn't the time and place to do data cleanup.

You can always fix this on the frontend with JavaScript for free if it's an actual problem.


You can't trust that the user has JavaScript enabled; and since you can't trust user input anyway, you'd have to do this server side.


"You can't trust user input" means to not fudge with user data in the first place!

If your code cares about unicode spaces at the beginning of user input, there is your problem.

(Also, you missed the reject part)


How would you suggest running a user-created-content site without ever fudging/sanitizing the data being posted to the server?


Sure you can, but you better clean up on server side, or find a way to prevent any of 7+ billion people from posting maliciously.


> Can always fix this on the frontend with JavaScript for free

Nice to see what people think of offloading work to users' systems. This kind of thinking leads to slow sites with fucked up scrolling.


This seems like a genuinely hard-to-anticipate edge case. I think catching edge cases is needed (which means more rigorous testing). This is the equivalent of algorithm complexity analysis: how bad can my algorithm be? But regular expressions, to be honest, are usually something whose performance I hardly think about. I don't know about others, but most of my inputs are small enough. How big an input should I test? If I were dealing with a lot of characters, I would be doing substring replacement.


> But regular expressions, to be honest, are usually something whose performance I hardly think about.

This is actually not an uncommon problem. I recently experienced a backend system going down because of catastrophic backtracking. There is a reason why proper regex libraries have a timeout on the match methods.

For example: https://msdn.microsoft.com/en-us/library/hh160204(v=vs.110)....
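Python's standard re module has no such timeout; a blunt workaround, sketched here purely for illustration, is to run the match in a worker process with a deadline:

    import re
    from multiprocessing import Pool, TimeoutError

    def has_trailing_whitespace(s):
        return bool(re.search(r"\s+$", s))

    if __name__ == "__main__":
        haystack = " " * 20000 + "x"
        with Pool(1) as pool:             # terminated on exit, killing a runaway match
            try:
                result = pool.apply_async(has_trailing_whitespace, (haystack,)).get(timeout=0.1)
            except TimeoutError:
                result = None             # treat "too slow" as a failure instead of hanging
        print(result)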


A time limit on something like this is so extremely awkward to use. My first problem would be that I'm concerned the implementation doesn't have the resolution; a "reasonable" time limit would be what, 1us for a regex match?

Which brings us to the second point: what's this, wall clock time, CPU time? I risk testing this with low load on my dev computer only to have it fail in production because high CPU load means the match just took longer (but wasn't running into a corner case).

I think a reasonable "timeout" would be to give an upper limit for character array accesses, like 20*strlen.


My rephrasing of their follow-up actions:

* "Audit our regular expressions and post validation workflow for any similar issues"

* ==> "Not even people who've worked for years on the guts of regex engines can easily predict the runtime of a given regex, but somehow our engineers will be expected to do that".

* "Add controls to our load balancer to disable the healthcheck – as we believe everything but the home page would have been accessible if it wasn’t for the the health check"

* ==> "Our lb check was checking /index, that failed because /index was slow: Lesson learned, let's not lb check anything at all"


I think an audit on regex usage would be fine in this case because it is silly to be using an expensive regex on the server for every home page display. It should be done when the data item is created, not when it is displayed.

Even if they need to support a full round trip back to the originally entered data for editing purposes, they could have an extra column in the table for that purpose only, or they could cache the displayed output for the post in something like Redis.

I assume, just like those PHP forums of old, they change so much about the page based on the logged in user they can't actually fully cache the home page for all users.

So an audit should just lead them to remove the regex on display anyway, not try to figure out the run time of it.


Regex engines usually have an option to set an upper limit on the time they take to match, so they don't end up in expensive loops. A simple solution could be to set up such time limits.


I don't think they're disabling health checks in general, just adding a way to temporarily disable checks when they are misbehaving.


Correct. I'll be adding this functionality into Opserver so that we can override the health check in HAProxy when we know better.



