I've fixed so many bugs using regex, only to have to fix several bugs later.
My current stance is: avoid regex if at all possible. Turns out, many of the things we use regex for are possible without it. Oftentimes, .Substring, .IndexOf, and LINQ over strings are sufficient.
Regexes are almost always a massive code smell. They should almost never be used: bad idea, bad implementation, hard to grok, hard to debug, hard to test, hard to spot.
Whoever came up with them has surely been given the same honorary place in hell as Jon Postel, who invented the utterly disastrous "be liberal in what you accept", which has plagued all web developers for the last 20 years.
> Regexes are almost always a massive code smell. They should almost never be used
Regular expressions are just a notation for expressing a finite-state automaton that recognizes a regular language: if that's the problem that you have, regular expressions are definitely the tool you want to be using. For instance, I recently made myself a small tool to compute the min/max/expected value of a dice roll; the notation for such a thing forms a regular language that can be expressed as follows:
Converting this grammar to a regular expression is straightforward and was the correct tool to use.
I agree that regular expressions are often used in contexts where they should not be, especially when people start using back-references to recognize non-regular languages, but don't throw out a perfectly good tool because sometimes it is ill-suited.
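The grammar itself isn't reproduced in this thread. As a rough sketch only, assume a simple hypothetical notation of the form `NdM+K` (for example `2d6+3`). The pattern, the name `dice_stats`, and the choice of Python are illustrative assumptions, not the commenter's actual tool.

```python
import re

# Hypothetical dice notation: optional count, "d", sides, optional +/- modifier.
# No nested quantifiers, so there is nothing for the engine to backtrack over.
DICE = re.compile(r"(\d*)d(\d+)([+-]\d+)?")


def dice_stats(expr: str) -> tuple[int, int, float]:
    """Return (min, max, expected value) for an expression like '2d6+3'."""
    match = DICE.fullmatch(expr)
    if match is None:
        raise ValueError(f"not a dice expression: {expr!r}")
    count = int(match.group(1) or 1)     # "d20" means one die
    sides = int(match.group(2))
    modifier = int(match.group(3) or 0)  # "+3" or "-1", sign included
    minimum = count + modifier           # every die shows 1
    maximum = count * sides + modifier   # every die shows its highest face
    expected = count * (sides + 1) / 2 + modifier
    return minimum, maximum, expected


print(dice_stats("2d6+3"))  # (5, 15, 10.0)
```

Anything that doesn't fit the notation is rejected up front by `fullmatch`, which is about all a recognizer for a regular language needs to do.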
Regexes are perfectly fine. Awful implementations that accept patterns that aren't regular expressions without complaint and provide little to no tools to look at underlying automata or to step through them during execution are the problem.
It's quite an amazing problem to have, honestly, because it really shouldn't be hard to create a proper implementation. You can learn the theory behind regular expressions in a few hours and know pretty much everything there is to know.
The point of this post is that even though the regex behaves correctly (input x produces expected output y), you also need to consider performance constraints.
Without fuzzing, it's going to be pretty difficult to come up with enough test cases to thoroughly test a regex.
Wouldn't the same apply to a custom method? Why would code using a combination of indexOf(), substring() and trim() be any more foolproof on arbitrary data?
Yeah, testing what you expect in regexes is super easy. Edge case testing is not, though. That's exactly what happened in the post this discussion is about.
... is a part of the so-called Unix philosophy, the CAUSE of the internet. The fact that many web developers forgot to become programmers is just a fun fact that changes nothing.
I'll write a trivial state machine over using regex any day of the week. They are easier to write and read (especially if complex), aren't a black box in terms of profiling and don't have surprising performance characteristics like regex does. Unless we're talking about a Thompson NFA (which isn't used by many modern stdlibs) or JS, regexes simply do not appear in my belt of solutions.
Heh... as a Java developer, my favorite version of that joke is where you solve the problem with Java, and now you have an `AbstractProblemFactoryFactory`.
Nick explained on Reddit why the regex was used[1]:
> While I can't speak for the original motivation from many moons ago, .Trim() still doesn't trim \u200c. It's useful in most cases, but not the complete strip we need here.
This would have probably been my train of thought (assuming that I consider regex to be a valid solution):
Trim() would have been the correct solution, were it not for that behavior. Substring is therefore the correct solution. Problem is, IndexOfAny only accepts a char array (not a set of some form, e.g. HashSet). You'd need to write the <Last>IndexOfNonWhitespace methods yourself. Use a regex and make sure that it doesn't backtrack, because it's expressive and regex "is designed to solve this type of problem." The real problem/solution here isn't substring, it's finding where to substring.
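For what it's worth, the "find where to substring" part is only a few lines when written by hand. A minimal sketch in Python rather than C#, assuming the characters to strip are Unicode whitespace plus U+200C; the names `EXTRA_STRIPPABLE`, `is_strippable`, and `trim` are made up for illustration:

```python
# Assumed set of extra characters to strip; U+200C is the one named in the thread.
EXTRA_STRIPPABLE = frozenset({"\u200c"})


def is_strippable(ch: str) -> bool:
    """True for Unicode whitespace or any of the extra characters."""
    return ch.isspace() or ch in EXTRA_STRIPPABLE


def trim(text: str) -> str:
    """Strip leading and trailing strippable characters in one linear pass each."""
    start = 0
    end = len(text)
    # Index of the first character to keep.
    while start < end and is_strippable(text[start]):
        start += 1
    # One past the index of the last character to keep.
    while end > start and is_strippable(text[end - 1]):
        end -= 1
    return text[start:end]


print(repr(trim("\u200c  hi there \u200c\t")))  # 'hi there'
```

Each character is looked at a bounded number of times, so the cost stays linear no matter how long a run of whitespace the input contains.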
I consider regex too dangerous to use in any circumstance, but I can certainly see why someone would find it attractive at first.
Oh totally. I assumed that unicode bs immediately. And anyone would make this mistake easily. That's the point -- gotta have it imprinted in the brains, that regexes are for finding things in files, not for your production code.
I've used them myself, but I'd like to think that when I type that regex in, I stop and think about whether I will be feeding raw user input into it.
Compressing multiple forms of non-Unicode whitespace to a single space. Used for cleaning text from input fields that often contain unwanted characters from copy/paste.
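That kind of cleanup also works as a tiny two-state loop instead of a regex. A sketch under assumptions: "non-Unicode whitespace" is taken here to mean the six ASCII whitespace characters, and `collapse_whitespace` is a hypothetical name.

```python
# Assumed definition of "non-Unicode whitespace": the ASCII whitespace characters.
ASCII_WHITESPACE = frozenset(" \t\r\n\f\v")


def collapse_whitespace(text: str) -> str:
    """Replace each run of ASCII whitespace with a single space."""
    out = []
    in_run = False  # state: are we currently inside a whitespace run?
    for ch in text:
        if ch in ASCII_WHITESPACE:
            if not in_run:
                out.append(" ")  # emit one space for the whole run
                in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)


print(repr(collapse_whitespace("pasted \t\n text")))  # 'pasted text'
```

Leading and trailing runs collapse to a single space rather than disappearing, and characters such as U+00A0 pass through untouched; whether that is the desired behavior depends on the input fields in question.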
My guess is they're searching for the first non-whitespace character, reverse-searching for the last non-whitespace character, and then using String.Substring to return only what's in the middle.
What difference does it make if it got in there accidentally or on purpose?
Stack Overflow is a programmers' site; you must expect that a programmer might go, "Hmm, they're trimming whitespace, wonder what happens if I put 20,000 Unicode whitespace characters in there instead of normal whitespace?"
That's the meaning of "runaway" - Notepad++ had a search&replace that went into a somewhat random, long and uninterruptible loop if you were replacing using some types of regex and searching forward in the file - you had to search backwards.
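To get a feel for why a long whitespace run hurts, here is a toy step counter. It is only a model of how a naive backtracking matcher might handle an assumed pattern like `\s+$` on a run of spaces followed by one other character; it is not the actual pattern from the incident and not how any particular engine is implemented.

```python
def backtracking_steps(run_length: int) -> int:
    """Count checks made by a toy backtracking model of the pattern \\s+$
    on `run_length` spaces followed by a single non-space character."""
    text = " " * run_length + "x"
    steps = 0
    for start in range(len(text)):
        # Greedy part: consume whitespace from this start position.
        end = start
        while end < len(text):
            steps += 1  # one character test
            if not text[end].isspace():
                break
            end += 1
        # Backtracking part: try the end-of-string anchor after each
        # shorter match; it always fails because of the trailing "x".
        for _stop in range(end, start, -1):
            steps += 1  # one failed anchor test
    return steps


for n in (10, 100, 1000):
    print(n, backtracking_steps(n))
```

In this model the count works out to (n + 1)^2 for a run of n spaces, so ten times the whitespace costs about a hundred times the work, and a run of 20,000 would be roughly 400 million checks.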
This should be the title of a book on software engineering.