"This regular expression has been replaced with a substring function." This shou...

judah · on July 20, 2016

I've fixed so many bugs using regex, only to have to fix several bugs later.

My current stance is, avoid regex if at all possible. Turns out, many of the things we use regex for are possible without it. Oftentimes, .Substring, .IndexOf, and LINQ over strings are sufficient.

mattmanser · on July 20, 2016

Regexes are almost always a massive code smell. They should almost never be used: bad idea, bad implementation, hard to grok, hard to debug, hard to test, hard to spot.

Whoever came up with them has surely been given the same honorary place in hell as Jon Postel, who invented the utterly disastrous "be liberal in what you accept", which has plagued all web developers for the last 20 years.

gnuvince · on July 21, 2016

> Regexes are almost always a massive code smell. They should almost never be used

Regular expressions are just a notation for expressing a finite-state automaton that recognizes a regular language: if that's the problem that you have, regular expressions are definitely the tool you want to be using. For instance, I recently made myself a small tool to compute the min/max/expected value of a dice roll; the notation for such a thing forms a regular language that can be expressed as follows:

    non_zero_digit ::= '1' | ... | '9'
    digit ::= '0' | non_zero_digit
    integer ::= non_zero_digit digit*
    modifier ::= '+' integer | '-' integer
    roll ::= integer 'd' integer modifier?

Converting this grammar to a regular expression is straightforward, and a regular expression was the correct tool to use.
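As a rough illustration of that conversion, here is a minimal Python sketch. The names `ROLL_RE`, `parse_roll`, and `roll_stats` are made up for this example and are not from the tool described above; the expected-value formula assumes fair dice numbered 1 through `sides`.

```python
import re

# Direct transcription of the grammar:
#   integer  ::= non_zero_digit digit*        -> [1-9][0-9]*
#   modifier ::= '+' integer | '-' integer    -> [+-][1-9][0-9]*
#   roll     ::= integer 'd' integer modifier?
INTEGER = r"[1-9][0-9]*"
ROLL_RE = re.compile(
    rf"(?P<count>{INTEGER})d(?P<sides>{INTEGER})(?P<modifier>[+-]{INTEGER})?"
)


def parse_roll(text):
    """Return (count, sides, modifier), or None if text is not a valid roll."""
    match = ROLL_RE.fullmatch(text)  # fullmatch: the whole string must be a roll
    if match is None:
        return None
    modifier = int(match["modifier"]) if match["modifier"] else 0
    return int(match["count"]), int(match["sides"]), modifier


def roll_stats(text):
    """Return (min, max, expected) for a roll such as '3d6+2'."""
    parsed = parse_roll(text)
    if parsed is None:
        raise ValueError(f"not a dice roll: {text!r}")
    count, sides, modifier = parsed
    minimum = count + modifier
    maximum = count * sides + modifier
    # Assumes fair dice: each die averages (sides + 1) / 2.
    expected = count * (sides + 1) / 2 + modifier
    return minimum, maximum, expected
```

Because the pattern has no nested or overlapping quantifiers, there is nothing for a backtracking engine to explore on a failed match beyond a single pass over the input.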

I agree that regular expressions are often used in contexts where they should not, especially when people start using back-references to recognize non-regular languages, but don't throw out a perfectly good tool because sometimes it is ill-suited.

DasIch · on July 21, 2016

Regexes are perfectly fine. Awful implementations that accept patterns that aren't regular expressions without complaint and provide little to no tools to look at underlying automata or to step through them during execution are the problem.

It's quite an amazing problem to have honestly because it really shouldn't be a problem to create a proper implementation. You can learn the theory behind regular expressions in a few hours and know pretty much everything there is to know.

flukus · on July 21, 2016

How are they hard to test? Given input x, expect output y. It's one of the easiest things in the world to test.

Niksko · on July 21, 2016

The point of this post is that even though the regex behaves correctly (input x produces expected output y), you also need to consider performance constraints.

Without fuzzing, it's going to be pretty difficult to come up with enough test cases to thoroughly test a regex.

qw · on July 22, 2016

Wouldn't the same apply to a custom method? Why would code using a combination of indexOf(), substring() and trim() be any less foolproof on arbitrary data?

mattmanser · on July 21, 2016

Yeah, testing what you expect in regexes is super easy. Edge case testing is not, though. That is exactly what happened in the post this discussion is about.

hinkley · on July 21, 2016

Where do you think you are, right now?

wpears · on July 21, 2016

Where x can be any combination of letters of any length?

pmarreck · on July 21, 2016

This is a load of BS. They are quite capable (and fast) tools in the right hands. And they are easily (and SHOULD be) tested, in any test suite.

As proof, I submit some email header parsing code which I rewrote as a well-commented Regex which was something like 300x faster than using the Mail gem: https://github.com/pmarreck/ruby-snippets/blob/master/header...

Once you know what to look for re: catastrophic backtracking, you know how to avoid it. This is called programmer skill.

wruza · on July 22, 2016

>"be liberal in what you accept"

... is a part of so-called unix philosophy, the CAUSE of the internet. The fact that many web developers forgot to become a programmer is just a fun fact changing nothing.

zamalek · on July 21, 2016

I'll write a trivial state machine over using regex any day of the week. They are easier to write and read (especially if complex), aren't a black box in terms of profiling and don't have surprising performance characteristics like regex does. Unless we're talking about a Thompson NFA (which isn't used by many modern stdlibs) or JS, regexes simply do not appear in my belt of solutions.
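For comparison, a hand-written state machine for the dice-roll notation from earlier in the thread (`integer 'd' integer modifier?`) might look like the following Python sketch. The state names and the `is_roll` function are invented for illustration; whether this reads better than the equivalent regex is the judgment call being argued here.

```python
NON_ZERO = set("123456789")
DIGITS = set("0123456789")


def is_roll(text):
    """Recognize: integer 'd' integer (('+' | '-') integer)?"""
    # "*_start" means the next character must begin an integer;
    # "*_rest" means we are inside an integer and may see more digits.
    state = "count_start"
    for ch in text:
        if state in ("count_start", "sides_start", "mod_start"):
            if ch not in NON_ZERO:  # integers may not start with '0'
                return False
            state = state.replace("start", "rest")
        elif state == "count_rest":
            if ch == "d":
                state = "sides_start"
            elif ch not in DIGITS:
                return False
        elif state == "sides_rest":
            if ch in "+-":
                state = "mod_start"
            elif ch not in DIGITS:
                return False
        else:  # state == "mod_rest"
            if ch not in DIGITS:
                return False
    # Accept only if we stopped inside the sides or the modifier integer.
    return state in ("sides_rest", "mod_rest")
```

Each character is examined exactly once, so the running time is linear in the input length by construction.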

lmm · on July 21, 2016

And when you actually need complex parsing, parser combinators can express it in a much more maintainable way.

msoad · on July 20, 2016

"I had a problem, solved it with RegExp, now I have two problems"

StevePerkins · on July 20, 2016

Heh... as a Java developer, my favorite version of that joke is where you solve the problem with Java, and now you have an `AbstractProblemFactoryFactory`.

bigiain · on July 21, 2016

There's quite a nice history to that one:

http://regex.info/blog/2006-09-15/247

It dates back to August 12, 1997!

(Perhaps someone should encourage jwz to have a 20th birthday party for that at DNA?)

AceJohnny2 · on July 20, 2016

    This regular expression has been replaced with a substring function.

God I wish all my bugs were this easy to fix and deploy

meshko · on July 21, 2016

I wish people would stop using regular expressions in situations where they can be replaced with a substring function.

zamalek · on July 21, 2016

Nick explained on Reddit why the regex was used[1]:

> While I can't speak for the original motivation from many moons ago, .Trim() still doesn't trim \u200c. It's useful in most cases, but not the complete strip we need here.

This would have probably been my train of thought (assuming that I consider regex to be a valid solution):

Trim() would have been the correct solution, were it not for that behavior. Substring is therefore the correct solution. Problem is, IndexOf only accepts a char array (not a set of some form, i.e. HashSet). You'd need to write the <Last>IndexOfNonWhitespace methods yourself. Use a regex and make sure that it doesn't backtrack, because it's expressive and regex "is designed to solve this type of problem." The real problem/solution here isn't substring, it's finding where to substring.
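A hand-rolled version of the "find where to substring" step could look roughly like this. It is a Python sketch rather than C#, and the function names and the strip set (`str.isspace()` plus U+200C) are assumptions made for illustration, not Stack Overflow's actual code.

```python
# Assumption: the characters to strip are whatever str.isspace() accepts,
# plus U+200C, which the quoted comment says Trim() leaves in place.
EXTRA_STRIP_CHARS = {"\u200c"}


def is_strippable(ch):
    return ch.isspace() or ch in EXTRA_STRIP_CHARS


def index_of_non_whitespace(text):
    """Index of the first character to keep, or -1 if there is none."""
    for i, ch in enumerate(text):
        if not is_strippable(ch):
            return i
    return -1


def last_index_of_non_whitespace(text):
    """Index of the last character to keep, or -1 if there is none."""
    for i in range(len(text) - 1, -1, -1):
        if not is_strippable(text[i]):
            return i
    return -1


def trim(text):
    """Strip leading and trailing strippable characters with one slice."""
    start = index_of_non_whitespace(text)
    if start == -1:
        return ""  # nothing but strippable characters
    end = last_index_of_non_whitespace(text)
    return text[start:end + 1]
```

Each scan stops at the first character it needs to keep, so the two loops together visit every character at most once, even for input such as 20,000 spaces followed by a single letter.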

I consider regex too dangerous to use in any circumstance, but I can certainly see why someone would find it attractive at first.

[1]: https://www.reddit.com/r/programming/comments/4tt6ce/stack_e...

meshko · on July 21, 2016

Oh totally. I assumed that unicode bs immediately. And anyone would make this mistake easily. That's the point -- gotta have it imprinted in the brains, that regexes are for finding things in files, not for your production code. I've used them myself, but I'd like to think that when I type that regex in I stop and think whether I will be feeding raw user inputs into it.

Scea91 · on July 21, 2016

Many times regexes are more clear and therefore less bug prone than any non regex alternative. They have their use. Even in production.

meshko · on July 22, 2016

example please

qw · on July 22, 2016

Compressing multiple forms of non-unicode whitespace to single space. Used for cleaning text from input fields that often contains unwanted characters from copy/paste.

The regexp for this is simply \s+
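In Python, for instance, that one-liner might be sketched as follows. The function name is hypothetical, and note that Python's `\s` on `str` patterns also matches Unicode whitespace such as U+00A0, which may be broader than what is meant here.

```python
import re

WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return WHITESPACE_RUN.sub(" ", text)
```

The pattern is a single quantified character class with nothing after it, so there is no alternative path for the engine to retry when a run ends.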

lilbobbytables · on July 20, 2016

What exactly is a substring function, and what makes it different than a regex?

solipsism · on July 21, 2016

My guess is they're searching for the first non-whitespace character, reverse-searching for the last non-whitespace character, and then using String.Substring to return only what's in the middle.

As to why they're not using String.Trim (https://msdn.microsoft.com/en-us/library/t97s7bs3(v=vs.110)....), maybe it's because String.Trim doesn't seem to know about the 200c whitespace character.

dingo_bat · on July 21, 2016

From what I understood, trim would work perfectly. It's 200 spaces not a single 200 width character.

solipsism · on July 21, 2016

You're misreading the regex. \u200c is a single whitespace character. http://www.fileformat.info/info/unicode/char/200c/index.htm
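A quick way to check what U+200C is, using Python's standard `unicodedata` module as a stand-in for the .NET behavior discussed here (the two runtimes are not guaranteed to agree on every character, so treat this as a sketch):

```python
import unicodedata

ZWNJ = "\u200c"

# One code point, not a run of 200 of anything.
length = len(ZWNJ)

# Official Unicode name and general category of the character.
name = unicodedata.name(ZWNJ)
category = unicodedata.category(ZWNJ)

# Python's built-in notion of whitespace does not include it,
# so the built-in strip leaves it in place.
counts_as_space = ZWNJ.isspace()
stripped = ("x" + ZWNJ).strip()
```

Unicode classifies it as a format character (category `Cf`, ZERO WIDTH NON-JOINER) rather than as whitespace proper, which would explain why a standard trim function skips it.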

dingo_bat · on July 21, 2016

But that's a weird character to put in a comment line! I don't get how this would happen accidentally.

pyre · on July 21, 2016

You really can't think this way when accepting input from users.

brokenmachine · on July 28, 2016

What difference does it make if it got in there accidentally or on purpose?

Stack Overflow is a programmers' site; you must expect that a programmer might go, "Hmm, they're trimming whitespace, wonder what happens if I put 20,000 unicode whitespace characters in there instead of normal whitespace?"

Piskvorrr · on July 21, 2016

Runaway automatic search-and-replace? There's no way to distinguish intent.

dingo_bat · on July 21, 2016

Runaway search and replace won't put a single 200 width whitespace character AFAICT

Piskvorrr · on July 21, 2016

That's the meaning of "runaway" - Notepad++ had a search&replace that went into a somewhat random, long and uninterruptible loop if you were replacing using some types of regex and searching forward in the file - you had to search backwards.

dingo_bat · on July 21, 2016

Also, removing spaces from the start and end of a line is a fairly standard operation which is provided by every language's built in library.

curryhoward · on July 21, 2016

Although, many of them do not handle non-ASCII Unicode whitespace characters (which is what the StackOverflow regex was going for).

_pmf_ · on July 21, 2016

Here, the application's notion of whitespace was more comprehensive than the standard library's.