
"This regular expression has been replaced with a substring function."

This should be the title of a book on software engineering.



I've fixed so many bugs using regex, only to have to fix several bugs later.

My current stance is: avoid regex if at all possible. It turns out that many of the things we use regex for are possible without it. Oftentimes, .Substring, .IndexOf, and using LINQ over strings are sufficient.


Regexes are almost always a massive code smell. They should almost never be used: bad idea, bad implementation, hard to grok, hard to debug, hard to test, hard to spot.

Whoever came up with them has surely been given the same honorary place in hell as Jon Postel, who invented the utterly disastrous "be liberal in what you accept", which has plagued all web developers for the last 20 years.


> Regexes are almost always a massive code smell. They should almost never be used

Regular expressions are just a notation for expressing a finite-state automaton that recognizes a regular language: if that's the problem that you have, regular expressions are definitely the tool you want to be using. For instance, I recently made myself a small tool to compute the min/max/expected value of a dice roll; the notation for such a thing forms a regular language that can be expressed as follows:

    non_zero_digit ::= '1' | ... | '9'
    digit ::= '0' | non_zero_digit
    integer ::= non_zero_digit digit*
    modifier ::= '+' integer | '-' integer
    roll ::= integer 'd' integer modifier?
    
Converting this grammar to a regular expression is straightforward, and a regular expression was the correct tool to use.
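
For anyone curious, here is a rough sketch of what that conversion might look like in Python. This is my own reading of the grammar above, not the poster's actual tool: the names `ROLL`, `parse_roll` and `roll_stats` are made up for illustration, and I'm assuming "NdS+M" means N dice with S sides each plus a flat modifier M.

```python
import re

# integer ::= non_zero_digit digit*   (no leading zeros, per the grammar)
INTEGER = r"[1-9][0-9]*"

# roll ::= integer 'd' integer modifier?
# Group names are illustrative, not taken from the original tool.
ROLL = re.compile(
    rf"(?P<count>{INTEGER})d(?P<sides>{INTEGER})(?P<modifier>[+-]{INTEGER})?"
)


def parse_roll(text):
    """Return (count, sides, modifier), or None if text is not a valid roll."""
    m = ROLL.fullmatch(text)  # fullmatch: the whole string must be a roll
    if m is None:
        return None
    modifier = int(m.group("modifier") or 0)  # int() accepts "+2" and "-1"
    return int(m.group("count")), int(m.group("sides")), modifier


def roll_stats(text):
    """Return (min, max, expected value), assuming fair dice numbered 1..sides."""
    parsed = parse_roll(text)
    if parsed is None:
        raise ValueError(f"not a dice roll: {text!r}")
    count, sides, modifier = parsed
    minimum = count + modifier
    maximum = count * sides + modifier
    expected = count * (sides + 1) / 2 + modifier
    return minimum, maximum, expected
```

With this sketch, `roll_stats("3d6")` gives `(3, 18, 10.5)`, and strings with leading zeros such as `"0d6"` are rejected because the grammar's `integer` must start with a non-zero digit.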

I agree that regular expressions are often used in contexts where they should not be, especially when people start using back-references to recognize non-regular languages, but don't throw out a perfectly good tool because sometimes it is ill-suited.


Regexes are perfectly fine. Awful implementations that accept patterns that aren't regular expressions without complaint and provide little to no tools to look at underlying automata or to step through them during execution are the problem.

It's quite an amazing problem to have, honestly, because it really shouldn't be a problem to create a proper implementation. You can learn the theory behind regular expressions in a few hours and know pretty much everything there is to know.


How are they hard to test? Given input x, expect output y. It's one of the easiest things in the world to test.


The point of this post is that even though the regex behaves correctly (input x produces expected output y), you also need to consider performance constraints.

Without fuzzing, it's going to be pretty difficult to come up with enough test cases to thoroughly test a regex.


Wouldn't the same apply to a custom method? Why would code using a combination of indexOf(), substring() and trim() be any less foolproof on arbitrary data?


Yeah, testing what you expect in regexes is super easy. Edge case testing is not, though. That's exactly what happened in the post this discussion is about.


Where do you think you are, right now?


Where x can be any combination of letters of any length?


This is a load of BS. They are quite capable (and fast) tools in the right hands. And they are easily (and SHOULD be) tested, in any test suite.

As proof, I submit some email header parsing code which I rewrote as a well-commented Regex which was something like 300x faster than using the Mail gem: https://github.com/pmarreck/ruby-snippets/blob/master/header...

Once you know what to look for re: catastrophic backtracking, you know how to avoid it. This is called programmer skill.


>"be liberal in what you accept"

... is a part of the so-called unix philosophy, the CAUSE of the internet. The fact that many web developers forgot to become programmers is just a fun fact that changes nothing.


I'll write a trivial state machine over using regex any day of the week. They are easier to write and read (especially if complex), aren't a black box in terms of profiling and don't have surprising performance characteristics like regex does. Unless we're talking about a Thompson NFA (which isn't used by many modern stdlibs) or JS, regexes simply do not appear in my belt of solutions.


And when you actually need complex parsing, parser combinators can express it in a much more maintainable way.


"I had a problem, solved it with RegExp, now I have two problems"


Heh... as a Java developer, my favorite version of that joke is where you solve the problem with Java, and now you have an `AbstractProblemFactoryFactory`.


There's quite a nice history to that one:

http://regex.info/blog/2006-09-15/247

It dates back to August 12, 1997!

(Perhaps someone should encourage jwz to have a 20th birthday party for that at DNA?)


    This regular expression has been replaced with a substring function.
God I wish all my bugs were this easy to fix and deploy


I wish people would stop using regular expressions in situations where they can be replaced with a substring function.


Nick explained on Reddit why the regex was used[1]:

> While I can't speak for the original motivation from many moons ago, .Trim() still doesn't trim \u200c. It's useful in most cases, but not the complete strip we need here.

This would have probably been my train of thought (assuming that I consider regex to be a valid solution):

Trim() would have been the correct solution, were it not for that behavior. Substring is therefore the correct solution. Problem is, IndexOf only accepts a char array (not a set of some form, i.e. HashSet). You'd need to write the <Last>IndexOfNonWhitespace methods yourself. Use a regex and make sure that it doesn't backtrack, because it's expressive and regex "is designed to solve this type of problem." The real problem/solution here isn't substring, it's finding where to substring.

I consider regex too dangerous to use in any circumstance, but I can certainly see why someone would find it attractive at first.

[1]: https://www.reddit.com/r/programming/comments/4tt6ce/stack_e...


Oh totally. I assumed that unicode bs immediately. And anyone would make this mistake easily. That's the point -- gotta have it imprinted in the brains, that regexes are for finding things in files, not for your production code. I've used them myself, but I'd like to think that when I type that regex in I stop and think about whether I will be feeding raw user inputs into it.


Many times regexes are more clear and therefore less bug prone than any non regex alternative. They have their use. Even in production.


example please


Compressing multiple forms of non-unicode whitespace to single space. Used for cleaning text from input fields that often contains unwanted characters from copy/paste.

The regexp for this is simply \s+
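
A minimal sketch of that in Python, for the sake of a concrete example. The function name is made up, and passing `re.ASCII` is my reading of "non-unicode whitespace": without that flag, Python's `\s` also matches Unicode whitespace on `str` input.

```python
import re

# With re.ASCII, \s matches only space, \t, \n, \r, \f and \v.
_WHITESPACE_RUN = re.compile(r"\s+", re.ASCII)


def collapse_whitespace(text):
    """Replace every run of ASCII whitespace with a single space."""
    return _WHITESPACE_RUN.sub(" ", text)
```

So `collapse_whitespace("a \t\n b")` returns `"a b"`, while a no-break space (U+00A0) is left alone under this flag.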


What exactly is a substring function, and what makes it different than a regex?


My guess is they're searching for the first non-whitespace character, reverse-searching for the last non-whitespace character, and then using String.Substring to return only what's in the middle.

As to why they're not using String.Trim (https://msdn.microsoft.com/en-us/library/t97s7bs3(v=vs.110)....), maybe it's because String.Trim doesn't seem to know about the 200c whitespace character.
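
To make that guess concrete, here is a sketch of the search-from-both-ends approach, in Python rather than C#. It is not the actual Stack Overflow fix; `EXTRA_WHITESPACE`, `is_trimmable` and `trim` are hypothetical names, and treating U+200C as trimmable is an assumption taken from this thread.

```python
# Characters to trim in addition to what str.isspace() already covers.
# U+200C is included because the thread says the built-in trim skips it.
EXTRA_WHITESPACE = {"\u200c"}


def is_trimmable(ch):
    """True if ch should be stripped from the ends of a string."""
    return ch.isspace() or ch in EXTRA_WHITESPACE


def trim(text):
    """Strip trimmable characters from both ends in a single linear pass."""
    start = 0
    end = len(text)
    # Search forward for the first character to keep.
    while start < end and is_trimmable(text[start]):
        start += 1
    # Search backward for the last character to keep.
    while end > start and is_trimmable(text[end - 1]):
        end -= 1
    # Substring: everything between the two positions.
    return text[start:end]
```

Each character at the ends is looked at once, so the cost grows linearly with the amount of padding, whatever the input looks like. Whitespace in the middle of the string is never examined.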


From what I understood, trim would work perfectly. It's 200 spaces not a single 200 width character.


You're misreading the regex. \u200c is a single whitespace character. http://www.fileformat.info/info/unicode/char/200c/index.htm


But that's a weird character to put in a comment line! I don't get how this would happen accidentally.


You really can't think this way when accepting input from users.


What difference does it make if it got in there accidentally or on purpose?

Stackoverflow is a programmers' site; you must expect that a programmer might go, "Hmm, they're trimming whitespace, wonder what happens if I put 20,000 unicode whitespace characters in there instead of normal whitespace?"
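
Here is a rough model of why that input hurts. It is a simplification, not how any particular regex engine works internally: I'm assuming a trailing-whitespace pattern along the lines of `\s+$` and a naive backtracking matcher that retries from every start position. The function name is made up.

```python
def backtracking_steps(text):
    """Count whitespace characters consumed by a naive matcher for a
    trailing-whitespace pattern, trying every start position in turn."""
    steps = 0
    n = len(text)
    for start in range(n):
        i = start
        # Greedily consume the whitespace run beginning at this position.
        while i < n and text[i].isspace():
            i += 1
            steps += 1
        # Success only if the run is non-empty and reaches the end of text.
        if i > start and i == n:
            return steps
        # Otherwise the attempt fails and the matcher moves one position on,
        # re-reading most of the same run on the next attempt.
    return steps
```

In this model, a run of n whitespace characters followed by one non-whitespace character costs n(n+1)/2 steps: 5,050 for n = 100, and about 200 million for n = 20,000. The same run at the very end of the string matches on the first attempt and costs only n steps.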


Runaway automatic search-and-replace? There's no way to distinguish intent.


Runaway search and replace won't put a single 200 width whitespace character AFAICT


That's the meaning of "runaway" - Notepad++ had a search&replace that went into a somewhat random, long and uninterruptible loop if you were replacing using some types of regex and searching forward in the file - you had to search backwards.


Also, removing spaces from the start and end of a line is a fairly standard operation which is provided by every language's built in library.


Although, many of them do not handle non-ASCII Unicode whitespace characters (which is what the StackOverflow regex was going for).


Here, the application's notion of whitespace was more comprehensive than the standard library's.



