
Would be nice to have a regex for parsing HTML...

grabs popcorn



Easy with a sufficiently powerful engine: https://stackoverflow.com/a/4234491

Relies on (?(DEFINE)...): http://p3rl.org/perlre#(DEFINE)


There is a good comment on that answer:

> To sum up: RegEx's are misnamed. I think it's a shame, but it won't change. Compatible 'RegEx' engines are not allowed to reject non-regular languages. They therefore cannot be implemented correctly with only Finite State Machines. The powerful concepts around computational classes do not apply. Use of RegEx's does not ensure O(n) execution time. The advantages of RegEx's are terse syntax and the implied domain of character recognition. To me, this is a slow moving train wreck, impossible to look away, but with horrible consequences unfolding.
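To make the "non-regular languages" point concrete, here is a minimal sketch using a backreference in Python's standard `re` module. The language a^n b a^n (some a's, one b, then the same number of a's) is not regular, so no finite state machine recognizes it, yet the engine accepts the pattern. The names `PATTERN` and `is_mirrored` are made up for this illustration.

```python
import re

# Backreference pattern: a run of a's, one b, then the SAME run again.
# The language { a^n b a^n } is not regular, so a pure finite state
# machine cannot recognize it; a backtracking engine can.
PATTERN = re.compile(r"(a*)b\1")


def is_mirrored(s: str) -> bool:
    """Return True when s has the form a^n b a^n (hypothetical helper)."""
    return PATTERN.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["b", "aba", "aaabaaa", "aabaaa", "aab"]:
        print(repr(candidate), is_mirrored(candidate))
```

The price of that extra power is that matching falls back to backtracking, which is where the loss of the O(n) guarantee comes from.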


With subroutines and recursive patterns, I think you could write something that parses valid HTML.

Your sanity won't be left intact tho.


How about this one, for sanity: match "A B C" where A+B=C [1]?

[1] http://www.drregex.com/2018/11/how-to-match-b-c-where-abc-be...


boom. https://regex101.com/r/PxSY4U/1 technically it does parse it. :P


Nope. <h1 class="foo>bar">My First Heading</h1> will misparse. (This is valid HTML 5.) You really need recursive regex or something equivalent in power, otherwise you will always fail.
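A rough sketch of that failure mode, in Python. The pattern `NAIVE_TAG` below is a stand-in I made up, not the one behind the regex101 link; it just assumes a tag ends at the first `>`. The standard library's `html.parser` is shown next to it for comparison. `AttrCollector`, `naive_first_tag` and `parsed_start_tags` are hypothetical names.

```python
import re
from html.parser import HTMLParser

SAMPLE = '<h1 class="foo>bar">My First Heading</h1>'

# Hypothetical naive pattern: tag name, then everything up to the first '>'.
# It has no notion of quoted attribute values.
NAIVE_TAG = re.compile(r"<(\w+)([^>]*)>")


class AttrCollector(HTMLParser):
    """Collects (tag, attrs) pairs for every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def naive_first_tag(html: str):
    """Return (tag name, raw attribute text) as the naive regex sees them."""
    m = NAIVE_TAG.match(html)
    return (m.group(1), m.group(2)) if m else None


def parsed_start_tags(html: str):
    """Return the start tags as reported by html.parser."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.start_tags


if __name__ == "__main__":
    print(naive_first_tag(SAMPLE))    # attribute text is cut off at the '>' inside the quotes
    print(parsed_start_tags(SAMPLE))  # class value keeps the '>'
```

The regex stops inside the attribute value and reports the attribute text as ` class="foo`, while the parser reports a `class` value of `foo>bar`. You can patch the pattern to skip over quoted strings, but the next edge case (comments, script bodies, nested structure) is waiting right behind it.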


Well yea, it's a joke...


Haha, careful. Someone might take this seriously.



