
HTML parsing is not that hard compared to CSS, layout, and fonts, a competitive JavaScript engine, and the myriad APIs and site compatibility problems OP talked about.

My HTML parser uses SGML, which is more generic: it takes the HTML grammar (a DTD) as a parameter and computes state machine tables etc. dynamically from it. That makes it a bit harder, but still very much doable.



Does that HTML parser follow all the HTML5 parsing/error-handling rules, so that it conforms to the spec's behavior for random tag soup full of broken markup? Or are you assuming "clean" HTML?


No, it follows the normative description of HTML as specified in chapter 4 of the HTML spec. The redundant procedural spec for parsing HTML is aimed strictly at browser implementers, in particular to reach the same behaviour across browsers in the presence of errors. Note, though, that the covered fragment still contains the rich tag omission/inference rules for HTML and other minute details, based on formal SGML techniques.
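
To make "tag omission/inference driven by a grammar" concrete, here is a toy sketch in Python. It is not the parser discussed above and not real SGML: the `GRAMMAR` table, the token format, and the `normalize` function are all hypothetical stand-ins, and a real DTD uses full content models rather than flat sets of allowed children. The point is only that the element rules are passed in as data, and missing end tags are inferred from them instead of from hard-coded HTML knowledge.

```python
# Hypothetical stand-in for a DTD: which children each element accepts,
# and whether its end tag may be omitted. Real content models are richer.
GRAMMAR = {
    "ul": {"children": {"li"}, "end_optional": False},
    "li": {"children": {"p", "ul", "#text"}, "end_optional": True},
    "p": {"children": {"#text"}, "end_optional": True},
}


def normalize(tokens, grammar):
    """Return the token stream with omitted end tags made explicit.

    tokens is a list of ("start", name), ("end", name) or ("text", data).
    Raises ValueError when the grammar cannot account for the input.
    """
    stack, out = [], []

    def close_until(accepts):
        # Close open elements until the innermost one satisfies `accepts`.
        # Only elements whose end tag is omissible may be closed implicitly.
        while stack and not accepts(stack[-1]):
            if not grammar[stack[-1]]["end_optional"]:
                raise ValueError(f"cannot infer end tag for <{stack[-1]}>")
            out.append(("end", stack.pop()))

    for kind, value in tokens:
        if kind == "start":
            close_until(lambda top: value in grammar[top]["children"])
            stack.append(value)
            out.append(("start", value))
        elif kind == "text":
            close_until(lambda top: "#text" in grammar[top]["children"])
            out.append(("text", value))
        elif kind == "end":
            close_until(lambda top: top == value)
            if not stack:
                raise ValueError(f"stray end tag </{value}>")
            out.append(("end", stack.pop()))

    # At end of input, everything still open must be implicitly closable.
    close_until(lambda top: False)
    return out


# <ul><li><p>a<li>b</ul> with the </p> and </li> end tags omitted
tokens = [
    ("start", "ul"), ("start", "li"), ("start", "p"), ("text", "a"),
    ("start", "li"), ("text", "b"), ("end", "ul"),
]
for event in normalize(tokens, GRAMMAR):
    print(event)
```

Swapping in a different table changes what gets inferred without touching `normalize`, which is the sense in which the grammar is a parameter. Input the table cannot explain is rejected with an error rather than repaired, so this does nothing for tag soup.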



